<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Alex Merced&apos;s Lakehouse Blog</title><description>The technical reference for Apache Iceberg and lakehouse catalogs.</description><link>https://iceberglakehouse.com/</link><item><title>The Complete Guide to Agentic Coding Tools in 2026</title><link>https://iceberglakehouse.com/posts/agentic-coding-tools/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/agentic-coding-tools/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-05-agentic-coding-tools/).

Agentic...</description><pubDate>Mon, 08 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-05-agentic-coding-tools/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Agentic coding tools have matured into four distinct categories that serve different developer workflows: CLI agents for terminal-first users, desktop IDEs for visual editing, 24/7 autonomous agents for async delegation, and model routers for intelligent resource allocation. That is the useful lens for agentic coding tools in June 2026. The market is not short on announcements. What matters is whether the new pattern changes ownership, performance, governance, and agent readiness in a way your team can operate.&lt;/p&gt;
&lt;p&gt;The terminal is back. Not the green phosphor CRT kind, but the ethos. In 2026, the most interesting work in developer tooling happens at a command prompt, inside an IDE panel, or through a chat app you already have open. Agentic coding tools have exploded from a handful of experimental projects into a full ecosystem with hundreds of options, billions of API calls per month, and a pace of change that makes last year&apos;s roundups feel like ancient history.&lt;/p&gt;
&lt;p&gt;I track this space obsessively, across four distinct categories. Each solves a different problem. Each has its own tradeoffs. Here is the breakdown.&lt;/p&gt;
&lt;h2&gt;Coding CLI Agents: The Terminal Renaissance&lt;/h2&gt;
&lt;p&gt;The command line never went away, but it spent a decade playing second fiddle to graphical IDEs. That changed in late 2024 when Claude Code (then Claude Engineer) showed what a terminal-native agent could do. Now CLI coding agents are the fastest-growing segment of the developer tools market.&lt;/p&gt;
&lt;h3&gt;Claude Code&lt;/h3&gt;
&lt;p&gt;Anthropic&apos;s flagship coding agent runs entirely in your terminal. It reads your repo, writes files, runs shell commands, manages git branches, and opens pull requests. The 1 million token context window lets it hold your entire codebase in memory. Claude Code uses roughly 5.5 times fewer tokens than equivalent Cursor sessions, which matters when you are paying per token.&lt;/p&gt;
&lt;p&gt;It scores 80.9 percent on SWE-Bench Verified, the highest of any publicly available agent. The hook and plugin system lets you wire in custom validators, linters, or deployment scripts that fire before every commit. It costs $17-20 a month on Pro or $100-200 on Max. The caveat: you are locked into Claude. You cannot swap in another model.&lt;/p&gt;
&lt;h3&gt;OpenCode&lt;/h3&gt;
&lt;p&gt;With over 140,000 GitHub stars, OpenCode is the open-source alternative that refuses to be ignored. It supports 75-plus LLM providers through a unified adapter layer. Want Claude for reasoning and a local Qwen model for quick edits? OpenCode handles that. It runs multi-session workflows, has a plugin system called &amp;quot;SLIM,&amp;quot; and operates locally so your code never touches a server unless you want it to.&lt;/p&gt;
&lt;p&gt;The project moves fast, and that speed comes with occasional breakage. But for developers who want maximum model flexibility without vendor lock-in, OpenCode is the default choice.&lt;/p&gt;
&lt;h3&gt;OpenAI Codex CLI&lt;/h3&gt;
&lt;p&gt;Codex returned in 2025 as a lightweight, local-first agent tied to your ChatGPT subscription. It authenticates through your existing OpenAI account, so there is no separate billing. The cloud sandbox execution mode runs code in ephemeral environments, and the autonomous agent mode can work through multi-step tasks without handholding.&lt;/p&gt;
&lt;p&gt;Codex has extensions for VS Code, Cursor, and Windsurf, making it a hybrid between pure CLI and IDE integration. Its biggest weakness is model lock-in. You need GPT-5 series models to use it, though that also means you get the latest OpenAI capabilities the day they ship.&lt;/p&gt;
&lt;h3&gt;Aider&lt;/h3&gt;
&lt;p&gt;Aider is the veteran of the category, with 39,000 GitHub stars, 4.1 million installations, and 15 billion tokens processed per week. It auto-commits to git with sensible commit messages, works with over 100 languages, and supports Claude, GPT, DeepSeek, and local models via Ollama.&lt;/p&gt;
&lt;p&gt;The voice-to-code feature is surprisingly useful. Dictating &amp;quot;refactor this function to use async/await&amp;quot; while scrolling through code feels faster than typing it. Aider remains the gold standard for terminal pair programming, and it is completely free and open source.&lt;/p&gt;
&lt;h3&gt;Pi&lt;/h3&gt;
&lt;p&gt;Pi (pi.dev) positions itself as a security-first terminal agent. It runs in a sandboxed environment with granular file system permissions. Every tool call must be explicitly approved unless you configure trust rules. Pi is built for teams that need compliance without sacrificing agent capability.&lt;/p&gt;
&lt;p&gt;It supports multi-turn autonomous sessions, can browse the web, read documentation, and execute code in isolated containers. The tradeoff is speed. Approval toggles add friction compared to fully autonomous agents like Claude Code.&lt;/p&gt;
&lt;h3&gt;Goose&lt;/h3&gt;
&lt;p&gt;Goose started as an internal tool at Block (Square) and open-sourced under Apache 2.0. It transitioned to foundation governance under the Linux Foundation&apos;s Agentic AI initiative in early 2026, which gives it a neutrality that other projects lack.&lt;/p&gt;
&lt;p&gt;Goose is MCP-extensible, meaning any tool that speaks the Model Context Protocol can plug into it. It runs full development workflows: plan, code, test, commit, and is genuinely model-agnostic. The desktop companion app gives you a GUI without losing the CLI&apos;s power.&lt;/p&gt;
&lt;h3&gt;Gemini CLI&lt;/h3&gt;
&lt;p&gt;Google&apos;s entry is open source and offers the most generous free tier in the category: 1,000 requests per day with a Google account. That is effectively unlimited for most developers. The 1 million token context window matches Claude Code, and built-in web search grounding lets the agent pull documentation live.&lt;/p&gt;
&lt;p&gt;Gemini CLI supports conversation checkpointing, so you can pause a session and resume it later. The model router automatically picks Gemini 2.5 Pro for complex reasoning and Gemini 2.5 Flash for quick tasks. If Google keeps this free tier, it will be hard to beat for experimentation and learning.&lt;/p&gt;
&lt;h3&gt;GitHub Copilot CLI&lt;/h3&gt;
&lt;p&gt;The GitHub Copilot CLI emerged from public preview in 2026 and integrates deeply with the GitHub ecosystem. It references issues, browses pull requests, manages repos, and supports MCP tools. The default model is Claude Sonnet 4.5, but you can switch to GPT-5.&lt;/p&gt;
&lt;p&gt;The free tier gives 50 premium requests per month. Full access requires a Copilot subscription at $10-39 per seat. For teams already living inside GitHub, the integration is unmatched. For everyone else, the model flexibility of OpenCode or the cost of Gemini CLI looks better.&lt;/p&gt;
&lt;h3&gt;Amp&lt;/h3&gt;
&lt;p&gt;Sourcegraph&apos;s Amp offers a &amp;quot;deep mode&amp;quot; that uses GPT-5.2-Codex for extended autonomous research and implementation. It has composable subagents: Oracle for code analysis, Librarian for external library research, and Painter for image generation.&lt;/p&gt;
&lt;p&gt;The pricing is unusual. Amp is free, ad-supported, with a $10 per day API cost cap. Sourcegraph claims they add no markup on API costs, which makes Amp one of the most transparently priced tools on the market.&lt;/p&gt;
&lt;h3&gt;Warp&lt;/h3&gt;
&lt;p&gt;Warp is a full terminal replacement written in Rust with GPU acceleration. It runs multiple agents simultaneously: you can have Claude Code, Codex, and Gemini CLI all working in split panes. The built-in file editor and code review panel eliminate the need to alt-tab to an IDE.&lt;/p&gt;
&lt;p&gt;Warp claims its agent ships over 50 percent of its own pull requests. The WARP.md project configuration file lets you define project-specific agent behaviors. It is the right tool for developers who basically live in their terminal and want an all-in-one environment.&lt;/p&gt;
&lt;h3&gt;Augment CLI&lt;/h3&gt;
&lt;p&gt;Augment&apos;s enterprise context engine indexes your entire codebase: source code, dependencies, architecture, git history, even Slack threads about the code. The CLI agent uses this context to produce more accurate changes with fewer hallucinated imports.&lt;/p&gt;
&lt;p&gt;Augment scored first on SWE-Bench Pro and counts MongoDB, Spotify, and Webflow as customers. It is the most expensive option in this category, but for large codebases where context quality determines success, the cost is justified.&lt;/p&gt;
&lt;h3&gt;Roo Code / Kilo Code&lt;/h3&gt;
&lt;p&gt;Roo Code (formerly Roo Cline) and Kilo Code (formerly Kilocode) are both VS Code extensions that function as standalone CLI agents. Roo Code has a reputation for reliability on large multi-file changes -- &amp;quot;when other agents break down, use Roo&amp;quot; is a common sentiment.&lt;/p&gt;
&lt;p&gt;Kilo Code supports 500-plus models across 60-plus providers, has an orchestrator mode that breaks complex tasks into subagent workflows, and offers full transparency by showing every token and cost in real time. Both operate on pay-as-you-go pricing.&lt;/p&gt;
&lt;h3&gt;Crush&lt;/h3&gt;
&lt;p&gt;Crush runs on the Charm license and differentiates itself through cross-platform support that includes Android. You can run a coding agent on your phone. Mid-session model switching lets you start with an expensive reasoning model and swap to a cheaper execution model for the mechanical parts of the task. Granular permissions control which files and commands each session can access.&lt;/p&gt;
&lt;h3&gt;Kimi Code CLI&lt;/h3&gt;
&lt;p&gt;Moonshot AI&apos;s entry into the CLI agent category uses the Kimi K2.5 model, which achieves 84.34 percent on MMMU (beating Claude Opus 4.6 on multimodal reasoning). The CLI supports 100-agent swarm capability, meaning you can spin up a hundred agents to work on different parts of a codebase in parallel. This is overkill for most projects, but for massive refactors, it is something no other CLI agent offers.&lt;/p&gt;
&lt;h3&gt;Forge Code&lt;/h3&gt;
&lt;p&gt;Forge Code is a relative newcomer that focuses on agentic CI/CD pipelines. It generates code directly inside your GitHub Actions or GitLab CI workflows. When a test fails, Forge Code analyzes the failure, writes a fix, runs tests again, and commits the fix if everything passes. It is the only CLI agent designed to run inside CI rather than on your local machine.&lt;/p&gt;
&lt;h3&gt;Qwen Code&lt;/h3&gt;
&lt;p&gt;Alibaba&apos;s Qwen Code offers a completely free API, which is remarkable for a tool that scores around 70.6 percent on SWE-Bench. The 1 million token context window matches Claude Code. The catch is availability -- the free API has rate limits, and while Alibaba is clearly subsidizing it for market share, nobody knows how long that will last. For experimentation and learning, it is unbeatable value.&lt;/p&gt;
&lt;h3&gt;T3 Code&lt;/h3&gt;
&lt;p&gt;T3 Code is the free, open-source agent built on the T3 stack philosophy. It is designed for developers who want a working agent without paying for API keys or subscriptions. The tradeoff is that it defaults to local models, which means slower responses and lower capability compared to cloud-backed agents. For solo developers on a budget, T3 Code is worth a look.&lt;/p&gt;
&lt;h3&gt;iFlow&lt;/h3&gt;
&lt;p&gt;iFlow is a CLI agent built around the concept of SubAgents with controlled file permissions. You define which parts of your filesystem each subagent can read and write. This makes it suitable for monorepos where you want agents working on different packages to stay in their lanes. The permission system is more granular than anything in the category except Pi.&lt;/p&gt;
&lt;h3&gt;Amazon Q Developer CLI&lt;/h3&gt;
&lt;p&gt;Amazon Q Developer offers a free tier that is generous for AWS-heavy workflows. The CLI agent understands AWS services natively and can generate infrastructure code, debug Lambda functions, and query CloudWatch logs without you needing to context-switch. Outside of AWS, it is competent but not best-in-class.&lt;/p&gt;
&lt;h2&gt;UI-Based Tools: Desktop IDEs and Apps&lt;/h2&gt;
&lt;p&gt;Not everyone wants to live in the terminal. The desktop IDE category has evolved from autocomplete copilots into full agentic platforms that can build features from scratch, run tests, deploy, and even debug production issues.&lt;/p&gt;
&lt;h3&gt;Cursor&lt;/h3&gt;
&lt;p&gt;Cursor remains the most popular AI-first IDE. Its tab completion quality is still the best in the industry, and the February 2026 update added Computer Use, letting agents control the desktop and browser for GUI testing. The background agent mode spins up an isolated Ubuntu VM, clones your repo, and works on a dedicated branch.&lt;/p&gt;
&lt;p&gt;A typical pull request costs around $4-5 in background agent compute. Cursor priced at $16 per month for the base plan. The community is enormous, which means more tutorials, more extensions, and more people to ask when something breaks.&lt;/p&gt;
&lt;h3&gt;Windsurf&lt;/h3&gt;
&lt;p&gt;Windsurf introduced &amp;quot;Flows,&amp;quot; a persistent context mechanism that keeps the agent aware of your work across sessions. Unlike Cursor, which starts fresh each time, Windsurf remembers what you were working on, what decisions you made, and why you made them.&lt;/p&gt;
&lt;p&gt;The price increased from $15 to $20 per month in March 2026, which caused some grumbling. Windsurf still offers the best continuous context experience, and its multi-model support lets you pick the best model for each task.&lt;/p&gt;
&lt;h3&gt;Antigravity&lt;/h3&gt;
&lt;p&gt;Google&apos;s Antigravity IDE takes a different approach. Instead of a single agent, it spawns parallel agents that work on different parts of the codebase simultaneously. One agent implements the API endpoint while another writes the tests and a third updates the documentation.&lt;/p&gt;
&lt;p&gt;Antigravity includes a built-in Chrome instance for testing, which means the agent can visually verify UI changes without human intervention. The Pro tier costs $20 per month, and Ultra with unlimited parallel agents runs $250. It is the most ambitious IDE in the market, and it shows.&lt;/p&gt;
&lt;h3&gt;Claude Desktop&lt;/h3&gt;
&lt;p&gt;Anthropic&apos;s desktop app wraps Claude Code in a graphical interface. You get the same 1 million token context, the same agent capabilities, and the same model, but with a GUI that shows file diffs, session history, and tool outputs in a readable format.&lt;/p&gt;
&lt;p&gt;Claude Desktop includes &lt;strong&gt;Dispatch&lt;/strong&gt;, a feature that lets you hand off long-running tasks to run in the background while you keep working. You tell Dispatch what needs done, and Claude picks up from where it left off whenever you reopen the app. It is not quite a 24/7 agent, but it is the closest thing to one that runs on your local machine. Close the laptop, reopen it later, and Dispatch resumes the task without you needing to re-explain anything.&lt;/p&gt;
&lt;p&gt;Claude Desktop integrates with your local file system and runs code directly on your machine. It is simpler than Cursor or Windsurf, but that simplicity is the point. You do not need to learn a new IDE to use it.&lt;/p&gt;
&lt;h3&gt;Codex Desktop&lt;/h3&gt;
&lt;p&gt;OpenAI&apos;s desktop application mirrors Claude Desktop but for the GPT-5 series models. It runs on macOS and Windows and lets non-engineers dispatch coding tasks through a chat interface. The cloud sandbox executes code remotely, so you do not need a development environment.&lt;/p&gt;
&lt;p&gt;Codex Desktop has its own version of background execution. You can kick off a task -- refactor a module, add tests, update documentation -- and switch to other work while the agent keeps running in the cloud. The results appear as a pull request when done. Combined with the ChatGPT Pro subscription, this makes Codex Desktop a strong contender for teams that want async coding without managing infrastructure.&lt;/p&gt;
&lt;h3&gt;GitHub Copilot in VS Code&lt;/h3&gt;
&lt;p&gt;Microsoft&apos;s Copilot evolved from autocomplete into a full coding agent inside VS Code. The &amp;quot;Agent Mode&amp;quot; can create files, edit code, run terminal commands, and fix linter errors without switching context. It supports multiple models including Claude Sonnet 4.5 and GPT-5.&lt;/p&gt;
&lt;p&gt;Copilot is the default choice for millions of VS Code users because it ships with the editor. No separate install, no new IDE to learn. The weakness is that it trails purpose-built tools like Cursor on complex multi-file refactors.&lt;/p&gt;
&lt;h3&gt;Continue.dev&lt;/h3&gt;
&lt;p&gt;Continue is the open-source IDE extension that works with both VS Code and JetBrains. With 26,000 GitHub stars, it is the only tool in this category with full cross-editor support. You bring your own models: local via Ollama, cloud via any provider, or a mix of both.&lt;/p&gt;
&lt;p&gt;The tab completion quality is improving, and the slash command system lets you define custom workflows. Continue is not as polished as Cursor, but it is the most flexible option for developers who refuse to switch editors.&lt;/p&gt;
&lt;h3&gt;Cline (VS Code Extension)&lt;/h3&gt;
&lt;p&gt;Cline is the most installed open-source coding extension with 5 million downloads. It operates on a human-in-the-loop model: every file change, terminal command, or browser action requires explicit approval. This sounds slow, but for production codebases, the safety net is worth the friction.&lt;/p&gt;
&lt;p&gt;Cline supports browser automation, checkpoint rollback (undo any agent action), and MCP tools. The checkpoints feature alone has saved me from regenerating files that an overeager agent mangles.&lt;/p&gt;
&lt;h3&gt;Kiro (Amazon)&lt;/h3&gt;
&lt;p&gt;Amazon&apos;s Kiro takes a spec-driven development approach. Before it writes any code, it converts your prompt into EARS notation requirements. The agent then implements against those requirements, creating an auditable trail from request to implementation.&lt;/p&gt;
&lt;p&gt;Kiro has agent hooks that automate follow-ups: run tests on save, deploy on green, rollback on red. The free tier is generous, and the per-prompt credit pricing means you only pay for what you use.&lt;/p&gt;
&lt;h3&gt;Zed&lt;/h3&gt;
&lt;p&gt;Zed is a Rust-native editor that prioritizes speed above everything else. It launches instantly, renders at 120 frames per second, and its AI features are woven into the editor rather than bolted on as an extension. The inline diffs and multi-cursor editing are the best in the business.&lt;/p&gt;
&lt;p&gt;Zed supports Claude, GPT, and local models. It is the fastest editor in the category, but its smaller community means fewer plugins and integrations. If raw speed matters more than ecosystem size, Zed wins.&lt;/p&gt;
&lt;h3&gt;Replit Agent&lt;/h3&gt;
&lt;p&gt;Replit&apos;s agent works entirely in the browser. You describe what you want to build, and the agent creates files, installs dependencies, configures hosting, and deploys. It is the only tool on this list that does not require a local development environment.&lt;/p&gt;
&lt;p&gt;The agent handles deployment automatically, which makes it the best option for prototyping and MVP building. It is less suited for complex production codebases where you need fine-grained control over infrastructure.&lt;/p&gt;
&lt;h3&gt;Mistral Vibe&lt;/h3&gt;
&lt;p&gt;Mistral&apos;s entry into the desktop IDE category uses their Devstral 2 model, which scored 77 percent on SWE-Bench when running autonomously. The source code is Apache 2.0 licensed, so you can inspect and modify it. Paid plans start at $15 per month through Le Chat Pro.&lt;/p&gt;
&lt;p&gt;Devstral 2 is a 123-billion-parameter dense transformer specialized for agentic coding. It is one of the few coding models that performs as well in local deployment as in cloud, which matters for teams with privacy requirements.&lt;/p&gt;
&lt;h3&gt;Tabnine&lt;/h3&gt;
&lt;p&gt;Tabnine predates the current agentic coding wave and has evolved from a completion engine into a full agent. It supports context-aware code generation across your entire project, not just the file you are editing. Tabnine can run fully offline if you use its self-hosted models, and enterprise deployments get code that never leaves your infrastructure.&lt;/p&gt;
&lt;p&gt;The completions are fast, often faster than Cursor&apos;s, but the agent mode is less capable than newer tools. For teams that value privacy above all else, Tabnine is still the strongest option.&lt;/p&gt;
&lt;h3&gt;Codeium (Windsurf base)&lt;/h3&gt;
&lt;p&gt;Codeium was the company behind Windsurf before rebranding, but the core Codeium platform persists as a separate product for teams that want AI-powered completions without switching IDEs. It supports over 40 IDEs and editors, which is more than any competitor.&lt;/p&gt;
&lt;p&gt;The agent mode is less autonomous than Windsurf or Cursor, but the multi-IDE support makes it the default choice for polyglot teams that use a mix of editors.&lt;/p&gt;
&lt;h3&gt;PearAI&lt;/h3&gt;
&lt;p&gt;PearAI is a fork of VS Code with AI features baked in. It wraps multiple agent backends (Claude Code, Codex, OpenAI) behind a single interface. You pick the backend for each task. The philosophy is that no single model is best for everything, so the tool should let you choose without switching editors.&lt;/p&gt;
&lt;p&gt;The setup is more involved than Cursor because you need API keys for each backend. For developers who already have multiple model subscriptions, PearAI consolidates them without forcing you to pick one.&lt;/p&gt;
&lt;h3&gt;Lovable&lt;/h3&gt;
&lt;p&gt;Lovable (formerly GPT Engineer) targets a different audience. It is designed for non-developers who want to build web applications by describing them in natural language. The agent generates the full application, deploys it, and gives you a URL to share.&lt;/p&gt;
&lt;p&gt;Lovable handles the entire lifecycle from idea to deployment. The generated code is production-quality but generic. You get a working app fast, and customizing it later requires understanding the codebase Lovable generated.&lt;/p&gt;
&lt;h3&gt;Bolt.new&lt;/h3&gt;
&lt;p&gt;StackBlitz&apos;s Bolt.new runs entirely in the browser. You describe an application, and Bolt.new creates files, installs dependencies, and deploys to a preview URL, all inside a web container. No local setup, no IDE download.&lt;/p&gt;
&lt;p&gt;Bolt.new is the fastest way to go from idea to running prototype. It is not designed for existing codebases or enterprise projects, but for validating an idea in minutes, nothing else comes close.&lt;/p&gt;
&lt;h3&gt;v0 by Vercel&lt;/h3&gt;
&lt;p&gt;Vercel&apos;s v0 started as a UI generation tool and expanded into full-stack application generation. You describe a component or page, and v0 generates React/Next.js code with Tailwind styling. The agent mode can create multi-page applications with routing and data fetching.&lt;/p&gt;
&lt;p&gt;v0 is optimized for the Vercel ecosystem. If you deploy on Vercel and use Next.js, the generated code integrates naturally. Outside that stack, some features break.&lt;/p&gt;
&lt;h3&gt;Galileo&lt;/h3&gt;
&lt;p&gt;Galileo is unique in this category because it is built for data scientists and ML engineers rather than application developers. It generates Python data pipelines, visualization code, and ML training scripts. The agent understands pandas, NumPy, scikit-learn, PyTorch, and Jupyter notebooks.&lt;/p&gt;
&lt;p&gt;Galileo can execute code inline and display charts and tables in the chat interface. For data teams, it fills a gap that general-purpose coding agents handle poorly.&lt;/p&gt;
&lt;h2&gt;24/7 Autonomous Agents: Your Codebase Never Sleeps&lt;/h2&gt;
&lt;p&gt;The most interesting shift in 2026 is the move from interactive pair programming to asynchronous delegation. These agents live in your chat apps, accept tasks while you are away, and deliver results when you check back.&lt;/p&gt;
&lt;h3&gt;OpenClaw&lt;/h3&gt;
&lt;p&gt;OpenClaw is the largest open-source agent runtime by adoption with 369,000 GitHub stars and 3.2 million active users. It runs on Node.js, bridges 7-plus messaging platforms (Telegram, Discord, Slack, WhatsApp, Signal, WeChat), and routes tasks to any LLM backend.&lt;/p&gt;
&lt;p&gt;The sub-agent orchestration via the Agent Client Protocol (ACP) lets OpenClaw dispatch coding work to Claude Code, Codex CLI, or Cursor as sub-agents. The ClawHub marketplace has 44,000 community skills. Need an agent that monitors your AWS bill and DMs you when costs spike? There is a skill for that.&lt;/p&gt;
&lt;p&gt;OpenClaw runs on a single &lt;code&gt;npx openclaw&lt;/code&gt; command or a DigitalOcean one-click droplet for about $24 per month. The ecosystem includes KiloClaw ($49 per month managed hosting), NemoClaw (NVIDIA enterprise container), and ZeroClaw (Rust reimplementation for performance).&lt;/p&gt;
&lt;p&gt;The weakness is that self-hosting carries operational burden, and skill quality in the marketplace varies widely. For a non-profit project with no corporate backer (the creator joined OpenAI in February 2026), the momentum is remarkable.&lt;/p&gt;
&lt;h3&gt;Hermes Agent&lt;/h3&gt;
&lt;p&gt;Hermes Agent from Nous Research launched in February 2026 and grew to 64,000 GitHub stars in three months. It is a Python-based, self-improving agent harness. Every time it solves a problem, it generates a skill document so it can reuse that approach later without being told.&lt;/p&gt;
&lt;p&gt;The persistent cross-session memory uses FTS5 session search and LLM-curated memory with periodic nudges. Hermes connects to Telegram, Discord, Slack, WhatsApp, and Signal. It runs on local, Docker, SSH, Singularity, Modal, Daytona, and Vercel Sandbox.&lt;/p&gt;
&lt;p&gt;What sets Hermes apart is the learning loop. It builds a deep profile of your preferences and work patterns using Honcho dialectic user modeling. Over time, it gets better at predicting what you want before you ask. The built-in &lt;code&gt;hermes claw migrate&lt;/code&gt; tool lets you import configs from OpenClaw, which has made the two projects more complementary than competitive.&lt;/p&gt;
&lt;h3&gt;NemoClaw&lt;/h3&gt;
&lt;p&gt;NVIDIA&apos;s enterprise variant of OpenClaw wraps the agent runtime in a hardened container with TensorRT-LLM optimized inference. Multi-GPU support distributes inference across NVIDIA hardware for larger models. Data never leaves your infrastructure.&lt;/p&gt;
&lt;p&gt;NemoClaw is the only option on this list with automatic quantization, batching, and caching built in. It requires NVIDIA GPUs, which limits adoption, but for organizations that already run on NVIDIA hardware, the inference performance is unmatched.&lt;/p&gt;
&lt;h3&gt;KiloClaw&lt;/h3&gt;
&lt;p&gt;KiloClaw is the managed hosting layer for OpenClaw at $49 per month. It handles the deployment, monitoring, and updates so you do not have to maintain the infrastructure yourself. The value proposition is simple: OpenClaw&apos;s capabilities without the operations overhead.&lt;/p&gt;
&lt;p&gt;For teams that want OpenClaw&apos;s integration breadth but lack the DevOps bandwidth, KiloClaw is the bridge. Fifty dollars per month for a fully managed agent gateway is cheap compared to the engineering time needed to self-host.&lt;/p&gt;
&lt;h3&gt;AutoGen (Microsoft)&lt;/h3&gt;
&lt;p&gt;Microsoft&apos;s AutoGen framework takes a different approach. Instead of a single agent runtime, it is a multi-agent conversation framework where specialized agents collaborate on tasks. You define agents with different roles, tools, and models, and AutoGen manages the conversation flow between them.&lt;/p&gt;
&lt;p&gt;AutoGen is less turnkey than OpenClaw or Hermes. You write code to define agent behavior. But for complex workflows where different agents need different capabilities, it offers the most flexibility. The ecosystem includes templates for common patterns: code generation agent plus review agent plus test agent.&lt;/p&gt;
&lt;h3&gt;CrewAI&lt;/h3&gt;
&lt;p&gt;CrewAI is similar to AutoGen but opinionated toward role-based agent crews. You define a crew with a manager and workers, each with specific responsibilities and tools. The manager agent decomposes tasks and assigns them to workers.&lt;/p&gt;
&lt;p&gt;CrewAI is easier to get started with than AutoGen because the role abstraction maps naturally to how teams think about work. The tradeoff is less control over conversation dynamics. For straightforward delegation patterns, CrewAI is the better choice.&lt;/p&gt;
&lt;h3&gt;LangGraph Agents&lt;/h3&gt;
&lt;p&gt;LangChain&apos;s LangGraph framework adds structured workflow graphs to autonomous agents. Instead of letting the agent figure out the sequence of steps, you define a graph of nodes (tasks) and edges (transitions). The agent navigates the graph, executing nodes and deciding which path to take based on results.&lt;/p&gt;
&lt;p&gt;LangGraph shines for workflows where certain steps must happen in order. A code generation workflow might have: plan, implement, test, review, deploy. Each phase has different tools and success criteria. The graph structure enforces the sequence without hardcoding logic.&lt;/p&gt;
&lt;h3&gt;Paperclip Agent&lt;/h3&gt;
&lt;p&gt;Paperclip is a newer entrant focused on single-purpose autonomous agents. Instead of building a general-purpose agent that can do anything, Paperclip lets you spawn specialized agents for specific tasks: a PR reviewer agent, a dependency update agent, a documentation sync agent.&lt;/p&gt;
&lt;p&gt;Each Paperclip agent runs on its own schedule, monitors its trigger conditions, and executes only its designated function. The architecture keeps agents simple and reliable. If a PR reviewer agent breaks, the dependency updater keeps running. Paperclip is the microservices approach to agent architecture.&lt;/p&gt;
&lt;h3&gt;Claude Code Channels&lt;/h3&gt;
&lt;p&gt;Anthropic&apos;s research preview extends Claude Code into messaging platforms via MCP plugins. Your Claude Code agent lives in Telegram, Discord, or iMessage and executes code on your local development machine. It inherits all Claude Code features: skills, agents, MCP tools, and the full 1 million token context.&lt;/p&gt;
&lt;p&gt;Code Channels requires Anthropic Max ($100-200 per month). The agent stops if Claude Code stops, so it is session-bound rather than truly 24/7. But for developers who already pay for Claude and want mobile access to their coding agent, it fills a specific gap.&lt;/p&gt;
&lt;h3&gt;Devin&lt;/h3&gt;
&lt;p&gt;Cognition&apos;s Devin was the first &amp;quot;AI software engineer&amp;quot; to capture mainstream attention, and it has matured into a production tool used by Goldman Sachs in a hybrid workforce model of 12,000 human developers plus agents.&lt;/p&gt;
&lt;p&gt;Devin spins up a full cloud VM with browser, terminal, and editor. You assign tasks via Slack or web UI, and Devin delivers a pull request with tests and documentation. The pricing is $20 per month for Core plus ACU compute at $9 per hour of active work. The team plan runs $500 per month with 250 ACUs.&lt;/p&gt;
&lt;p&gt;Devin is the most polished cloud agent, but it is also the most expensive for heavy usage. The code leaves your infrastructure, which is a blocker for some enterprises.&lt;/p&gt;
&lt;h3&gt;Cursor Background Agents&lt;/h3&gt;
&lt;p&gt;Cursor&apos;s background agent mode uses an isolated Ubuntu VM that clones your repo and works on an &lt;code&gt;agent/&lt;/code&gt; branch. The February 2026 upgrade added Computer Use, letting the agent test GUI changes by controlling a desktop environment.&lt;/p&gt;
&lt;p&gt;Multiple agents can work in parallel, and a typical pull request costs around $4-5 in compute. The downside is that it is tied to Cursor IDE, so you need to run Cursor for background agents to function.&lt;/p&gt;
&lt;h3&gt;GitHub Copilot Coding Agent&lt;/h3&gt;
&lt;p&gt;The Copilot Coding Agent works directly from GitHub issues. You assign an issue, and the agent creates a branch, implements the feature, writes tests, and opens a pull request. No context switching, no explanation needed.&lt;/p&gt;
&lt;p&gt;Pricing runs $10-39 per seat per month depending on the plan. GitHub is switching to usage-based billing in June 2026, which will change the cost calculus. The agent works best for well-scoped issues like bug fixes, tests, and documentation. Complex architectural changes still need human guidance.&lt;/p&gt;
&lt;h3&gt;Jules (Google)&lt;/h3&gt;
&lt;p&gt;Google&apos;s Jules runs on Gemini 2.5 Pro and integrates with GitHub. It clones your repository into Google Cloud VMs, implements changes, and opens pull requests. While in free preview, it has no production dependency guarantee yet.&lt;/p&gt;
&lt;p&gt;Jules is the most generous cloud agent in terms of cost, but it is also the least mature. The Gemini-powered reasoning is strong, and the free tier makes it worth trying. Relying on it for production work is premature.&lt;/p&gt;
&lt;h3&gt;OpenAI Codex Cloud Agents&lt;/h3&gt;
&lt;p&gt;Beyond the CLI version, OpenAI runs cloud-hosted agents inside sandboxed environments via ChatGPT or the API. Token-based pricing at $1.50 per million input tokens and $6 per million output tokens through the &lt;code&gt;codex-mini-latest&lt;/code&gt; model.&lt;/p&gt;
&lt;p&gt;Codex cloud agents support multi-agent runs and can handle long autonomous sessions. The desktop app (macOS and Windows) wraps these capabilities in a GUI. For teams already in the OpenAI ecosystem, this is the most natural extension of their existing workflow.&lt;/p&gt;
&lt;h3&gt;OpenHands&lt;/h3&gt;
&lt;p&gt;OpenHands (formerly OpenDevin) is an open-source platform for autonomous coding agents. It operates in a Docker sandbox with a web interface, terminal, and file explorer. Agents can write code, run commands, browse the web, and interact with APIs.&lt;/p&gt;
&lt;p&gt;The project focuses on reproducibility and safety. Every agent action is logged, containerized, and auditable. It does not have the polish of Devin or the scale of OpenClaw, but for teams that want full control over agent behavior and data, OpenHands is a strong choice.&lt;/p&gt;
&lt;h2&gt;Model Routers: The Plumbing Layer&lt;/h2&gt;
&lt;p&gt;Every agent needs a brain, and the model router is the switchboard that connects agents to the right model at the right time. This category has grown from simple API proxies into intelligent routing systems that optimize for cost, latency, and capability simultaneously.&lt;/p&gt;
&lt;h3&gt;OpenRouter&lt;/h3&gt;
&lt;p&gt;OpenRouter is the most widely used model router with the largest model catalog. It provides one unified API for every major model provider and many smaller ones. You send a request using the OpenAI SDK format, and OpenRouter routes it to the model you specify.&lt;/p&gt;
&lt;p&gt;The v2 &amp;quot;Smart Routing&amp;quot; feature automatically picks the cheapest model that meets your requirements based on capability tags. Semantic caching reuses responses for similar queries, reducing costs by up to 60 percent. OpenRouter handles fallback logic, so if one provider is down, traffic routes to another.&lt;/p&gt;
&lt;p&gt;OpenRouter processed billions of tokens per day as of early 2026. It is the default model router for most open-source agent projects including OpenCode, Hermes, and Cline. The free tier includes access to 27 models with no credit card required.&lt;/p&gt;
&lt;h3&gt;Nous Portal&lt;/h3&gt;
&lt;p&gt;Nous Research&apos;s model gateway is integrated into Hermes Agent and provides access to 200-plus models. It optimizes for agentic workflows specifically: chain-of-thought traces, tool call formatting, and structured output are first-class concerns, not afterthoughts.&lt;/p&gt;
&lt;p&gt;The Portal supports custom endpoint configuration and OpenRouter as a fallback. It is designed for developers who want fine-grained control over model selection for different task types. Complex reasoning routes to expensive models, while file operations use cheaper local models.&lt;/p&gt;
&lt;p&gt;Nous Portal is younger than OpenRouter but growing fast because it ships with Hermes Agent by default. If you run Hermes, you are already using it.&lt;/p&gt;
&lt;h3&gt;OpenCode Zen&lt;/h3&gt;
&lt;p&gt;OpenCode Zen is the model routing layer within the OpenCode ecosystem. It abstracts model selection behind capability profiles. You define what you need: &amp;quot;fast edit&amp;quot; or &amp;quot;deep reasoning&amp;quot; or &amp;quot;code review.&amp;quot; Zen picks the cheapest model that satisfies the profile.&lt;/p&gt;
&lt;p&gt;The SLIM plugin system lets you define custom routing rules. OpenCode Zen also supports multi-model conversations where different turns go to different models. The first turn uses Sonnet for planning, and subsequent turns use a local Qwen model for execution.&lt;/p&gt;
&lt;h3&gt;OpenRouter Smart Routing&lt;/h3&gt;
&lt;p&gt;A separate mention because Smart Routing in OpenRouter v2 deserves its own spotlight. This feature tags models by capability (reasoning, coding, vision, tool use, structured output, long context) and prices. Your request specifies requirements; OpenRouter finds the cheapest combination.&lt;/p&gt;
&lt;p&gt;Smart Routing cuts costs by 30 to 50 percent compared to manual model selection. The tradeoff is predictable latency. The cheapest model for a task is not always the fastest.&lt;/p&gt;
&lt;h3&gt;Portkey&lt;/h3&gt;
&lt;p&gt;Portkey started as an observability layer for LLMs and evolved into a full gateway. It offers caching, fallbacks, rate limiting, and guardrails alongside routing. The observability features include cost tracking, latency monitoring, and failure analysis.&lt;/p&gt;
&lt;p&gt;Portkey is more enterprise-oriented than OpenRouter. It is built for teams that need audit trails, compliance controls, and detailed analytics. The open-source self-hosted version gives you full data control.&lt;/p&gt;
&lt;h3&gt;LiteLLM&lt;/h3&gt;
&lt;p&gt;LiteLLM is the Python-native gateway that supports 100-plus providers through a consistent interface. It is lightweight by design, running as a single Python package or Docker container. The SDK translates between provider-specific formats automatically.&lt;/p&gt;
&lt;p&gt;LiteLLM is the default choice for Python projects that need model routing without adding a dependency on a cloud service. It handles rate limiting, retries, and fallback out of the box.&lt;/p&gt;
&lt;h3&gt;Helix (Kilo Code)&lt;/h3&gt;
&lt;p&gt;Kilo Code&apos;s built-in router, Helix, optimizes for coding agent workflows specifically. It understands which models excel at which coding tasks: code generation, refactoring, debugging, test writing, and routes accordingly.&lt;/p&gt;
&lt;p&gt;Helix supports 500-plus models across 60-plus providers. The real-time cost display shows exactly what each model choice costs per turn, which builds intuition about model economics over time.&lt;/p&gt;
&lt;h3&gt;Amazon Bedrock / Google Vertex AI&lt;/h3&gt;
&lt;p&gt;The cloud provider gateways are not the most exciting routers, but they are the most important for enterprise deployments. Bedrock and Vertex AI provide access to multiple models through a single API with enterprise security, compliance certifications, and SLA guarantees.&lt;/p&gt;
&lt;p&gt;Bedrock supports Anthropic, Meta, Mistral, Cohere, and Amazon&apos;s own models. Vertex AI supports Gemini, Claude, and select open models. They charge no markup on model calls, only infrastructure and gateway fees.&lt;/p&gt;
&lt;h3&gt;Gateway Providers (Kong, Azure API Management, Apigee)&lt;/h3&gt;
&lt;p&gt;For organizations that already use API gateways for their microservices, extending them to LLM routing is a natural step. Kong&apos;s AI Gateway, Azure API Management&apos;s model routing, and Google Apigee all support LLM request routing with the same governance controls applied to regular APIs.&lt;/p&gt;
&lt;p&gt;These tools are not designed for individual developers. They are for platform teams that need to centralize LLM access controls, cost allocation, and compliance across their organization.&lt;/p&gt;
&lt;h3&gt;Custom Routing with LangChain / LlamaIndex&lt;/h3&gt;
&lt;p&gt;Some teams build their own routers using LangChain or LlamaIndex. The advantage is complete control over routing logic. You can implement priority queues, multi-model voting, or progressive escalation where a cheaper model handles the first pass and a more expensive one reviews the output.&lt;/p&gt;
&lt;p&gt;The disadvantage is operational complexity. Running your own router means maintaining your own provider integrations, fallback logic, and cost tracking. For most teams, OpenRouter or LiteLLM is the better starting point.&lt;/p&gt;
&lt;h3&gt;AI Gateway by Portkey&lt;/h3&gt;
&lt;p&gt;Portkey&apos;s AI Gateway deserves a second look because it goes beyond routing into full lifecycle management. It offers caching at multiple levels (semantic, exact, prefix), request-level guardrails that block harmful or off-topic prompts before they reach the model, and usage-based billing controls that prevent budget overruns.&lt;/p&gt;
&lt;p&gt;The enterprise version adds SOC 2 compliance, audit logs, and role-based access control. Portkey is the right choice when your organization needs to govern, not just route, model usage.&lt;/p&gt;
&lt;h3&gt;Helicone&lt;/h3&gt;
&lt;p&gt;Helicone focuses on observability for model routers. It captures every request and response, builds usage dashboards, and alerts on cost spikes or latency degradation. It integrates with OpenRouter, LiteLLM, and custom endpoints through a proxy layer.&lt;/p&gt;
&lt;p&gt;Helicone does not route traffic itself. It sits alongside your router and makes the data visible. For teams that want to understand their model spend before optimizing it, Helicone provides the baseline.&lt;/p&gt;
&lt;h3&gt;OpenRouter Model Rankings&lt;/h3&gt;
&lt;p&gt;OpenRouter publishes monthly model rankings based on actual usage data across its platform. The April 2026 rankings showed MiMo V2 Pro at number one with 4.65 trillion tokens processed, followed by Qwen 3.6 Plus at number three. Xiaomi held 22.3 percent of total market share by model count.&lt;/p&gt;
&lt;p&gt;These rankings matter because they reveal what developers actually use, not what benchmarks say. A model that scores high on SWE-Bench but costs five times the runner-up will not see as much production traffic. The rankings are a reality check against benchmark hype.&lt;/p&gt;
&lt;h3&gt;Multi-Model Routing Strategies&lt;/h3&gt;
&lt;p&gt;Beyond specific tools, the routing strategies themselves deserve attention. The most common pattern in 2026 is tiered routing: a cheap local model handles syntax corrections and quick completions, a mid-tier cloud model handles code generation and refactoring, and an expensive reasoning model only activates for architecture decisions and complex bug diagnosis.&lt;/p&gt;
&lt;p&gt;Another pattern gaining traction is ensemble routing, where two models independently solve the same problem and a third model evaluates both solutions. This catches hallucinations by cross-checking outputs. The token cost doubles or triples, but for safety-critical code, the redundancy is worth it.&lt;/p&gt;
&lt;p&gt;Some teams use router-as-judge patterns where the router itself is a lightweight model that evaluates task complexity and routes accordingly. The router model costs pennies per request and prevents expensive models from being wasted on trivial tasks.&lt;/p&gt;
&lt;h2&gt;Choosing the Right Stack&lt;/h2&gt;
&lt;p&gt;There is no single best agentic coding setup. The right combination depends on your workflow, budget, and tolerance for complexity.&lt;/p&gt;
&lt;p&gt;For terminal purists who want maximum capability per dollar, Claude Code with OpenRouter fallback covers most scenarios. Add Hermes Agent for async background tasks, and you have a setup that handles both interactive coding and unattended maintenance.&lt;/p&gt;
&lt;p&gt;For IDE-first developers, Cursor or Windsurf with Claude Code as the background agent gives you the polished editing experience with Cursor&apos;s tab completions and Claude Code&apos;s reasoning capability when you need deep context.&lt;/p&gt;
&lt;p&gt;For teams that want to delegate entirely, OpenClaw or Hermes Agent connected to Slack or Discord, backed by OpenRouter for model routing, lets your team assign tasks through chat and review pull requests when agents finish.&lt;/p&gt;
&lt;p&gt;The model router matters more than most developers think. The difference between paying full retail for Claude Opus and using OpenRouter&apos;s smart routing is often 40 to 60 percent savings. For heavy users, that savings pays for a router subscription several times over.&lt;/p&gt;
&lt;h2&gt;Tradeoffs and Limitations&lt;/h2&gt;
&lt;p&gt;Every tool in this list has blind spots.&lt;/p&gt;
&lt;p&gt;CLI agents are powerful but remove visual feedback. You cannot easily verify UI changes from a terminal.&lt;/p&gt;
&lt;p&gt;Desktop IDEs offer the best integration but lock you into their ecosystem. Moving from Cursor to Windsurf to Antigravity means learning new workflows each time.&lt;/p&gt;
&lt;p&gt;24/7 agents are asynchronous by nature. You give them a task and come back later. For quick edits, the round trip time is worse than just making the change yourself.&lt;/p&gt;
&lt;p&gt;Model routers add a layer of abstraction that can fail. When OpenRouter is down, every tool downstream stops working. Self-hosted routers like LiteLLM avoid this but add operational overhead.&lt;/p&gt;
&lt;p&gt;None of these tools understand your business context. They can generate syntactically correct code that solves the wrong problem. Code review by a human who understands the domain is not optional.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The agentic coding tool landscape in 2026 is defined by diversity and choice. Four years ago, you had GitHub Copilot completions and not much else. Now you have specialized CLI agents, integrated IDEs, autonomous background workers, and intelligent routing that optimizes every API call.&lt;/p&gt;
&lt;p&gt;Start with one category. If you live in the terminal, try Claude Code or OpenCode. If you prefer a GUI, Cursor or Windsurf. If you want to delegate background work, OpenClaw or Hermes Agent. Connect everything through OpenRouter or LiteLLM for model routing.&lt;/p&gt;
&lt;p&gt;Stick with that stack for a month. See what works, what frustrates you, and what you wish the tools did differently. The ecosystem is moving fast enough that a gap today might be a feature next month. That pace is exciting, but it also means the best setup is the one you actually use.&lt;/p&gt;
&lt;p&gt;If this deep dive got you thinking about how agentic systems fit into the bigger picture of data architecture and AI workflows, I have written extensively on both topics. Check out my books on data architecture and agentic AI at &lt;a href=&quot;http://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Agentic Lakehouse Concurrency and Isolation</title><link>https://iceberglakehouse.com/posts/agentic-lakehouse-concurrency-isolation-contracts/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/agentic-lakehouse-concurrency-isolation-contracts/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-agentic-lakehouse-concurrency-is...</description><pubDate>Mon, 08 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-agentic-lakehouse-concurrency-isolation-contracts/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Agentic writes need isolation contracts, not just write permissions. That is the useful lens for agentic lakehouse concurrency in June 2026. The market is not short on announcements. What matters is whether the new pattern changes ownership, performance, governance, and agent readiness in a way your team can operate.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/agentic-lakehouse-concurrency-isolation-contracts-diagram-1.png&quot; alt=&quot;agentic lakehouse concurrency architecture diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The market signal behind agentic lakehouse concurrency&lt;/h2&gt;
&lt;p&gt;A human analyst might run one update after checking a result. A fleet of agents can generate many reads, writes, retries, and corrections in parallel. Iceberg&apos;s optimistic concurrency model helps, but the platform still needs orchestration rules.&lt;/p&gt;
&lt;p&gt;I care about this topic because it sits at the boundary between open data architecture and AI execution. Most companies are not choosing one engine for every workload anymore. They have warehouses, lakehouse engines, streaming systems, catalogs, metadata platforms, and now agents that ask for data through tools. The shared contract between those systems matters more than any single feature checkbox.&lt;/p&gt;
&lt;p&gt;The vendor-neutral reading is straightforward. If the underlying table and catalog standards get stronger, buyers get more freedom to choose the right engine for each job. Snowflake, Microsoft, ClickHouse, Atlan, Dremio, and the open-source Iceberg ecosystem all point to the same market reality: data platforms are becoming multi-engine and agent-facing.&lt;/p&gt;
&lt;h2&gt;How the architecture works&lt;/h2&gt;
&lt;p&gt;Iceberg commits are snapshot-based. Writers prepare metadata changes and commit them if the table state has not changed in a conflicting way.&lt;/p&gt;
&lt;p&gt;Conflict detection protects table consistency, but it does not decide whether an agent&apos;s business action was wise.&lt;/p&gt;
&lt;p&gt;Isolation contracts define what an agent can write, which partitions it can touch, how retries work, and who reviews risky operations.&lt;/p&gt;
&lt;p&gt;The important architectural habit is to separate responsibilities. The table format manages files, snapshots, schema evolution, and table metadata. The catalog manages identity, namespaces, commits, and access patterns. The query engine plans and executes work. The semantic layer maps raw data into business meaning. The agent interface decides which safe tools a model can call.&lt;/p&gt;
&lt;p&gt;That separation keeps the system honest. If a vendor says a workload is open, ask which layer is open. If a feature supports Iceberg, ask which Iceberg version, which operations, and which engines. If an agent can query data, ask whether it is querying raw tables or certified semantic views.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/agentic-lakehouse-concurrency-isolation-contracts-diagram-2.png&quot; alt=&quot;Operating model diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;A concrete operating example&lt;/h2&gt;
&lt;p&gt;Two pricing agents may try to update recommendations for the same product family. Iceberg can protect the table commit, but orchestration must prevent a retry loop that keeps fighting over the same partition.&lt;/p&gt;
&lt;p&gt;That example is intentionally operational. Architecture diagrams are useful, but the design only proves itself when a real workload runs through it. I want to know who owns the table, which catalog authorizes the operation, which engine writes, which engine reads, which semantic view users see, and how the team detects a bad result.&lt;/p&gt;
&lt;p&gt;For agentic analytics, the same example gets stricter. A human analyst can notice ambiguity and ask a teammate. An agent will often keep going unless the tool interface stops it. That means your architecture needs approved definitions, scoped access, query limits, logging, and a clean rollback path before it needs a flashy chat experience.&lt;/p&gt;
&lt;p&gt;This is why I do not treat open table formats as the whole story. Apache Iceberg gives the platform a strong storage contract. It does not, by itself, define customer lifetime value, revenue recognition rules, data owner approval, or what an AI agent may do after it finds an anomaly. Those rules belong in catalogs, semantic layers, governance systems, and agent tools.&lt;/p&gt;
&lt;h2&gt;What this means for the lakehouse&lt;/h2&gt;
&lt;p&gt;A lakehouse platform needs five capabilities to serve agents reliably: query federation to reduce data movement; autonomous performance using Reflections, caching, and table optimization so interactive loops stay fast; an AI Semantic Layer that gives agents approved business context; agentic interfaces through the UI, Python, or MCP-connected tools; and AI SQL functions that bring model-assisted work into SQL without exporting data.&lt;/p&gt;
&lt;h2&gt;Implementation checklist&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;What to document&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Table contract&lt;/td&gt;
&lt;td&gt;Format version, schema rules, snapshot policy, and rollback plan&lt;/td&gt;
&lt;td&gt;Engines need the same understanding of the table.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalog authority&lt;/td&gt;
&lt;td&gt;Production catalog, namespaces, commit rules, and role model&lt;/td&gt;
&lt;td&gt;Multi-engine systems need one source of table truth.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine matrix&lt;/td&gt;
&lt;td&gt;Read, write, merge, delete, schema, and view support by engine&lt;/td&gt;
&lt;td&gt;A feature is not production-ready until the exact operation is tested.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic layer&lt;/td&gt;
&lt;td&gt;Certified views, metric definitions, owners, and labels&lt;/td&gt;
&lt;td&gt;Agents need business meaning, not raw schemas alone.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Credential model, token lifetime, row filters, column masks, and audit logs&lt;/td&gt;
&lt;td&gt;Open access still needs strict governance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations&lt;/td&gt;
&lt;td&gt;Compaction, vacuum, retries, alerting, and incident ownership&lt;/td&gt;
&lt;td&gt;The design must survive failed jobs and bad deploys.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;My practical checklist for this topic is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Prefer append-only agent logs before direct table updates.&lt;/li&gt;
&lt;li&gt;Partition write targets by agent, domain, or time window where possible.&lt;/li&gt;
&lt;li&gt;Use idempotency keys for every agent action.&lt;/li&gt;
&lt;li&gt;Require human approval for writes that affect executive metrics or regulated data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those items are not written down, the project is still in the demo stage. That does not mean the idea is weak. It means the operating model is not finished.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/agentic-lakehouse-concurrency-isolation-contracts-diagram-3.png&quot; alt=&quot;Implementation checklist diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Failure modes worth respecting&lt;/h2&gt;
&lt;p&gt;Optimistic concurrency performs poorly if too many writers target the same files or partitions. Agents need partition discipline and backoff rules.&lt;/p&gt;
&lt;p&gt;The other failure mode is semantic drift. A table can be technically valid while the business definition on top of it changes quietly. That is where many AI analytics projects fail. The model generates SQL against a table that exists, the query returns rows, and the answer looks plausible. The problem is that the answer used the wrong grain, the wrong filter, or the wrong metric definition.&lt;/p&gt;
&lt;p&gt;The fix is not a longer prompt. The fix is stronger data contracts. Certified semantic views should be easier for agents to use than raw tables. Sensitive columns should be masked or hidden before the model can ask for them. Write-capable tools should require intent, validation, and idempotency. Expensive queries should have limits. Every tool call should leave evidence.&lt;/p&gt;
&lt;p&gt;This is also where vendor-neutral thinking helps. Do not trust a platform because it has the best demo. Trust the platform when it gives you clear contracts between storage, catalog, semantic layer, engine, and agent. Trust it more when you can test those contracts with another engine or another client.&lt;/p&gt;
&lt;h2&gt;What I would do first&lt;/h2&gt;
&lt;p&gt;Start with one production-shaped workflow. Do not start with the easiest toy table, and do not start with the most politically sensitive workload. Pick a table or semantic view that matters, has an owner, has known correctness checks, and can tolerate a controlled pilot.&lt;/p&gt;
&lt;p&gt;For agentic lakehouse concurrency, I would write down five things before touching production: the owner, the accepted engines, the policy boundary, the rollback path, and the agent-facing interface. Then I would run the same workflow three ways: manually, through the intended query engine, and through the agent or automation layer. Differences between those paths are where the real work begins.&lt;/p&gt;
&lt;p&gt;Measure boring things. Count files. Count snapshots. Track query planning time. Track storage calls. Track failed commits. Track token issuance. Track denied access. Track whether a human can explain the result without reading tool logs for an hour. These metrics are not glamorous, but they tell you whether the architecture is ready.&lt;/p&gt;
&lt;h2&gt;Final recommendation&lt;/h2&gt;
&lt;p&gt;The right conclusion is not that every team should adopt every June 2026 feature immediately. The right conclusion is that the lakehouse is becoming an execution surface for humans and agents, and that changes the quality bar. Open storage is necessary. Governed catalogs are necessary. Semantic context is necessary. Fast SQL is necessary. Scoped agent tools are necessary.&lt;/p&gt;
&lt;p&gt;That combination is exactly why the Agentic Lakehouse is becoming the right framing. It describes the platform you need when AI agents stop answering isolated questions and start participating in analytical workflows.&lt;/p&gt;
&lt;p&gt;For more background on the lakehouse and AI side of this work, explore my books on data lakehouses and AI at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;. If you want to try this style of governed, open, agent-ready architecture in practice, start a free trial of Dremio&apos;s Agentic Lakehouse at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Field notes for teams evaluating this now&lt;/h2&gt;
&lt;p&gt;First, make compatibility visible. A table-format version, catalog endpoint, and engine release should appear in your runbook. If a production issue happens, nobody should have to guess which engine wrote the latest snapshot or which client introduced a metadata change.&lt;/p&gt;
&lt;p&gt;Second, keep the semantic layer close to the workflow. If the article topic affects analytics agents, customer-facing metrics, financial reporting, or regulated data, raw-table access should be the exception. Certified views should be the normal path.&lt;/p&gt;
&lt;p&gt;Third, separate experimentation from certification. Engineers need sandboxes where they can test new Iceberg features, catalog options, and agent tools. Business users and agents need certified surfaces where definitions, owners, and policies have already been reviewed.&lt;/p&gt;
&lt;p&gt;Fourth, keep the architecture open. Not every byte must move into one platform. An architecture that can query data in place, add semantic context, accelerate common workloads, and expose governed agent interfaces over open data creates more flexibility.&lt;/p&gt;
&lt;p&gt;Fifth, publish the limits. If a feature is read-only in one engine, say so. If write interoperability is approved only for append workloads, say so. If remote signing is required for regulated tables, say so. Clear limits create trust. Hidden limits create incidents.&lt;/p&gt;
&lt;h2&gt;Identity and access review&lt;/h2&gt;
&lt;p&gt;For agentic lakehouse concurrency, I would run one full dry run with production-like identities. Use an analyst identity, a service account, and the intended agent identity. Confirm that each identity sees only the expected semantic objects, receives predictable errors, and leaves useful audit records. That test catches policy gaps before they become production incidents.&lt;/p&gt;
&lt;p&gt;The agent identity matters most because it is easy to over-permission during a pilot. If the agent only needs a certified revenue view, do not give it namespace-wide table discovery. If the agent needs row-level access for one geography, test that a second geography returns a denial instead of silent leakage.&lt;/p&gt;
&lt;h2&gt;Documentation that actually helps&lt;/h2&gt;
&lt;p&gt;The documentation should fit on one page. Name the owner, the supported engines, the catalog authority, the accepted table operations, the security model, and the rollback path. If a new engineer cannot understand the contract for agentic lakehouse concurrency from that page, the architecture is still too implicit.&lt;/p&gt;
&lt;p&gt;Good documentation is not a wiki dump. It is an operating contract. It should say who can approve a schema change, which engine owns compaction, how long snapshots are retained, and what happens when an agent produces a suspicious result. That level of detail is what turns a promising pattern into a maintainable system.&lt;/p&gt;
&lt;h2&gt;How to keep agents in bounds&lt;/h2&gt;
&lt;p&gt;Agents should not receive broad table access just because a human can ask broad questions. For agentic lakehouse concurrency, expose narrow tools over certified views first. Add write-capable tools only after you have validation rules, idempotency keys, approval gates, and audit records that a reviewer can follow.&lt;/p&gt;
&lt;p&gt;The tool description should also be honest. If a tool returns estimated data, say estimated. If a tool excludes delayed transactions, say that. If a tool is read-only, make that clear in the name and policy. Agents work better when the interface gives them fewer chances to infer the wrong contract.&lt;/p&gt;
&lt;h2&gt;What to measure after launch&lt;/h2&gt;
&lt;p&gt;The first production month should be measurement-heavy. Track planning time, query latency, failed commits, denied access attempts, credential issuance, snapshot growth, and semantic-view usage. If agentic lakehouse concurrency is helping, the evidence should show up in fewer manual workarounds and clearer operational ownership.&lt;/p&gt;
&lt;p&gt;I would also track human trust signals. Are analysts using the certified view more often? Are engineers filing fewer tickets about unclear table ownership? Are agents producing answers that reviewers can trace back to approved definitions? Those signals tell you whether the architecture is improving daily work, not just passing a benchmark.&lt;/p&gt;
&lt;h2&gt;A buyer question worth asking&lt;/h2&gt;
&lt;p&gt;The buyer question is simple: does this pattern increase choice without weakening governance? For agentic lakehouse concurrency, the best answer is specific. It should name the table format, catalog contract, semantic surface, security controls, and engine support matrix. Anything less is a demo, not an operating model.&lt;/p&gt;
&lt;p&gt;This is where the architecture should stay disciplined. The point is not that open architecture is automatically better. The point is that open architecture gives you room to test engines, keep data in place, add semantic context, and still maintain control. That is a stronger argument than a generic platform claim.&lt;/p&gt;
&lt;h2&gt;A realistic rollout sequence&lt;/h2&gt;
&lt;p&gt;The rollout should start with read visibility, then move to operational automation, then consider action loops. For agentic lakehouse concurrency, the first milestone is a certified read path with approved semantics. The second milestone is repeatable validation through CI or scheduled checks. The third milestone is agent access with narrow tools and strict audit.&lt;/p&gt;
&lt;p&gt;Write paths should come later unless the topic itself is about write interoperability or table maintenance. Even then, begin with append-only or isolated writes. Updates, deletes, merges, and external actions need stronger controls because they change the state other people depend on.&lt;/p&gt;
&lt;h2&gt;How this should sound to executives&lt;/h2&gt;
&lt;p&gt;The executive version should avoid implementation trivia, but it should not become vague. Say that agentic lakehouse concurrency helps the company keep analytical data open, governed, and ready for AI-assisted work. Then say what the team will measure: cost, speed, correctness, access control, and operational effort.&lt;/p&gt;
&lt;p&gt;That framing is useful because executives do not need every catalog detail. They do need to know whether the architecture reduces lock-in, improves reliability, and gives agents a trustworthy data foundation. Those are business outcomes tied to technical choices.&lt;/p&gt;
&lt;h2&gt;How this should sound to engineers&lt;/h2&gt;
&lt;p&gt;The engineering version should be blunt. Which APIs are used? Which engine versions are approved? Which table operations are allowed? Which failures are retried? Which failures stop the workflow? Which logs prove that the right identity performed the right operation?&lt;/p&gt;
&lt;p&gt;For agentic lakehouse concurrency, those questions are more valuable than broad claims. They force the team to define the boundary between the open standard, the vendor implementation, the query engine, the semantic model, and the agent tool.&lt;/p&gt;
&lt;h2&gt;What not to automate yet&lt;/h2&gt;
&lt;p&gt;Do not automate the parts of agentic lakehouse concurrency that the team cannot explain manually. If nobody can explain the metric, the agent should not calculate it. If nobody can explain rollback, the agent should not write. If nobody can explain the security boundary, the tool should stay internal.&lt;/p&gt;
&lt;p&gt;This is not anti-automation. It is how automation earns trust. Automate the parts with clear contracts first, then widen the scope as evidence accumulates.&lt;/p&gt;
&lt;h2&gt;Source-of-truth ownership&lt;/h2&gt;
&lt;p&gt;Every production rollout needs one named source of truth for each layer. The table has an owner. The catalog has an owner. The semantic view has an owner. The agent tool has an owner. For agentic lakehouse concurrency, those owners may sit on different teams, but the contract between them has to be explicit.&lt;/p&gt;
&lt;p&gt;Clear ownership across all layers keeps the architecture credible, whether the governed execution and semantic layer lives in one platform or across several independent services.&lt;/p&gt;
&lt;p&gt;Clear ownership prevents avoidable production confusion.&lt;/p&gt;
&lt;h2&gt;Review cadence&lt;/h2&gt;
&lt;p&gt;Set a review cadence before the first production launch. For agentic lakehouse concurrency, I would review the contract after the first week, after the first month, and after the first engine or catalog upgrade. Most problems appear when a workflow that worked in a pilot meets a new version, a new identity, or a new business definition.&lt;/p&gt;
&lt;p&gt;That review should include both platform engineers and business owners. Engineers can verify the mechanics. Business owners can verify that the answers still mean what the company thinks they mean.&lt;/p&gt;
&lt;h2&gt;Launch criteria&lt;/h2&gt;
&lt;p&gt;The launch criteria should be binary. Either agentic lakehouse concurrency has a named owner, passing validation checks, approved security boundaries, working rollback, and documented engine support, or it is not ready. Gray areas are acceptable in a research project. They are expensive in production.&lt;/p&gt;
&lt;p&gt;This keeps the article&apos;s recommendation practical: prove the contract first, then widen adoption.&lt;/p&gt;
&lt;h2&gt;Compliance evidence&lt;/h2&gt;
&lt;p&gt;Save the evidence. For agentic lakehouse concurrency, keep validation output, approval records, denied-access tests, and rollback proof with the release notes. Future audits are easier when the team can show what it tested before launch.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Anatomy of an Agentic Lakehouse</title><link>https://iceberglakehouse.com/posts/anatomy-agentic-lakehouse-four-layers/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/anatomy-agentic-lakehouse-four-layers/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-anatomy-agentic-lakehouse-four-l...</description><pubDate>Mon, 08 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-anatomy-agentic-lakehouse-four-layers/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;An Agentic Lakehouse is storage, catalog governance, semantic context, and agents working as one operating model. That is the useful lens for agentic lakehouse in June 2026. The market is not short on announcements. What matters is whether the new pattern changes ownership, performance, governance, and agent readiness in a way your team can operate.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/anatomy-agentic-lakehouse-four-layers-diagram-1.png&quot; alt=&quot;agentic lakehouse architecture diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The market signal behind agentic lakehouse&lt;/h2&gt;
&lt;p&gt;A chatbot over raw tables is not an Agentic Lakehouse. The architecture has to support discovery, query execution, optimization, policy enforcement, audit, and controlled action.&lt;/p&gt;
&lt;p&gt;I care about this topic because it sits at the boundary between open data architecture and AI execution. Most companies are not choosing one engine for every workload anymore. They have warehouses, lakehouse engines, streaming systems, catalogs, metadata platforms, and now agents that ask for data through tools. The shared contract between those systems matters more than any single feature checkbox.&lt;/p&gt;
&lt;p&gt;The vendor-neutral reading is straightforward. If the underlying table and catalog standards get stronger, buyers get more freedom to choose the right engine for each job. Snowflake, Microsoft, ClickHouse, Atlan, Dremio, and the open-source Iceberg ecosystem all point to the same market reality: data platforms are becoming multi-engine and agent-facing.&lt;/p&gt;
&lt;h2&gt;How the architecture works&lt;/h2&gt;
&lt;p&gt;The storage layer should use open table formats such as Apache Iceberg.&lt;/p&gt;
&lt;p&gt;The catalog layer should manage table identity, commits, access, and metadata.&lt;/p&gt;
&lt;p&gt;The semantic and agent layers should expose business-approved objects and safe tools.&lt;/p&gt;
&lt;p&gt;The important architectural habit is to separate responsibilities. The table format manages files, snapshots, schema evolution, and table metadata. The catalog manages identity, namespaces, commits, and access patterns. The query engine plans and executes work. The semantic layer maps raw data into business meaning. The agent interface decides which safe tools a model can call.&lt;/p&gt;
&lt;p&gt;That separation keeps the system honest. If a vendor says a workload is open, ask which layer is open. If a feature supports Iceberg, ask which Iceberg version, which operations, and which engines. If an agent can query data, ask whether it is querying raw tables or certified semantic views.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/anatomy-agentic-lakehouse-four-layers-diagram-2.png&quot; alt=&quot;Operating model diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;A concrete operating example&lt;/h2&gt;
&lt;p&gt;A sales agent that investigates pipeline risk should not scan random warehouse tables. It should query certified opportunity views, respect territory permissions, use approved definitions, and write any follow-up action to an auditable system.&lt;/p&gt;
&lt;p&gt;That example is intentionally operational. Architecture diagrams are useful, but the design only proves itself when a real workload runs through it. I want to know who owns the table, which catalog authorizes the operation, which engine writes, which engine reads, which semantic view users see, and how the team detects a bad result.&lt;/p&gt;
&lt;p&gt;For agentic analytics, the same example gets stricter. A human analyst can notice ambiguity and ask a teammate. An agent will often keep going unless the tool interface stops it. That means your architecture needs approved definitions, scoped access, query limits, logging, and a clean rollback path before it needs a flashy chat experience.&lt;/p&gt;
&lt;p&gt;This is why I do not treat open table formats as the whole story. Apache Iceberg gives the platform a strong storage contract. It does not, by itself, define customer lifetime value, revenue recognition rules, data owner approval, or what an AI agent may do after it finds an anomaly. Those rules belong in catalogs, semantic layers, governance systems, and agent tools.&lt;/p&gt;
&lt;h2&gt;What this means for the lakehouse&lt;/h2&gt;
&lt;p&gt;A lakehouse platform needs five capabilities to serve agents reliably: query federation to reduce data movement; autonomous performance using Reflections, caching, and table optimization so interactive loops stay fast; an AI Semantic Layer that gives agents approved business context; agentic interfaces through the UI, Python, or MCP-connected tools; and AI SQL functions that bring model-assisted work into SQL without exporting data.&lt;/p&gt;
&lt;h2&gt;Implementation checklist&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;What to document&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Table contract&lt;/td&gt;
&lt;td&gt;Format version, schema rules, snapshot policy, and rollback plan&lt;/td&gt;
&lt;td&gt;Engines need the same understanding of the table.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalog authority&lt;/td&gt;
&lt;td&gt;Production catalog, namespaces, commit rules, and role model&lt;/td&gt;
&lt;td&gt;Multi-engine systems need one source of table truth.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine matrix&lt;/td&gt;
&lt;td&gt;Read, write, merge, delete, schema, and view support by engine&lt;/td&gt;
&lt;td&gt;A feature is not production-ready until the exact operation is tested.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic layer&lt;/td&gt;
&lt;td&gt;Certified views, metric definitions, owners, and labels&lt;/td&gt;
&lt;td&gt;Agents need business meaning, not raw schemas alone.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Credential model, token lifetime, row filters, column masks, and audit logs&lt;/td&gt;
&lt;td&gt;Open access still needs strict governance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations&lt;/td&gt;
&lt;td&gt;Compaction, vacuum, retries, alerting, and incident ownership&lt;/td&gt;
&lt;td&gt;The design must survive failed jobs and bad deploys.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;My practical checklist for this topic is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Start with certified semantic views over the most important datasets.&lt;/li&gt;
&lt;li&gt;Add agent tools only after policy and audit are in place.&lt;/li&gt;
&lt;li&gt;Use Reflections and table optimization to keep exploratory loops fast.&lt;/li&gt;
&lt;li&gt;Expand from read-only investigation to controlled action only after validation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those items are not written down, the project is still in the demo stage. That does not mean the idea is weak. It means the operating model is not finished.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/anatomy-agentic-lakehouse-four-layers-diagram-3.png&quot; alt=&quot;Implementation checklist diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Failure modes worth respecting&lt;/h2&gt;
&lt;p&gt;Autonomy without policy just creates faster mistakes. The architecture must make the allowed path easier than the unsafe path.&lt;/p&gt;
&lt;p&gt;The other failure mode is semantic drift. A table can be technically valid while the business definition on top of it changes quietly. That is where many AI analytics projects fail. The model generates SQL against a table that exists, the query returns rows, and the answer looks plausible. The problem is that the answer used the wrong grain, the wrong filter, or the wrong metric definition.&lt;/p&gt;
&lt;p&gt;The fix is not a longer prompt. The fix is stronger data contracts. Certified semantic views should be easier for agents to use than raw tables. Sensitive columns should be masked or hidden before the model can ask for them. Write-capable tools should require intent, validation, and idempotency. Expensive queries should have limits. Every tool call should leave evidence.&lt;/p&gt;
&lt;p&gt;This is also where vendor-neutral thinking helps. Do not trust a platform because it has the best demo. Trust the platform when it gives you clear contracts between storage, catalog, semantic layer, engine, and agent. Trust it more when you can test those contracts with another engine or another client.&lt;/p&gt;
&lt;h2&gt;What I would do first&lt;/h2&gt;
&lt;p&gt;Start with one production-shaped workflow. Do not start with the easiest toy table, and do not start with the most politically sensitive workload. Pick a table or semantic view that matters, has an owner, has known correctness checks, and can tolerate a controlled pilot.&lt;/p&gt;
&lt;p&gt;For agentic lakehouse, I would write down five things before touching production: the owner, the accepted engines, the policy boundary, the rollback path, and the agent-facing interface. Then I would run the same workflow three ways: manually, through the intended query engine, and through the agent or automation layer. Differences between those paths are where the real work begins.&lt;/p&gt;
&lt;p&gt;Measure boring things. Count files. Count snapshots. Track query planning time. Track storage calls. Track failed commits. Track token issuance. Track denied access. Track whether a human can explain the result without reading tool logs for an hour. These metrics are not glamorous, but they tell you whether the architecture is ready.&lt;/p&gt;
&lt;h2&gt;Final recommendation&lt;/h2&gt;
&lt;p&gt;The right conclusion is not that every team should adopt every June 2026 feature immediately. The right conclusion is that the lakehouse is becoming an execution surface for humans and agents, and that changes the quality bar. Open storage is necessary. Governed catalogs are necessary. Semantic context is necessary. Fast SQL is necessary. Scoped agent tools are necessary.&lt;/p&gt;
&lt;p&gt;That combination is exactly why the Agentic Lakehouse is becoming the right framing. It describes the platform you need when AI agents stop answering isolated questions and start participating in analytical workflows.&lt;/p&gt;
&lt;p&gt;For more background on the lakehouse and AI side of this work, explore my books on data lakehouses and AI at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;. If you want to try this style of governed, open, agent-ready architecture in practice, start a free trial of Dremio&apos;s Agentic Lakehouse at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Field notes for teams evaluating this now&lt;/h2&gt;
&lt;p&gt;First, make compatibility visible. A table-format version, catalog endpoint, and engine release should appear in your runbook. If a production issue happens, nobody should have to guess which engine wrote the latest snapshot or which client introduced a metadata change.&lt;/p&gt;
&lt;p&gt;Second, keep the semantic layer close to the workflow. If the article topic affects analytics agents, customer-facing metrics, financial reporting, or regulated data, raw-table access should be the exception. Certified views should be the normal path.&lt;/p&gt;
&lt;p&gt;Third, separate experimentation from certification. Engineers need sandboxes where they can test new Iceberg features, catalog options, and agent tools. Business users and agents need certified surfaces where definitions, owners, and policies have already been reviewed.&lt;/p&gt;
&lt;p&gt;Fourth, keep the architecture open. Not every byte must move into one platform. An architecture that can query data in place, add semantic context, accelerate common workloads, and expose governed agent interfaces over open data creates more flexibility.&lt;/p&gt;
&lt;p&gt;Fifth, publish the limits. If a feature is read-only in one engine, say so. If write interoperability is approved only for append workloads, say so. If remote signing is required for regulated tables, say so. Clear limits create trust. Hidden limits create incidents.&lt;/p&gt;
&lt;h2&gt;Identity and access review&lt;/h2&gt;
&lt;p&gt;For agentic lakehouse, I would run one full dry run with production-like identities. Use an analyst identity, a service account, and the intended agent identity. Confirm that each identity sees only the expected semantic objects, receives predictable errors, and leaves useful audit records. That test catches policy gaps before they become production incidents.&lt;/p&gt;
&lt;p&gt;The agent identity matters most because it is easy to over-permission during a pilot. If the agent only needs a certified revenue view, do not give it namespace-wide table discovery. If the agent needs row-level access for one geography, test that a second geography returns a denial instead of silent leakage.&lt;/p&gt;
&lt;h2&gt;Documentation that actually helps&lt;/h2&gt;
&lt;p&gt;The documentation should fit on one page. Name the owner, the supported engines, the catalog authority, the accepted table operations, the security model, and the rollback path. If a new engineer cannot understand the contract for agentic lakehouse from that page, the architecture is still too implicit.&lt;/p&gt;
&lt;p&gt;Good documentation is not a wiki dump. It is an operating contract. It should say who can approve a schema change, which engine owns compaction, how long snapshots are retained, and what happens when an agent produces a suspicious result. That level of detail is what turns a promising pattern into a maintainable system.&lt;/p&gt;
&lt;h2&gt;How to keep agents in bounds&lt;/h2&gt;
&lt;p&gt;Agents should not receive broad table access just because a human can ask broad questions. For agentic lakehouse, expose narrow tools over certified views first. Add write-capable tools only after you have validation rules, idempotency keys, approval gates, and audit records that a reviewer can follow.&lt;/p&gt;
&lt;p&gt;The tool description should also be honest. If a tool returns estimated data, say estimated. If a tool excludes delayed transactions, say that. If a tool is read-only, make that clear in the name and policy. Agents work better when the interface gives them fewer chances to infer the wrong contract.&lt;/p&gt;
&lt;h2&gt;What to measure after launch&lt;/h2&gt;
&lt;p&gt;The first production month should be measurement-heavy. Track planning time, query latency, failed commits, denied access attempts, credential issuance, snapshot growth, and semantic-view usage. If agentic lakehouse is helping, the evidence should show up in fewer manual workarounds and clearer operational ownership.&lt;/p&gt;
&lt;p&gt;I would also track human trust signals. Are analysts using the certified view more often? Are engineers filing fewer tickets about unclear table ownership? Are agents producing answers that reviewers can trace back to approved definitions? Those signals tell you whether the architecture is improving daily work, not just passing a benchmark.&lt;/p&gt;
&lt;h2&gt;A buyer question worth asking&lt;/h2&gt;
&lt;p&gt;The buyer question is simple: does this pattern increase choice without weakening governance? For agentic lakehouse, the best answer is specific. It should name the table format, catalog contract, semantic surface, security controls, and engine support matrix. Anything less is a demo, not an operating model.&lt;/p&gt;
&lt;p&gt;This is where the architecture should stay disciplined. The point is not that open architecture is automatically better. The point is that open architecture gives you room to test engines, keep data in place, add semantic context, and still maintain control. That is a stronger argument than a generic platform claim.&lt;/p&gt;
&lt;h2&gt;A realistic rollout sequence&lt;/h2&gt;
&lt;p&gt;The rollout should start with read visibility, then move to operational automation, then consider action loops. For agentic lakehouse, the first milestone is a certified read path with approved semantics. The second milestone is repeatable validation through CI or scheduled checks. The third milestone is agent access with narrow tools and strict audit.&lt;/p&gt;
&lt;p&gt;Write paths should come later unless the topic itself is about write interoperability or table maintenance. Even then, begin with append-only or isolated writes. Updates, deletes, merges, and external actions need stronger controls because they change the state other people depend on.&lt;/p&gt;
&lt;h2&gt;How this should sound to executives&lt;/h2&gt;
&lt;p&gt;The executive version should avoid implementation trivia, but it should not become vague. Say that agentic lakehouse helps the company keep analytical data open, governed, and ready for AI-assisted work. Then say what the team will measure: cost, speed, correctness, access control, and operational effort.&lt;/p&gt;
&lt;p&gt;That framing is useful because executives do not need every catalog detail. They do need to know whether the architecture reduces lock-in, improves reliability, and gives agents a trustworthy data foundation. Those are business outcomes tied to technical choices.&lt;/p&gt;
&lt;h2&gt;How this should sound to engineers&lt;/h2&gt;
&lt;p&gt;The engineering version should be blunt. Which APIs are used? Which engine versions are approved? Which table operations are allowed? Which failures are retried? Which failures stop the workflow? Which logs prove that the right identity performed the right operation?&lt;/p&gt;
&lt;p&gt;For agentic lakehouse, those questions are more valuable than broad claims. They force the team to define the boundary between the open standard, the vendor implementation, the query engine, the semantic model, and the agent tool.&lt;/p&gt;
&lt;h2&gt;What not to automate yet&lt;/h2&gt;
&lt;p&gt;Do not automate the parts of agentic lakehouse that the team cannot explain manually. If nobody can explain the metric, the agent should not calculate it. If nobody can explain rollback, the agent should not write. If nobody can explain the security boundary, the tool should stay internal.&lt;/p&gt;
&lt;p&gt;This is not anti-automation. It is how automation earns trust. Automate the parts with clear contracts first, then widen the scope as evidence accumulates.&lt;/p&gt;
&lt;h2&gt;Source-of-truth ownership&lt;/h2&gt;
&lt;p&gt;Every production rollout needs one named source of truth for each layer. The table has an owner. The catalog has an owner. The semantic view has an owner. The agent tool has an owner. For agentic lakehouse, those owners may sit on different teams, but the contract between them has to be explicit.&lt;/p&gt;
&lt;p&gt;Clear ownership across all layers keeps the architecture credible, whether the governed execution and semantic layer lives in one platform or across several independent services.&lt;/p&gt;
&lt;p&gt;Clear ownership prevents avoidable production confusion.&lt;/p&gt;
&lt;h2&gt;Review cadence&lt;/h2&gt;
&lt;p&gt;Set a review cadence before the first production launch. For agentic lakehouse, I would review the contract after the first week, after the first month, and after the first engine or catalog upgrade. Most problems appear when a workflow that worked in a pilot meets a new version, a new identity, or a new business definition.&lt;/p&gt;
&lt;p&gt;That review should include both platform engineers and business owners. Engineers can verify the mechanics. Business owners can verify that the answers still mean what the company thinks they mean.&lt;/p&gt;
&lt;h2&gt;Launch criteria&lt;/h2&gt;
&lt;p&gt;The launch criteria should be binary. Either agentic lakehouse has a named owner, passing validation checks, approved security boundaries, working rollback, and documented engine support, or it is not ready. Gray areas are acceptable in a research project. They are expensive in production.&lt;/p&gt;
&lt;p&gt;This keeps the article&apos;s recommendation practical: prove the contract first, then widen adoption.&lt;/p&gt;
&lt;h2&gt;Compliance evidence&lt;/h2&gt;
&lt;p&gt;Save the evidence. For agentic lakehouse, keep validation output, approval records, denied-access tests, and rollback proof with the release notes. Future audits are easier when the team can show what it tested before launch.&lt;/p&gt;
&lt;h2&gt;Test cases that matter&lt;/h2&gt;
&lt;p&gt;Use test cases that reflect real business questions. For agentic lakehouse, include at least one happy path, one denied-access path, one stale-data path, and one rollback path. Those tests reveal more than a generic demo query.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Apache Iceberg v4 Roadmap: Adaptive Metadata Trees, Single-File Commits, and the Delta Convergence</title><link>https://iceberglakehouse.com/posts/apache-iceberg-v4-roadmap-adaptive-metadata-delta-convergence/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/apache-iceberg-v4-roadmap-adaptive-metadata-delta-convergence/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-apache-iceberg-v4-roadmap-adapti...</description><pubDate>Mon, 08 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-apache-iceberg-v4-roadmap-adaptive-metadata-delta-convergence/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Apache Iceberg v4 is not a single feature release. It is a set of architectural proposals: adaptive metadata trees, single-file commits, relative table paths, column families, and an extensible statistics model. These proposals rework how Iceberg handles metadata at scale. Separately, Databricks has proposed that &lt;strong&gt;Delta Lake 5.0 adopt the same metadata structure&lt;/strong&gt;, which would end the decade-long schism between the two formats at the metadata level. This article walks through every proposal, the pain points each one solves, the community debates still unresolved, and what teams should do while the spec is still under discussion.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/apache-iceberg-v4-roadmap-adaptive-metadata-delta-convergence-diagram-1.png&quot; alt=&quot;Iceberg v4 metadata architecture evolution&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The Problem Iceberg v4 Is Trying to Solve&lt;/h2&gt;
&lt;p&gt;Iceberg&apos;s current metadata tree was designed for batch workloads. A table with one million files, multiple partition evolutions, and hundreds of concurrent writers exposes three structural limitations that v4 proposals target.&lt;/p&gt;
&lt;h3&gt;Metadata Write Amplification&lt;/h3&gt;
&lt;p&gt;Every Iceberg commit creates a new metadata file. For a table with thousands of manifests, a single new data file can trigger a commit that rewrites the manifest list and metadata JSON. Under high-frequency writes, like streaming ingestion, CDC pipelines, or agent-generated updates, that amplification makes sub-second commits difficult. The Snowflake engineering team at Iceberg Summit 2026 described the constraint directly: &amp;quot;Iceberg&apos;s metadata tree was built for batch workloads, and its write amplification creates commit latencies that streaming can&apos;t tolerate.&amp;quot;&lt;/p&gt;
&lt;h3&gt;Metadata Planning Overhead&lt;/h3&gt;
&lt;p&gt;Reading a table&apos;s current state requires traversing the metadata file → manifest list → manifest files chain. For tables with hundreds of manifests, the planning step can dominate query time even before reading data. Adaptive metadata trees aim to make this traversal O(1) by inlining partition-level statistics directly into a single metadata structure, eliminating manifest indirection for queries that scan a narrow partition range.&lt;/p&gt;
&lt;h3&gt;Format Fragmentation&lt;/h3&gt;
&lt;p&gt;Iceberg and Delta Lake have converged on similar ideas, like columnar metadata, deletion vectors, and manifest-like tracking, but maintain separate metadata formats. Teams running both formats incur duplicate maintenance tooling, incompatible catalogs, and two sets of operational procedures. Databricks publicly stated at its May 2026 press cycle: &amp;quot;Iceberg v4 and Delta 5.0 will converge on a unified metadata structure, ending the tradeoff between interoperability and production-ready performance.&amp;quot;&lt;/p&gt;
&lt;h2&gt;Adaptive Metadata Trees: The Core of v4&lt;/h2&gt;
&lt;p&gt;The adaptive metadata tree is the centerpiece of the v4 proposals. It restructures Iceberg&apos;s metadata from a multi-level manifest hierarchy into a flatter, tree-structured model where metadata nodes can be split, merged, or relocated dynamically based on table shape and access patterns.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/apache-iceberg-v4-roadmap-adaptive-metadata-delta-convergence-diagram-2.png&quot; alt=&quot;Adaptive metadata tree structure&quot;&gt;&lt;/p&gt;
&lt;h3&gt;How It Works&lt;/h3&gt;
&lt;p&gt;In the current Iceberg v3 model, metadata is organized as:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;metadata.json → manifest-list.avro → [manifest-1.avro, manifest-2.avro, ...] → [data-files.parquet]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each layer is a separate file. Reading partition statistics for query planning requires traversing the manifest list and opening each manifest.&lt;/p&gt;
&lt;p&gt;In the v4 adaptive tree model, the structure becomes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;root-metadata.parquet → [metadata-node-1.parquet, metadata-node-2.parquet, ...] → [data-files.parquet]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key differences:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Inlined partition statistics.&lt;/strong&gt; Each metadata node contains the column-level min/max/null statistics for the data files it tracks, stored as columnar Parquet data. The query planner can read a single node and determine whether to scan its data files without opening additional manifest files.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dynamic node splitting.&lt;/strong&gt; As a table grows, nodes can be split by partition range, column range, or file count. This is the &amp;quot;adaptive&amp;quot; property: the tree reorganizes itself based on what the workload needs, rather than requiring manual partition design or compaction.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Single-file commits.&lt;/strong&gt; Instead of writing a metadata JSON + manifest list + manifest files for every commit, v4 proposes writing a single Parquet metadata node that contains all the changes. The Snowflake Iceberg Summit recap (June 2026) confirmed: &amp;quot;V4&apos;s adaptive metadata trees introduce one-file commits which enable low-latency writes without sacrificing read performance on large tables.&amp;quot;&lt;/p&gt;
&lt;h3&gt;What Single-File Commits Mean for Streaming&lt;/h3&gt;
&lt;p&gt;For streaming workloads, the commit latency improvement is the headline benefit. A streaming pipeline writing one file per minute currently creates one full metadata commit cycle per file. With single-file commits, the metadata update is a small Parquet node write followed by an atomic pointer swap in the catalog. The total metadata I/O per commit drops from O(number-of-manifests) to O(1).&lt;/p&gt;
&lt;p&gt;This matters for real-time analytics, CDC pipelines, and any workload where agents or automated processes write data at sub-minute cadence. Teams that previously batch-delayed streaming writes to avoid metadata overhead can now commit each micro-batch independently.&lt;/p&gt;
&lt;h3&gt;Tradeoffs Under Discussion&lt;/h3&gt;
&lt;p&gt;The adaptive tree is not universally accepted. The Apache Iceberg community mailing list has active debates about:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Parquet for metadata.&lt;/strong&gt; Replacing Avro manifests with Parquet metadata nodes changes the memory profile of planning. Parquet is columnar and read-optimized, but reading a single metadata node&apos;s partition stats requires materializing the entire row group. For tables with thousands of columns in the metadata node, this could increase planning memory usage.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tree depth tuning.&lt;/strong&gt; If split thresholds are set too aggressively, a table could end up with hundreds of metadata nodes, recreating the same indirection the tree was meant to solve. The adaptive split algorithm needs sensible defaults and operator overrides.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Backward compatibility with v3 manifests.&lt;/strong&gt; The proposals need a migration path where existing v3 tables can be gradually upgraded without rewriting all metadata at once. The current design discussion favors a &amp;quot;hybrid mode&amp;quot; where the root metadata references both legacy manifest lists and new adaptive nodes, with a background compaction job converting manifests to nodes over time.&lt;/p&gt;
&lt;h2&gt;Relative Paths: Making Tables Portable&lt;/h2&gt;
&lt;p&gt;A smaller but operationally significant v4 proposal is &lt;strong&gt;relative path support&lt;/strong&gt;. Currently, Iceberg stores absolute file paths in manifest entries:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3://my-bucket/prod/warehouse/orders/partition_date=2026-06-01/00001.parquet
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates friction for migration, disaster recovery, and cloud replication. Moving a table to a different bucket or region requires rewriting every manifest file with new paths.&lt;/p&gt;
&lt;h3&gt;The Proposed Solution&lt;/h3&gt;
&lt;p&gt;Store file references relative to the table root:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./partition_date=2026-06-01/00001.parquet
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The catalog pointer or table property provides the absolute base path. When the table is cloned, mirrored, or failed over, only the base path changes: the metadata stays valid.&lt;/p&gt;
&lt;p&gt;This is straightforward for new tables. For existing tables, the migration requires either a one-time metadata rewrite (accepted by the community as an operational cost) or a compatibility mode where the metadata stores both an absolute and relative path for each file entry.&lt;/p&gt;
&lt;h3&gt;Why This Matters for Multi-Cloud&lt;/h3&gt;
&lt;p&gt;Teams running Iceberg across AWS and GCS, or replicating tables for disaster recovery, currently maintain separate metadata copies per location. Relative paths eliminate the need for metadata duplication. A table can be copied from us-east-1 to eu-west-2 by copying the data files and updating one base path property, without touching metadata.&lt;/p&gt;
&lt;p&gt;The Databricks engineering team has indicated that Delta 5.0 will adopt the same relative path convention, ensuring that converged metadata trees are portable across clouds regardless of which format&apos;s ecosystem they originated in.&lt;/p&gt;
&lt;h2&gt;Column Families: Solving the Wide-Table Problem&lt;/h2&gt;
&lt;p&gt;Machine learning feature engineering produces tables with thousands of columns. The Snowflake Iceberg Summit recap (June 2026) called this out directly: &amp;quot;ML feature engineering produces tables with thousands of columns, and today&apos;s layout forces full file rewrites for even small updates.&amp;quot;&lt;/p&gt;
&lt;h3&gt;How Column Families Work&lt;/h3&gt;
&lt;p&gt;Column families let the table author group columns into independently stored and versioned sets:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- schema definition
CREATE TABLE features (
  uuid STRING,
  -- core columns
  created_at TIMESTAMP,
  -- feature group A: freshness indicators
  family freshness (
    days_since_purchase INT,
    recency_score FLOAT,
    avg_visit_interval FLOAT
  ),
  -- feature group B: behavioral features (refreshed separately)
  family behavioral (
    lifetime_value FLOAT,
    churn_probability FLOAT,
    category_affinity MAP&amp;lt;STRING, FLOAT&amp;gt;
  ),
  -- feature group C: real-time signals (updated every minute)
  family realtime (
    session_active BOOLEAN,
    current_cart_value FLOAT,
    page_velocity INT
  )
) USING iceberg;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each family can be committed independently. Adding a new feature to the behavioral family rewrites only the behavioral Parquet files, not the entire table. Backfilling a new column into a family compacts the affected family&apos;s files without touching the rest of the table.&lt;/p&gt;
&lt;h3&gt;Impact on ML Pipelines&lt;/h3&gt;
&lt;p&gt;Feature stores that manage hundreds of features across training and serving pipelines benefit directly. A feature team can add, modify, or retire features within a family without coordinating compaction windows with other teams. Training pipelines that only read the freshness and behavioral families can skip scanning the realtime family entirely, reducing I/O.&lt;/p&gt;
&lt;p&gt;The column families proposal is further along than the adaptive metadata tree: it has a more complete design document and several community members have expressed intent to implement it once the spec draft is published.&lt;/p&gt;
&lt;h2&gt;Extensible Column Statistics: Making Planning Smarter&lt;/h2&gt;
&lt;p&gt;Iceberg&apos;s current per-column statistics model stores min, max, and null count for each column in each manifest entry. V4 proposes rebuilding this model to support pluggable statistics types:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Approximate distinct counts&lt;/strong&gt; (HyperLogLog sketches) for optimizer cardinality estimates&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bloom filters&lt;/strong&gt; for point-lookup pruning on high-cardinality columns&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Histogram bins&lt;/strong&gt; for range-aware predicate evaluation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vector embeddings&lt;/strong&gt; for ANN-style similarity search over embedding columns&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The extensible model would let engines register new statistic types that the metadata layer stores and serves during planning. An engine that supports ANN search could register an &lt;code&gt;embedding_summary&lt;/code&gt; statistic that indexes the vector space of an embedding column; the metadata layer would store and return the index structure as part of scan planning.&lt;/p&gt;
&lt;p&gt;This is the most speculative proposal in v4. The core Iceberg committers have asked for production benchmarks showing that the current statistics model is a bottleneck before accepting the complexity of a pluggable system.&lt;/p&gt;
&lt;h2&gt;The Delta 5.0 Convergence&lt;/h2&gt;
&lt;p&gt;The most strategically significant development around Iceberg v4 is not a technical proposal but a competitive alignment. Databricks announced that &lt;strong&gt;Delta Lake 5.0 will adopt the same adaptive metadata tree structure that Iceberg v4 proposes&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;What Convergence Actually Means&lt;/h3&gt;
&lt;p&gt;The formats will remain independent: they will not merge into one spec. Each will keep its own commit protocol, catalog integration, and engine-specific optimizations. But the metadata storage layer will be compatible. A metadata node written by Delta 5.0 can be read by an Iceberg v4 client, and vice versa.&lt;/p&gt;
&lt;p&gt;This eliminates the metadata-level incompatibility that currently forces teams to choose between Iceberg&apos;s broader engine ecosystem and Delta&apos;s tighter performance optimization. A table stored in Iceberg format under a Unity Catalog can have its metadata nodes read and understood by Snowflake&apos;s Iceberg client, Trino, DuckDB, and any other engine that implements the v4 metadata reader.&lt;/p&gt;
&lt;h3&gt;What Remains Different&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Commit protocol.&lt;/strong&gt; Iceberg uses optimistic concurrency with retry; Delta uses a transaction log in the storage layer. These are philosophically different approaches that neither side has proposed converging.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Catalog model.&lt;/strong&gt; Iceberg separates catalog from table format via the REST catalog specification. Delta ties the catalog and format more closely through Unity Catalog&apos;s managed tables. The convergence applies to the metadata file format, not the catalog layer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Engine-specific features.&lt;/strong&gt; Liquid clustering, predictive optimization, and materialized views remain Databricks-specific. The adaptive metadata tree gives engines a compatible foundation to read the same metadata; it does not require them to support the same query execution features.&lt;/p&gt;
&lt;h3&gt;Where the Catalog Debate Goes from Here&lt;/h3&gt;
&lt;p&gt;Nidhi Vichare&apos;s &lt;em&gt;Catalog Wars&lt;/em&gt; series (June 2026) makes the point succinctly: &amp;quot;The format question is settled. The catalog question is the one that will define your next decade of optionality.&amp;quot; With the metadata layer converging, competitive differentiation moves to catalogs, governance, semantic layers, and agent interfaces.&lt;/p&gt;
&lt;p&gt;This shift is already visible. Apache Polaris reached top-level project status in February 2026 and v1.4 added storage-scoped AWS credentials, STS session tags, and CockroachDB backend support. Unity Catalog added cross-engine ABAC (row filters and column masks enforced via REST scan planning, working with Spark and DuckDB). The catalog, not the table format, is becoming the control plane for multi-engine governance.&lt;/p&gt;
&lt;h2&gt;Practical Guidance for Teams Evaluating Iceberg v4&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/apache-iceberg-v4-roadmap-adaptive-metadata-delta-convergence-diagram-3.png&quot; alt=&quot;Implementation decision tree&quot;&gt;&lt;/p&gt;
&lt;h3&gt;What to Do Now&lt;/h3&gt;
&lt;p&gt;Iceberg v4 is in the proposal phase. No specification draft has been published. No engine supports any v4 feature in production. The timeline from the community discussions suggests a spec draft by late 2026, an experimental implementation by mid-2027, and production availability by late 2027 at the earliest.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;For new tables beginning in mid-2026:&lt;/strong&gt; Design partitioning and file layout with the understanding that v4 metadata migration will be one-directional. Avoid metadata structures that are difficult to convert: tables with extremely deep manifest trees, custom clustering that produces thousands of manifests per partition, or manual partitioning schemes that overlap with v4&apos;s adaptive split algorithm. None of these will break under v3, but they will make the v4 upgrade path more expensive.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;For streaming and high-frequency write workloads:&lt;/strong&gt; If your pipeline currently batches writes to avoid metadata overhead, the v4 single-file commit proposal directly addresses that constraint. Design your pipeline to produce independent files per micro-batch today; the metadata commit improvement is a format-layer change that your pipeline can adopt without restructuring.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;For ML feature tables:&lt;/strong&gt; The column families proposal is close to spec-ready. If your feature engineering pipeline produces tables with hundreds of columns refreshed on different schedules, begin documenting column groups now. The v4 migration will be easier if you already know which columns belong together.&lt;/p&gt;
&lt;h3&gt;What to Watch&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The Apache Iceberg mailing list&lt;/strong&gt; for the v4 spec draft publication. Subscribe to the &lt;code&gt;dev@iceberg.apache.org&lt;/code&gt; list and watch for threads with &amp;quot;[V4]&amp;quot; in the subject.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Iceberg Summit 2026 session recordings&lt;/strong&gt;, particularly &amp;quot;Breaking the Mold: Re-thinking Iceberg Metadata Structure in V4&amp;quot; (&lt;a href=&quot;https://youtu.be/ymUCDJV19tE&quot;&gt;watch&lt;/a&gt;) and the closing panel on ecosystem innovations (&lt;a href=&quot;https://youtu.be/szWvGm5busw&quot;&gt;watch&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Databricks Data + AI Summit 2026&lt;/strong&gt; session &amp;quot;Delta + Iceberg, Better Together&amp;quot; for the Delta 5.0 convergence details and timeline.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proposed v4 features per the community roadmap&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Adaptive metadata tree: spec draft targeted late 2026&lt;/li&gt;
&lt;li&gt;Single-file commits: bundled with adaptive tree&lt;/li&gt;
&lt;li&gt;Relative paths: independent proposal, could ship earlier&lt;/li&gt;
&lt;li&gt;Column families: could ship as a v3.x extension before v4&lt;/li&gt;
&lt;li&gt;Extensible statistics: earliest viable Q1 2027&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Where Dremio Fits&lt;/h3&gt;
&lt;p&gt;Dremio&apos;s architecture sits above the table format layer. Dremio queries Iceberg tables through the REST catalog protocol, applies semantic views on top of raw table schemas, and accelerates queries with Reflections and its columnar cloud cache. The Iceberg v4 changes benefit Dremio users because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Faster planning on large tables.&lt;/strong&gt; Adaptive metadata trees reduce the metadata traversal cost for tables with hundreds of manifests, which directly improves query planning latency on Dremio&apos;s execution engine.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cleaner replication and migration.&lt;/strong&gt; Relative paths simplify the process of pointing Dremio at a migrated or replicated Iceberg table without rewriting metadata paths.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Better performance on ML and streaming workloads.&lt;/strong&gt; Column families and single-file commits make the underlying Iceberg tables more efficient for the types of workloads Dremio&apos;s semantic layer and agent interfaces are designed to query.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No catalog lock-in.&lt;/strong&gt; Dremio&apos;s REST catalog support means it can point at any v4-compatible catalog, like Polaris, Unity Catalog, Snowflake Horizon, or Nessie, and query the same adaptive metadata trees without platform-specific configuration.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Dremio MCP server and AI Agent can take advantage of faster metadata planning to provide sub-second responses on agentic queries against large Iceberg tables, and the convergence between Iceberg v4 and Delta 5.0 means more of your organization&apos;s data, regardless of which format it was originally written in, becomes accessible through Dremio&apos;s governed semantic layer.&lt;/p&gt;
&lt;h2&gt;Bottom Line&lt;/h2&gt;
&lt;p&gt;Iceberg v4 reworks the metadata architecture that has served the format since 2017. Adaptive metadata trees replace manifest-based indirection with a flatter, columnar structure that reduces commit latency and planning overhead. Relative paths make tables portable across clouds. Column families tackle the ML wide-table problem head-on. And the Delta 5.0 convergence, if it ships as proposed, closes the metadata-level gap between the two formats, shifting competitive differentiation to catalogs, governance, and semantic layers.&lt;/p&gt;
&lt;p&gt;For teams planning their 2026–2027 data architecture, the right approach is to view v4 as a direction, not a deliverable. Design for the principles v4 embodies: scalable metadata, portable tables, column-aligned storage, and format-neutral interoperability, without depending on any v4 proposal that has not shipped. The Iceberg community&apos;s strongest asset is its track record of shipping spec changes in collaboration with dozens of engine and platform vendors. The v4 roadmap continues that tradition with its most ambitious set of architectural proposals yet.&lt;/p&gt;
&lt;p&gt;For more detail on the Iceberg Summit 2026 announcements and the full session library, visit the &lt;a href=&quot;https://youtube.com/playlist?list=PLkifVhhWtccxSA6VskdKdLnIwCJevOqFL&quot;&gt;Iceberg Summit 2026 YouTube Playlist&lt;/a&gt;. To try Iceberg querying with Dremio&apos;s semantic layer and agent interfaces, start a free trial at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Lakehouse Context Layers with Atlan and Iceberg v3</title><link>https://iceberglakehouse.com/posts/atlan-snowflake-iceberg-v3-context-layer/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/atlan-snowflake-iceberg-v3-context-layer/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-atlan-snowflake-iceberg-v3-conte...</description><pubDate>Mon, 08 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-atlan-snowflake-iceberg-v3-context-layer/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The context layer explains what lakehouse data means, which is the part table formats do not solve alone. That is the useful lens for lakehouse context layer in June 2026. The market is not short on announcements. What matters is whether the new pattern changes ownership, performance, governance, and agent readiness in a way your team can operate.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/atlan-snowflake-iceberg-v3-context-layer-diagram-1.png&quot; alt=&quot;lakehouse context layer architecture diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The market signal behind lakehouse context layer&lt;/h2&gt;
&lt;p&gt;Iceberg v3 improves the raw table substrate with lineage-oriented features, but AI agents and business users still need owners, definitions, classifications, quality signals, and approved metric logic.&lt;/p&gt;
&lt;p&gt;I care about this topic because it sits at the boundary between open data architecture and AI execution. Most companies are not choosing one engine for every workload anymore. They have warehouses, lakehouse engines, streaming systems, catalogs, metadata platforms, and now agents that ask for data through tools. The shared contract between those systems matters more than any single feature checkbox.&lt;/p&gt;
&lt;p&gt;The vendor-neutral reading is straightforward. If the underlying table and catalog standards get stronger, buyers get more freedom to choose the right engine for each job. Snowflake, Microsoft, ClickHouse, Atlan, Dremio, and the open-source Iceberg ecosystem all point to the same market reality: data platforms are becoming multi-engine and agent-facing.&lt;/p&gt;
&lt;h2&gt;How the architecture works&lt;/h2&gt;
&lt;p&gt;A metadata platform maps tables, columns, owners, lineage, classifications, and usage signals into a searchable context layer.&lt;/p&gt;
&lt;p&gt;Snowflake Horizon and Polaris-style catalogs make Iceberg metadata more visible to external tools.&lt;/p&gt;
&lt;p&gt;Atlan-style integrations can enrich that metadata with business meaning, glossary terms, and stewardship workflows.&lt;/p&gt;
&lt;p&gt;The important architectural habit is to separate responsibilities. The table format manages files, snapshots, schema evolution, and table metadata. The catalog manages identity, namespaces, commits, and access patterns. The query engine plans and executes work. The semantic layer maps raw data into business meaning. The agent interface decides which safe tools a model can call.&lt;/p&gt;
&lt;p&gt;That separation keeps the system honest. If a vendor says a workload is open, ask which layer is open. If a feature supports Iceberg, ask which Iceberg version, which operations, and which engines. If an agent can query data, ask whether it is querying raw tables or certified semantic views.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/atlan-snowflake-iceberg-v3-context-layer-diagram-2.png&quot; alt=&quot;Operating model diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;A concrete operating example&lt;/h2&gt;
&lt;p&gt;A column named &lt;code&gt;arr&lt;/code&gt; may mean annual recurring revenue, adjusted response rate, or account risk rating. The table format can store the column. The context layer tells an agent which definition applies.&lt;/p&gt;
&lt;p&gt;That example is intentionally operational. Architecture diagrams are useful, but the design only proves itself when a real workload runs through it. I want to know who owns the table, which catalog authorizes the operation, which engine writes, which engine reads, which semantic view users see, and how the team detects a bad result.&lt;/p&gt;
&lt;p&gt;For agentic analytics, the same example gets stricter. A human analyst can notice ambiguity and ask a teammate. An agent will often keep going unless the tool interface stops it. That means your architecture needs approved definitions, scoped access, query limits, logging, and a clean rollback path before it needs a flashy chat experience.&lt;/p&gt;
&lt;p&gt;This is why I do not treat open table formats as the whole story. Apache Iceberg gives the platform a strong storage contract. It does not, by itself, define customer lifetime value, revenue recognition rules, data owner approval, or what an AI agent may do after it finds an anomaly. Those rules belong in catalogs, semantic layers, governance systems, and agent tools.&lt;/p&gt;
&lt;h2&gt;What this means for the lakehouse&lt;/h2&gt;
&lt;p&gt;A lakehouse platform needs five capabilities to serve agents reliably: query federation to reduce data movement; autonomous performance using Reflections, caching, and table optimization so interactive loops stay fast; an AI Semantic Layer that gives agents approved business context; agentic interfaces through the UI, Python, or MCP-connected tools; and AI SQL functions that bring model-assisted work into SQL without exporting data.&lt;/p&gt;
&lt;h2&gt;Implementation checklist&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;What to document&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Table contract&lt;/td&gt;
&lt;td&gt;Format version, schema rules, snapshot policy, and rollback plan&lt;/td&gt;
&lt;td&gt;Engines need the same understanding of the table.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalog authority&lt;/td&gt;
&lt;td&gt;Production catalog, namespaces, commit rules, and role model&lt;/td&gt;
&lt;td&gt;Multi-engine systems need one source of table truth.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine matrix&lt;/td&gt;
&lt;td&gt;Read, write, merge, delete, schema, and view support by engine&lt;/td&gt;
&lt;td&gt;A feature is not production-ready until the exact operation is tested.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic layer&lt;/td&gt;
&lt;td&gt;Certified views, metric definitions, owners, and labels&lt;/td&gt;
&lt;td&gt;Agents need business meaning, not raw schemas alone.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Credential model, token lifetime, row filters, column masks, and audit logs&lt;/td&gt;
&lt;td&gt;Open access still needs strict governance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations&lt;/td&gt;
&lt;td&gt;Compaction, vacuum, retries, alerting, and incident ownership&lt;/td&gt;
&lt;td&gt;The design must survive failed jobs and bad deploys.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;My practical checklist for this topic is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Prioritize glossary terms for metrics agents actually use.&lt;/li&gt;
&lt;li&gt;Map column-level lineage for sensitive and executive-facing datasets first.&lt;/li&gt;
&lt;li&gt;Mark certified views clearly and hide experimental views from agent tools.&lt;/li&gt;
&lt;li&gt;Review AI-generated descriptions before they enter the trusted catalog.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those items are not written down, the project is still in the demo stage. That does not mean the idea is weak. It means the operating model is not finished.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/atlan-snowflake-iceberg-v3-context-layer-diagram-3.png&quot; alt=&quot;Implementation checklist diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Failure modes worth respecting&lt;/h2&gt;
&lt;p&gt;Auto-generated metadata can be wrong with confidence. Human approval, ownership, and periodic review still matter. A stale glossary can mislead an agent as badly as a missing glossary.&lt;/p&gt;
&lt;p&gt;The other failure mode is semantic drift. A table can be technically valid while the business definition on top of it changes quietly. That is where many AI analytics projects fail. The model generates SQL against a table that exists, the query returns rows, and the answer looks plausible. The problem is that the answer used the wrong grain, the wrong filter, or the wrong metric definition.&lt;/p&gt;
&lt;p&gt;The fix is not a longer prompt. The fix is stronger data contracts. Certified semantic views should be easier for agents to use than raw tables. Sensitive columns should be masked or hidden before the model can ask for them. Write-capable tools should require intent, validation, and idempotency. Expensive queries should have limits. Every tool call should leave evidence.&lt;/p&gt;
&lt;p&gt;This is also where vendor-neutral thinking helps. Do not trust a platform because it has the best demo. Trust the platform when it gives you clear contracts between storage, catalog, semantic layer, engine, and agent. Trust it more when you can test those contracts with another engine or another client.&lt;/p&gt;
&lt;h2&gt;What I would do first&lt;/h2&gt;
&lt;p&gt;Start with one production-shaped workflow. Do not start with the easiest toy table, and do not start with the most politically sensitive workload. Pick a table or semantic view that matters, has an owner, has known correctness checks, and can tolerate a controlled pilot.&lt;/p&gt;
&lt;p&gt;For lakehouse context layer, I would write down five things before touching production: the owner, the accepted engines, the policy boundary, the rollback path, and the agent-facing interface. Then I would run the same workflow three ways: manually, through the intended query engine, and through the agent or automation layer. Differences between those paths are where the real work begins.&lt;/p&gt;
&lt;p&gt;Measure boring things. Count files. Count snapshots. Track query planning time. Track storage calls. Track failed commits. Track token issuance. Track denied access. Track whether a human can explain the result without reading tool logs for an hour. These metrics are not glamorous, but they tell you whether the architecture is ready.&lt;/p&gt;
&lt;h2&gt;Final recommendation&lt;/h2&gt;
&lt;p&gt;The right conclusion is not that every team should adopt every June 2026 feature immediately. The right conclusion is that the lakehouse is becoming an execution surface for humans and agents, and that changes the quality bar. Open storage is necessary. Governed catalogs are necessary. Semantic context is necessary. Fast SQL is necessary. Scoped agent tools are necessary.&lt;/p&gt;
&lt;p&gt;That combination is exactly why the Agentic Lakehouse is becoming the right framing. It describes the platform you need when AI agents stop answering isolated questions and start participating in analytical workflows.&lt;/p&gt;
&lt;p&gt;For more background on the lakehouse and AI side of this work, explore my books on data lakehouses and AI at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;. If you want to try this style of governed, open, agent-ready architecture in practice, start a free trial of Dremio&apos;s Agentic Lakehouse at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Field notes for teams evaluating this now&lt;/h2&gt;
&lt;p&gt;First, make compatibility visible. A table-format version, catalog endpoint, and engine release should appear in your runbook. If a production issue happens, nobody should have to guess which engine wrote the latest snapshot or which client introduced a metadata change.&lt;/p&gt;
&lt;p&gt;Second, keep the semantic layer close to the workflow. If the article topic affects analytics agents, customer-facing metrics, financial reporting, or regulated data, raw-table access should be the exception. Certified views should be the normal path.&lt;/p&gt;
&lt;p&gt;Third, separate experimentation from certification. Engineers need sandboxes where they can test new Iceberg features, catalog options, and agent tools. Business users and agents need certified surfaces where definitions, owners, and policies have already been reviewed.&lt;/p&gt;
&lt;p&gt;Fourth, keep the architecture open. Not every byte must move into one platform. An architecture that can query data in place, add semantic context, accelerate common workloads, and expose governed agent interfaces over open data creates more flexibility.&lt;/p&gt;
&lt;p&gt;Fifth, publish the limits. If a feature is read-only in one engine, say so. If write interoperability is approved only for append workloads, say so. If remote signing is required for regulated tables, say so. Clear limits create trust. Hidden limits create incidents.&lt;/p&gt;
&lt;h2&gt;Identity and access review&lt;/h2&gt;
&lt;p&gt;For lakehouse context layer, I would run one full dry run with production-like identities. Use an analyst identity, a service account, and the intended agent identity. Confirm that each identity sees only the expected semantic objects, receives predictable errors, and leaves useful audit records. That test catches policy gaps before they become production incidents.&lt;/p&gt;
&lt;p&gt;The agent identity matters most because it is easy to over-permission during a pilot. If the agent only needs a certified revenue view, do not give it namespace-wide table discovery. If the agent needs row-level access for one geography, test that a second geography returns a denial instead of silent leakage.&lt;/p&gt;
&lt;h2&gt;Documentation that actually helps&lt;/h2&gt;
&lt;p&gt;The documentation should fit on one page. Name the owner, the supported engines, the catalog authority, the accepted table operations, the security model, and the rollback path. If a new engineer cannot understand the contract for lakehouse context layer from that page, the architecture is still too implicit.&lt;/p&gt;
&lt;p&gt;Good documentation is not a wiki dump. It is an operating contract. It should say who can approve a schema change, which engine owns compaction, how long snapshots are retained, and what happens when an agent produces a suspicious result. That level of detail is what turns a promising pattern into a maintainable system.&lt;/p&gt;
&lt;h2&gt;How to keep agents in bounds&lt;/h2&gt;
&lt;p&gt;Agents should not receive broad table access just because a human can ask broad questions. For lakehouse context layer, expose narrow tools over certified views first. Add write-capable tools only after you have validation rules, idempotency keys, approval gates, and audit records that a reviewer can follow.&lt;/p&gt;
&lt;p&gt;The tool description should also be honest. If a tool returns estimated data, say estimated. If a tool excludes delayed transactions, say that. If a tool is read-only, make that clear in the name and policy. Agents work better when the interface gives them fewer chances to infer the wrong contract.&lt;/p&gt;
&lt;h2&gt;What to measure after launch&lt;/h2&gt;
&lt;p&gt;The first production month should be measurement-heavy. Track planning time, query latency, failed commits, denied access attempts, credential issuance, snapshot growth, and semantic-view usage. If lakehouse context layer is helping, the evidence should show up in fewer manual workarounds and clearer operational ownership.&lt;/p&gt;
&lt;p&gt;I would also track human trust signals. Are analysts using the certified view more often? Are engineers filing fewer tickets about unclear table ownership? Are agents producing answers that reviewers can trace back to approved definitions? Those signals tell you whether the architecture is improving daily work, not just passing a benchmark.&lt;/p&gt;
&lt;h2&gt;A buyer question worth asking&lt;/h2&gt;
&lt;p&gt;The buyer question is simple: does this pattern increase choice without weakening governance? For lakehouse context layer, the best answer is specific. It should name the table format, catalog contract, semantic surface, security controls, and engine support matrix. Anything less is a demo, not an operating model.&lt;/p&gt;
&lt;p&gt;This is where the architecture should stay disciplined. The point is not that open architecture is automatically better. The point is that open architecture gives you room to test engines, keep data in place, add semantic context, and still maintain control. That is a stronger argument than a generic platform claim.&lt;/p&gt;
&lt;h2&gt;A realistic rollout sequence&lt;/h2&gt;
&lt;p&gt;The rollout should start with read visibility, then move to operational automation, then consider action loops. For lakehouse context layer, the first milestone is a certified read path with approved semantics. The second milestone is repeatable validation through CI or scheduled checks. The third milestone is agent access with narrow tools and strict audit.&lt;/p&gt;
&lt;p&gt;Write paths should come later unless the topic itself is about write interoperability or table maintenance. Even then, begin with append-only or isolated writes. Updates, deletes, merges, and external actions need stronger controls because they change the state other people depend on.&lt;/p&gt;
&lt;h2&gt;How this should sound to executives&lt;/h2&gt;
&lt;p&gt;The executive version should avoid implementation trivia, but it should not become vague. Say that lakehouse context layer helps the company keep analytical data open, governed, and ready for AI-assisted work. Then say what the team will measure: cost, speed, correctness, access control, and operational effort.&lt;/p&gt;
&lt;p&gt;That framing is useful because executives do not need every catalog detail. They do need to know whether the architecture reduces lock-in, improves reliability, and gives agents a trustworthy data foundation. Those are business outcomes tied to technical choices.&lt;/p&gt;
&lt;h2&gt;How this should sound to engineers&lt;/h2&gt;
&lt;p&gt;The engineering version should be blunt. Which APIs are used? Which engine versions are approved? Which table operations are allowed? Which failures are retried? Which failures stop the workflow? Which logs prove that the right identity performed the right operation?&lt;/p&gt;
&lt;p&gt;For lakehouse context layer, those questions are more valuable than broad claims. They force the team to define the boundary between the open standard, the vendor implementation, the query engine, the semantic model, and the agent tool.&lt;/p&gt;
&lt;h2&gt;What not to automate yet&lt;/h2&gt;
&lt;p&gt;Do not automate the parts of lakehouse context layer that the team cannot explain manually. If nobody can explain the metric, the agent should not calculate it. If nobody can explain rollback, the agent should not write. If nobody can explain the security boundary, the tool should stay internal.&lt;/p&gt;
&lt;p&gt;This is not anti-automation. It is how automation earns trust. Automate the parts with clear contracts first, then widen the scope as evidence accumulates.&lt;/p&gt;
&lt;h2&gt;Source-of-truth ownership&lt;/h2&gt;
&lt;p&gt;Every production rollout needs one named source of truth for each layer. The table has an owner. The catalog has an owner. The semantic view has an owner. The agent tool has an owner. For lakehouse context layer, those owners may sit on different teams, but the contract between them has to be explicit.&lt;/p&gt;
&lt;p&gt;Clear ownership across all layers keeps the architecture credible, whether the governed execution and semantic layer lives in one platform or across several independent services.&lt;/p&gt;
&lt;p&gt;Clear ownership prevents avoidable production confusion.&lt;/p&gt;
&lt;h2&gt;Review cadence&lt;/h2&gt;
&lt;p&gt;Set a review cadence before the first production launch. For lakehouse context layer, I would review the contract after the first week, after the first month, and after the first engine or catalog upgrade. Most problems appear when a workflow that worked in a pilot meets a new version, a new identity, or a new business definition.&lt;/p&gt;
&lt;p&gt;That review should include both platform engineers and business owners. Engineers can verify the mechanics. Business owners can verify that the answers still mean what the company thinks they mean.&lt;/p&gt;
&lt;h2&gt;Launch criteria&lt;/h2&gt;
&lt;p&gt;The launch criteria should be binary. Either lakehouse context layer has a named owner, passing validation checks, approved security boundaries, working rollback, and documented engine support, or it is not ready. Gray areas are acceptable in a research project. They are expensive in production.&lt;/p&gt;
&lt;p&gt;This keeps the article&apos;s recommendation practical: prove the contract first, then widen adoption.&lt;/p&gt;
&lt;h2&gt;Compliance evidence&lt;/h2&gt;
&lt;p&gt;Save the evidence. For lakehouse context layer, keep validation output, approval records, denied-access tests, and rollback proof with the release notes. Future audits are easier when the team can show what it tested before launch.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Real-Time Agentic Analytics with ClickHouse</title><link>https://iceberglakehouse.com/posts/clickhouse-real-time-agentic-analytics-event-loops/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/clickhouse-real-time-agentic-analytics-event-loops/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-clickhouse-real-time-agentic-ana...</description><pubDate>Mon, 08 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-clickhouse-real-time-agentic-analytics-event-loops/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Real-time agents need analytical systems that can answer while an event still matters. That is the useful lens for real-time agentic analytics in June 2026. The market is not short on announcements. What matters is whether the new pattern changes ownership, performance, governance, and agent readiness in a way your team can operate.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/clickhouse-real-time-agentic-analytics-event-loops-diagram-1.png&quot; alt=&quot;real-time agentic analytics architecture diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The market signal behind real-time agentic analytics&lt;/h2&gt;
&lt;p&gt;Batch architectures can support excellent reporting, but active agent loops have a different timing profile. If the agent has to detect an anomaly, check constraints, and trigger an action, the feedback loop must be measured in seconds or less.&lt;/p&gt;
&lt;p&gt;I care about this topic because it sits at the boundary between open data architecture and AI execution. Most companies are not choosing one engine for every workload anymore. They have warehouses, lakehouse engines, streaming systems, catalogs, metadata platforms, and now agents that ask for data through tools. The shared contract between those systems matters more than any single feature checkbox.&lt;/p&gt;
&lt;p&gt;The vendor-neutral reading is straightforward. If the underlying table and catalog standards get stronger, buyers get more freedom to choose the right engine for each job. Snowflake, Microsoft, ClickHouse, Atlan, Dremio, and the open-source Iceberg ecosystem all point to the same market reality: data platforms are becoming multi-engine and agent-facing.&lt;/p&gt;
&lt;h2&gt;How the architecture works&lt;/h2&gt;
&lt;p&gt;Fast ingestion keeps fresh events visible to the analytical system.&lt;/p&gt;
&lt;p&gt;Low-latency aggregation lets the agent test whether a signal is noise or a real anomaly.&lt;/p&gt;
&lt;p&gt;Action policies define what the agent may do after it validates a signal.&lt;/p&gt;
&lt;p&gt;The important architectural habit is to separate responsibilities. The table format manages files, snapshots, schema evolution, and table metadata. The catalog manages identity, namespaces, commits, and access patterns. The query engine plans and executes work. The semantic layer maps raw data into business meaning. The agent interface decides which safe tools a model can call.&lt;/p&gt;
&lt;p&gt;That separation keeps the system honest. If a vendor says a workload is open, ask which layer is open. If a feature supports Iceberg, ask which Iceberg version, which operations, and which engines. If an agent can query data, ask whether it is querying raw tables or certified semantic views.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/clickhouse-real-time-agentic-analytics-event-loops-diagram-2.png&quot; alt=&quot;Operating model diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;A concrete operating example&lt;/h2&gt;
&lt;p&gt;A commerce platform may detect checkout failures by region, compare the pattern against historical baselines, confirm that payment errors are above threshold, and open an incident or scale a service before the next batch window.&lt;/p&gt;
&lt;p&gt;That example is intentionally operational. Architecture diagrams are useful, but the design only proves itself when a real workload runs through it. I want to know who owns the table, which catalog authorizes the operation, which engine writes, which engine reads, which semantic view users see, and how the team detects a bad result.&lt;/p&gt;
&lt;p&gt;For agentic analytics, the same example gets stricter. A human analyst can notice ambiguity and ask a teammate. An agent will often keep going unless the tool interface stops it. That means your architecture needs approved definitions, scoped access, query limits, logging, and a clean rollback path before it needs a flashy chat experience.&lt;/p&gt;
&lt;p&gt;This is why I do not treat open table formats as the whole story. Apache Iceberg gives the platform a strong storage contract. It does not, by itself, define customer lifetime value, revenue recognition rules, data owner approval, or what an AI agent may do after it finds an anomaly. Those rules belong in catalogs, semantic layers, governance systems, and agent tools.&lt;/p&gt;
&lt;h2&gt;What this means for the lakehouse&lt;/h2&gt;
&lt;p&gt;ClickHouse&apos;s real-time focus clarifies an architectural truth: some workloads need specialized low-latency event stores. A lakehouse platform should federate across systems, present governed views, and keep the semantic layer consistent so agents do not treat every source as a separate truth.&lt;/p&gt;
&lt;p&gt;A lakehouse platform needs five capabilities to serve agents reliably: query federation to reduce data movement; autonomous performance using Reflections, caching, and table optimization so interactive loops stay fast; an AI Semantic Layer that gives agents approved business context; agentic interfaces through the UI, Python, or MCP-connected tools; and AI SQL functions that bring model-assisted work into SQL without exporting data.&lt;/p&gt;
&lt;h2&gt;Implementation checklist&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;What to document&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Table contract&lt;/td&gt;
&lt;td&gt;Format version, schema rules, snapshot policy, and rollback plan&lt;/td&gt;
&lt;td&gt;Engines need the same understanding of the table.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalog authority&lt;/td&gt;
&lt;td&gt;Production catalog, namespaces, commit rules, and role model&lt;/td&gt;
&lt;td&gt;Multi-engine systems need one source of table truth.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine matrix&lt;/td&gt;
&lt;td&gt;Read, write, merge, delete, schema, and view support by engine&lt;/td&gt;
&lt;td&gt;A feature is not production-ready until the exact operation is tested.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic layer&lt;/td&gt;
&lt;td&gt;Certified views, metric definitions, owners, and labels&lt;/td&gt;
&lt;td&gt;Agents need business meaning, not raw schemas alone.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Credential model, token lifetime, row filters, column masks, and audit logs&lt;/td&gt;
&lt;td&gt;Open access still needs strict governance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations&lt;/td&gt;
&lt;td&gt;Compaction, vacuum, retries, alerting, and incident ownership&lt;/td&gt;
&lt;td&gt;The design must survive failed jobs and bad deploys.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;My practical checklist for this topic is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Define which actions are advisory and which actions can execute automatically.&lt;/li&gt;
&lt;li&gt;Use threshold windows, anomaly validation, and rollback hooks.&lt;/li&gt;
&lt;li&gt;Keep event retention and aggregation strategy explicit.&lt;/li&gt;
&lt;li&gt;Expose real-time signals through certified semantic views when agents consume them.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those items are not written down, the project is still in the demo stage. That does not mean the idea is weak. It means the operating model is not finished.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/clickhouse-real-time-agentic-analytics-event-loops-diagram-3.png&quot; alt=&quot;Implementation checklist diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Failure modes worth respecting&lt;/h2&gt;
&lt;p&gt;Real time does not forgive weak policy. An agent that acts quickly on a bad metric can create customer impact faster than a dashboard ever could.&lt;/p&gt;
&lt;p&gt;The other failure mode is semantic drift. A table can be technically valid while the business definition on top of it changes quietly. That is where many AI analytics projects fail. The model generates SQL against a table that exists, the query returns rows, and the answer looks plausible. The problem is that the answer used the wrong grain, the wrong filter, or the wrong metric definition.&lt;/p&gt;
&lt;p&gt;The fix is not a longer prompt. The fix is stronger data contracts. Certified semantic views should be easier for agents to use than raw tables. Sensitive columns should be masked or hidden before the model can ask for them. Write-capable tools should require intent, validation, and idempotency. Expensive queries should have limits. Every tool call should leave evidence.&lt;/p&gt;
&lt;p&gt;This is also where vendor-neutral thinking helps. Do not trust a platform because it has the best demo. Trust the platform when it gives you clear contracts between storage, catalog, semantic layer, engine, and agent. Trust it more when you can test those contracts with another engine or another client.&lt;/p&gt;
&lt;h2&gt;What I would do first&lt;/h2&gt;
&lt;p&gt;Start with one production-shaped workflow. Do not start with the easiest toy table, and do not start with the most politically sensitive workload. Pick a table or semantic view that matters, has an owner, has known correctness checks, and can tolerate a controlled pilot.&lt;/p&gt;
&lt;p&gt;For real-time agentic analytics, I would write down five things before touching production: the owner, the accepted engines, the policy boundary, the rollback path, and the agent-facing interface. Then I would run the same workflow three ways: manually, through the intended query engine, and through the agent or automation layer. Differences between those paths are where the real work begins.&lt;/p&gt;
&lt;p&gt;Measure boring things. Count files. Count snapshots. Track query planning time. Track storage calls. Track failed commits. Track token issuance. Track denied access. Track whether a human can explain the result without reading tool logs for an hour. These metrics are not glamorous, but they tell you whether the architecture is ready.&lt;/p&gt;
&lt;h2&gt;Final recommendation&lt;/h2&gt;
&lt;p&gt;The right conclusion is not that every team should adopt every June 2026 feature immediately. The right conclusion is that the lakehouse is becoming an execution surface for humans and agents, and that changes the quality bar. Open storage is necessary. Governed catalogs are necessary. Semantic context is necessary. Fast SQL is necessary. Scoped agent tools are necessary.&lt;/p&gt;
&lt;p&gt;That combination is exactly why the Agentic Lakehouse is becoming the right framing. It describes the platform you need when AI agents stop answering isolated questions and start participating in analytical workflows.&lt;/p&gt;
&lt;p&gt;For more background on the lakehouse and AI side of this work, explore my books on data lakehouses and AI at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;. If you want to try this style of governed, open, agent-ready architecture in practice, start a free trial of Dremio&apos;s Agentic Lakehouse at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Field notes for teams evaluating this now&lt;/h2&gt;
&lt;p&gt;First, make compatibility visible. A table-format version, catalog endpoint, and engine release should appear in your runbook. If a production issue happens, nobody should have to guess which engine wrote the latest snapshot or which client introduced a metadata change.&lt;/p&gt;
&lt;p&gt;Second, keep the semantic layer close to the workflow. If the article topic affects analytics agents, customer-facing metrics, financial reporting, or regulated data, raw-table access should be the exception. Certified views should be the normal path.&lt;/p&gt;
&lt;p&gt;Third, separate experimentation from certification. Engineers need sandboxes where they can test new Iceberg features, catalog options, and agent tools. Business users and agents need certified surfaces where definitions, owners, and policies have already been reviewed.&lt;/p&gt;
&lt;p&gt;Fourth, keep the architecture open. Not every byte must move into one platform. An architecture that can query data in place, add semantic context, accelerate common workloads, and expose governed agent interfaces over open data creates more flexibility.&lt;/p&gt;
&lt;p&gt;Fifth, publish the limits. If a feature is read-only in one engine, say so. If write interoperability is approved only for append workloads, say so. If remote signing is required for regulated tables, say so. Clear limits create trust. Hidden limits create incidents.&lt;/p&gt;
&lt;h2&gt;Identity and access review&lt;/h2&gt;
&lt;p&gt;For real-time agentic analytics, I would run one full dry run with production-like identities. Use an analyst identity, a service account, and the intended agent identity. Confirm that each identity sees only the expected semantic objects, receives predictable errors, and leaves useful audit records. That test catches policy gaps before they become production incidents.&lt;/p&gt;
&lt;p&gt;The agent identity matters most because it is easy to over-permission during a pilot. If the agent only needs a certified revenue view, do not give it namespace-wide table discovery. If the agent needs row-level access for one geography, test that a second geography returns a denial instead of silent leakage.&lt;/p&gt;
&lt;h2&gt;Documentation that actually helps&lt;/h2&gt;
&lt;p&gt;The documentation should fit on one page. Name the owner, the supported engines, the catalog authority, the accepted table operations, the security model, and the rollback path. If a new engineer cannot understand the contract for real-time agentic analytics from that page, the architecture is still too implicit.&lt;/p&gt;
&lt;p&gt;Good documentation is not a wiki dump. It is an operating contract. It should say who can approve a schema change, which engine owns compaction, how long snapshots are retained, and what happens when an agent produces a suspicious result. That level of detail is what turns a promising pattern into a maintainable system.&lt;/p&gt;
&lt;h2&gt;How to keep agents in bounds&lt;/h2&gt;
&lt;p&gt;Agents should not receive broad table access just because a human can ask broad questions. For real-time agentic analytics, expose narrow tools over certified views first. Add write-capable tools only after you have validation rules, idempotency keys, approval gates, and audit records that a reviewer can follow.&lt;/p&gt;
&lt;p&gt;The tool description should also be honest. If a tool returns estimated data, say estimated. If a tool excludes delayed transactions, say that. If a tool is read-only, make that clear in the name and policy. Agents work better when the interface gives them fewer chances to infer the wrong contract.&lt;/p&gt;
&lt;h2&gt;What to measure after launch&lt;/h2&gt;
&lt;p&gt;The first production month should be measurement-heavy. Track planning time, query latency, failed commits, denied access attempts, credential issuance, snapshot growth, and semantic-view usage. If real-time agentic analytics is helping, the evidence should show up in fewer manual workarounds and clearer operational ownership.&lt;/p&gt;
&lt;p&gt;I would also track human trust signals. Are analysts using the certified view more often? Are engineers filing fewer tickets about unclear table ownership? Are agents producing answers that reviewers can trace back to approved definitions? Those signals tell you whether the architecture is improving daily work, not just passing a benchmark.&lt;/p&gt;
&lt;h2&gt;A buyer question worth asking&lt;/h2&gt;
&lt;p&gt;The buyer question is simple: does this pattern increase choice without weakening governance? For real-time agentic analytics, the best answer is specific. It should name the table format, catalog contract, semantic surface, security controls, and engine support matrix. Anything less is a demo, not an operating model.&lt;/p&gt;
&lt;p&gt;This is where the architecture should stay disciplined. The point is not that open architecture is automatically better. The point is that open architecture gives you room to test engines, keep data in place, add semantic context, and still maintain control. That is a stronger argument than a generic platform claim.&lt;/p&gt;
&lt;h2&gt;A realistic rollout sequence&lt;/h2&gt;
&lt;p&gt;The rollout should start with read visibility, then move to operational automation, then consider action loops. For real-time agentic analytics, the first milestone is a certified read path with approved semantics. The second milestone is repeatable validation through CI or scheduled checks. The third milestone is agent access with narrow tools and strict audit.&lt;/p&gt;
&lt;p&gt;Write paths should come later unless the topic itself is about write interoperability or table maintenance. Even then, begin with append-only or isolated writes. Updates, deletes, merges, and external actions need stronger controls because they change the state other people depend on.&lt;/p&gt;
&lt;h2&gt;How this should sound to executives&lt;/h2&gt;
&lt;p&gt;The executive version should avoid implementation trivia, but it should not become vague. Say that real-time agentic analytics helps the company keep analytical data open, governed, and ready for AI-assisted work. Then say what the team will measure: cost, speed, correctness, access control, and operational effort.&lt;/p&gt;
&lt;p&gt;That framing is useful because executives do not need every catalog detail. They do need to know whether the architecture reduces lock-in, improves reliability, and gives agents a trustworthy data foundation. Those are business outcomes tied to technical choices.&lt;/p&gt;
&lt;h2&gt;How this should sound to engineers&lt;/h2&gt;
&lt;p&gt;The engineering version should be blunt. Which APIs are used? Which engine versions are approved? Which table operations are allowed? Which failures are retried? Which failures stop the workflow? Which logs prove that the right identity performed the right operation?&lt;/p&gt;
&lt;p&gt;For real-time agentic analytics, those questions are more valuable than broad claims. They force the team to define the boundary between the open standard, the vendor implementation, the query engine, the semantic model, and the agent tool.&lt;/p&gt;
&lt;h2&gt;What not to automate yet&lt;/h2&gt;
&lt;p&gt;Do not automate the parts of real-time agentic analytics that the team cannot explain manually. If nobody can explain the metric, the agent should not calculate it. If nobody can explain rollback, the agent should not write. If nobody can explain the security boundary, the tool should stay internal.&lt;/p&gt;
&lt;p&gt;This is not anti-automation. It is how automation earns trust. Automate the parts with clear contracts first, then widen the scope as evidence accumulates.&lt;/p&gt;
&lt;h2&gt;Source-of-truth ownership&lt;/h2&gt;
&lt;p&gt;Every production rollout needs one named source of truth for each layer. The table has an owner. The catalog has an owner. The semantic view has an owner. The agent tool has an owner. For real-time agentic analytics, those owners may sit on different teams, but the contract between them has to be explicit.&lt;/p&gt;
&lt;p&gt;Clear ownership across all layers keeps the architecture credible, whether the governed execution and semantic layer lives in one platform or across several independent services.&lt;/p&gt;
&lt;p&gt;Clear ownership prevents avoidable production confusion.&lt;/p&gt;
&lt;h2&gt;Review cadence&lt;/h2&gt;
&lt;p&gt;Set a review cadence before the first production launch. For real-time agentic analytics, I would review the contract after the first week, after the first month, and after the first engine or catalog upgrade. Most problems appear when a workflow that worked in a pilot meets a new version, a new identity, or a new business definition.&lt;/p&gt;
&lt;p&gt;That review should include both platform engineers and business owners. Engineers can verify the mechanics. Business owners can verify that the answers still mean what the company thinks they mean.&lt;/p&gt;
&lt;h2&gt;Launch criteria&lt;/h2&gt;
&lt;p&gt;The launch criteria should be binary. Either real-time agentic analytics has a named owner, passing validation checks, approved security boundaries, working rollback, and documented engine support, or it is not ready. Gray areas are acceptable in a research project. They are expensive in production.&lt;/p&gt;
&lt;p&gt;This keeps the article&apos;s recommendation practical: prove the contract first, then widen adoption.&lt;/p&gt;
&lt;h2&gt;Compliance evidence&lt;/h2&gt;
&lt;p&gt;Save the evidence. For real-time agentic analytics, keep validation output, approval records, denied-access tests, and rollback proof with the release notes. Future audits are easier when the team can show what it tested before launch.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Composable Analytics Beats Metric Catalogs</title><link>https://iceberglakehouse.com/posts/composable-analytics-semantic-layers-expressiveness/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/composable-analytics-semantic-layers-expressiveness/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-composable-analytics-semantic-la...</description><pubDate>Mon, 08 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-composable-analytics-semantic-layers-expressiveness/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Metric catalogs tell agents what terms mean. Composable analytics tells agents how to reason with those terms safely. That is the useful lens for composable analytics in June 2026. The market is not short on announcements. What matters is whether the new pattern changes ownership, performance, governance, and agent readiness in a way your team can operate.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/composable-analytics-semantic-layers-expressiveness-diagram-1.png&quot; alt=&quot;composable analytics architecture diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The market signal behind composable analytics&lt;/h2&gt;
&lt;p&gt;A list of metric names is useful, but it is not enough for agentic analytics. Agents compare periods, exclude cohorts, segment behavior, generate follow-up questions, and combine operations that can quietly break a metric definition.&lt;/p&gt;
&lt;p&gt;I care about this topic because it sits at the boundary between open data architecture and AI execution. Most companies are not choosing one engine for every workload anymore. They have warehouses, lakehouse engines, streaming systems, catalogs, metadata platforms, and now agents that ask for data through tools. The shared contract between those systems matters more than any single feature checkbox.&lt;/p&gt;
&lt;p&gt;The vendor-neutral reading is straightforward. If the underlying table and catalog standards get stronger, buyers get more freedom to choose the right engine for each job. Snowflake, Microsoft, ClickHouse, Atlan, Dremio, and the open-source Iceberg ecosystem all point to the same market reality: data platforms are becoming multi-engine and agent-facing.&lt;/p&gt;
&lt;h2&gt;How the architecture works&lt;/h2&gt;
&lt;p&gt;The first maturity level is raw text-to-SQL against tables.&lt;/p&gt;
&lt;p&gt;The second level is metric lookup, where the agent can find approved definitions.&lt;/p&gt;
&lt;p&gt;The third level is composable operations, where joins, filters, time comparisons, cohorts, and aggregations carry semantic rules.&lt;/p&gt;
&lt;p&gt;The important architectural habit is to separate responsibilities. The table format manages files, snapshots, schema evolution, and table metadata. The catalog manages identity, namespaces, commits, and access patterns. The query engine plans and executes work. The semantic layer maps raw data into business meaning. The agent interface decides which safe tools a model can call.&lt;/p&gt;
&lt;p&gt;That separation keeps the system honest. If a vendor says a workload is open, ask which layer is open. If a feature supports Iceberg, ask which Iceberg version, which operations, and which engines. If an agent can query data, ask whether it is querying raw tables or certified semantic views.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/composable-analytics-semantic-layers-expressiveness-diagram-2.png&quot; alt=&quot;Operating model diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;A concrete operating example&lt;/h2&gt;
&lt;p&gt;Customer churn sounds simple until the agent asks for churn among accounts that upgraded last quarter, excluding customers with open billing disputes, compared with the same quarter last year. A flat metric catalog cannot safely compose that answer by itself.&lt;/p&gt;
&lt;p&gt;That example is intentionally operational. Architecture diagrams are useful, but the design only proves itself when a real workload runs through it. I want to know who owns the table, which catalog authorizes the operation, which engine writes, which engine reads, which semantic view users see, and how the team detects a bad result.&lt;/p&gt;
&lt;p&gt;For agentic analytics, the same example gets stricter. A human analyst can notice ambiguity and ask a teammate. An agent will often keep going unless the tool interface stops it. That means your architecture needs approved definitions, scoped access, query limits, logging, and a clean rollback path before it needs a flashy chat experience.&lt;/p&gt;
&lt;p&gt;This is why I do not treat open table formats as the whole story. Apache Iceberg gives the platform a strong storage contract. It does not, by itself, define customer lifetime value, revenue recognition rules, data owner approval, or what an AI agent may do after it finds an anomaly. Those rules belong in catalogs, semantic layers, governance systems, and agent tools.&lt;/p&gt;
&lt;h2&gt;What this means for the lakehouse&lt;/h2&gt;
&lt;p&gt;A governed semantic layer uses views, wikis, labels, and catalog context to shape how agents generate SQL. The lakehouse needs an AI-ready semantic layer, not just more dashboards.&lt;/p&gt;
&lt;p&gt;A lakehouse platform needs five capabilities to serve agents reliably: query federation to reduce data movement; autonomous performance using Reflections, caching, and table optimization so interactive loops stay fast; an AI Semantic Layer that gives agents approved business context; agentic interfaces through the UI, Python, or MCP-connected tools; and AI SQL functions that bring model-assisted work into SQL without exporting data.&lt;/p&gt;
&lt;h2&gt;Implementation checklist&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;What to document&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Table contract&lt;/td&gt;
&lt;td&gt;Format version, schema rules, snapshot policy, and rollback plan&lt;/td&gt;
&lt;td&gt;Engines need the same understanding of the table.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalog authority&lt;/td&gt;
&lt;td&gt;Production catalog, namespaces, commit rules, and role model&lt;/td&gt;
&lt;td&gt;Multi-engine systems need one source of table truth.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine matrix&lt;/td&gt;
&lt;td&gt;Read, write, merge, delete, schema, and view support by engine&lt;/td&gt;
&lt;td&gt;A feature is not production-ready until the exact operation is tested.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic layer&lt;/td&gt;
&lt;td&gt;Certified views, metric definitions, owners, and labels&lt;/td&gt;
&lt;td&gt;Agents need business meaning, not raw schemas alone.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Credential model, token lifetime, row filters, column masks, and audit logs&lt;/td&gt;
&lt;td&gt;Open access still needs strict governance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations&lt;/td&gt;
&lt;td&gt;Compaction, vacuum, retries, alerting, and incident ownership&lt;/td&gt;
&lt;td&gt;The design must survive failed jobs and bad deploys.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;My practical checklist for this topic is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Mark a small set of metrics as certified before expanding.&lt;/li&gt;
&lt;li&gt;Test period comparisons, cohort filters, and exclusions explicitly.&lt;/li&gt;
&lt;li&gt;Keep semantic definitions versioned.&lt;/li&gt;
&lt;li&gt;Give agents only the semantic operations that have passed review.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those items are not written down, the project is still in the demo stage. That does not mean the idea is weak. It means the operating model is not finished.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/composable-analytics-semantic-layers-expressiveness-diagram-3.png&quot; alt=&quot;Implementation checklist diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Failure modes worth respecting&lt;/h2&gt;
&lt;p&gt;Composable models require discipline. If the semantic layer is full of unreviewed shortcuts, agents will combine shortcuts and produce polished nonsense.&lt;/p&gt;
&lt;p&gt;The other failure mode is semantic drift. A table can be technically valid while the business definition on top of it changes quietly. That is where many AI analytics projects fail. The model generates SQL against a table that exists, the query returns rows, and the answer looks plausible. The problem is that the answer used the wrong grain, the wrong filter, or the wrong metric definition.&lt;/p&gt;
&lt;p&gt;The fix is not a longer prompt. The fix is stronger data contracts. Certified semantic views should be easier for agents to use than raw tables. Sensitive columns should be masked or hidden before the model can ask for them. Write-capable tools should require intent, validation, and idempotency. Expensive queries should have limits. Every tool call should leave evidence.&lt;/p&gt;
&lt;p&gt;This is also where vendor-neutral thinking helps. Do not trust a platform because it has the best demo. Trust the platform when it gives you clear contracts between storage, catalog, semantic layer, engine, and agent. Trust it more when you can test those contracts with another engine or another client.&lt;/p&gt;
&lt;h2&gt;What I would do first&lt;/h2&gt;
&lt;p&gt;Start with one production-shaped workflow. Do not start with the easiest toy table, and do not start with the most politically sensitive workload. Pick a table or semantic view that matters, has an owner, has known correctness checks, and can tolerate a controlled pilot.&lt;/p&gt;
&lt;p&gt;For composable analytics, I would write down five things before touching production: the owner, the accepted engines, the policy boundary, the rollback path, and the agent-facing interface. Then I would run the same workflow three ways: manually, through the intended query engine, and through the agent or automation layer. Differences between those paths are where the real work begins.&lt;/p&gt;
&lt;p&gt;Measure boring things. Count files. Count snapshots. Track query planning time. Track storage calls. Track failed commits. Track token issuance. Track denied access. Track whether a human can explain the result without reading tool logs for an hour. These metrics are not glamorous, but they tell you whether the architecture is ready.&lt;/p&gt;
&lt;h2&gt;Final recommendation&lt;/h2&gt;
&lt;p&gt;The right conclusion is not that every team should adopt every June 2026 feature immediately. The right conclusion is that the lakehouse is becoming an execution surface for humans and agents, and that changes the quality bar. Open storage is necessary. Governed catalogs are necessary. Semantic context is necessary. Fast SQL is necessary. Scoped agent tools are necessary.&lt;/p&gt;
&lt;p&gt;That combination is exactly why the Agentic Lakehouse is becoming the right framing. It describes the platform you need when AI agents stop answering isolated questions and start participating in analytical workflows.&lt;/p&gt;
&lt;p&gt;For more background on the lakehouse and AI side of this work, explore my books on data lakehouses and AI at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;. If you want to try this style of governed, open, agent-ready architecture in practice, start a free trial of Dremio&apos;s Agentic Lakehouse at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Field notes for teams evaluating this now&lt;/h2&gt;
&lt;p&gt;First, make compatibility visible. A table-format version, catalog endpoint, and engine release should appear in your runbook. If a production issue happens, nobody should have to guess which engine wrote the latest snapshot or which client introduced a metadata change.&lt;/p&gt;
&lt;p&gt;Second, keep the semantic layer close to the workflow. If the article topic affects analytics agents, customer-facing metrics, financial reporting, or regulated data, raw-table access should be the exception. Certified views should be the normal path.&lt;/p&gt;
&lt;p&gt;Third, separate experimentation from certification. Engineers need sandboxes where they can test new Iceberg features, catalog options, and agent tools. Business users and agents need certified surfaces where definitions, owners, and policies have already been reviewed.&lt;/p&gt;
&lt;p&gt;Fourth, keep the architecture open. Not every byte must move into one platform. An architecture that can query data in place, add semantic context, accelerate common workloads, and expose governed agent interfaces over open data creates more flexibility.&lt;/p&gt;
&lt;p&gt;Fifth, publish the limits. If a feature is read-only in one engine, say so. If write interoperability is approved only for append workloads, say so. If remote signing is required for regulated tables, say so. Clear limits create trust. Hidden limits create incidents.&lt;/p&gt;
&lt;h2&gt;Identity and access review&lt;/h2&gt;
&lt;p&gt;For composable analytics, I would run one full dry run with production-like identities. Use an analyst identity, a service account, and the intended agent identity. Confirm that each identity sees only the expected semantic objects, receives predictable errors, and leaves useful audit records. That test catches policy gaps before they become production incidents.&lt;/p&gt;
&lt;p&gt;The agent identity matters most because it is easy to over-permission during a pilot. If the agent only needs a certified revenue view, do not give it namespace-wide table discovery. If the agent needs row-level access for one geography, test that a second geography returns a denial instead of silent leakage.&lt;/p&gt;
&lt;h2&gt;Documentation that actually helps&lt;/h2&gt;
&lt;p&gt;The documentation should fit on one page. Name the owner, the supported engines, the catalog authority, the accepted table operations, the security model, and the rollback path. If a new engineer cannot understand the contract for composable analytics from that page, the architecture is still too implicit.&lt;/p&gt;
&lt;p&gt;Good documentation is not a wiki dump. It is an operating contract. It should say who can approve a schema change, which engine owns compaction, how long snapshots are retained, and what happens when an agent produces a suspicious result. That level of detail is what turns a promising pattern into a maintainable system.&lt;/p&gt;
&lt;h2&gt;How to keep agents in bounds&lt;/h2&gt;
&lt;p&gt;Agents should not receive broad table access just because a human can ask broad questions. For composable analytics, expose narrow tools over certified views first. Add write-capable tools only after you have validation rules, idempotency keys, approval gates, and audit records that a reviewer can follow.&lt;/p&gt;
&lt;p&gt;The tool description should also be honest. If a tool returns estimated data, say estimated. If a tool excludes delayed transactions, say that. If a tool is read-only, make that clear in the name and policy. Agents work better when the interface gives them fewer chances to infer the wrong contract.&lt;/p&gt;
&lt;h2&gt;What to measure after launch&lt;/h2&gt;
&lt;p&gt;The first production month should be measurement-heavy. Track planning time, query latency, failed commits, denied access attempts, credential issuance, snapshot growth, and semantic-view usage. If composable analytics is helping, the evidence should show up in fewer manual workarounds and clearer operational ownership.&lt;/p&gt;
&lt;p&gt;I would also track human trust signals. Are analysts using the certified view more often? Are engineers filing fewer tickets about unclear table ownership? Are agents producing answers that reviewers can trace back to approved definitions? Those signals tell you whether the architecture is improving daily work, not just passing a benchmark.&lt;/p&gt;
&lt;h2&gt;A buyer question worth asking&lt;/h2&gt;
&lt;p&gt;The buyer question is simple: does this pattern increase choice without weakening governance? For composable analytics, the best answer is specific. It should name the table format, catalog contract, semantic surface, security controls, and engine support matrix. Anything less is a demo, not an operating model.&lt;/p&gt;
&lt;p&gt;This is where the architecture should stay disciplined. The point is not that open architecture is automatically better. The point is that open architecture gives you room to test engines, keep data in place, add semantic context, and still maintain control. That is a stronger argument than a generic platform claim.&lt;/p&gt;
&lt;h2&gt;A realistic rollout sequence&lt;/h2&gt;
&lt;p&gt;The rollout should start with read visibility, then move to operational automation, then consider action loops. For composable analytics, the first milestone is a certified read path with approved semantics. The second milestone is repeatable validation through CI or scheduled checks. The third milestone is agent access with narrow tools and strict audit.&lt;/p&gt;
&lt;p&gt;Write paths should come later unless the topic itself is about write interoperability or table maintenance. Even then, begin with append-only or isolated writes. Updates, deletes, merges, and external actions need stronger controls because they change the state other people depend on.&lt;/p&gt;
&lt;h2&gt;How this should sound to executives&lt;/h2&gt;
&lt;p&gt;The executive version should avoid implementation trivia, but it should not become vague. Say that composable analytics helps the company keep analytical data open, governed, and ready for AI-assisted work. Then say what the team will measure: cost, speed, correctness, access control, and operational effort.&lt;/p&gt;
&lt;p&gt;That framing is useful because executives do not need every catalog detail. They do need to know whether the architecture reduces lock-in, improves reliability, and gives agents a trustworthy data foundation. Those are business outcomes tied to technical choices.&lt;/p&gt;
&lt;h2&gt;How this should sound to engineers&lt;/h2&gt;
&lt;p&gt;The engineering version should be blunt. Which APIs are used? Which engine versions are approved? Which table operations are allowed? Which failures are retried? Which failures stop the workflow? Which logs prove that the right identity performed the right operation?&lt;/p&gt;
&lt;p&gt;For composable analytics, those questions are more valuable than broad claims. They force the team to define the boundary between the open standard, the vendor implementation, the query engine, the semantic model, and the agent tool.&lt;/p&gt;
&lt;h2&gt;What not to automate yet&lt;/h2&gt;
&lt;p&gt;Do not automate the parts of composable analytics that the team cannot explain manually. If nobody can explain the metric, the agent should not calculate it. If nobody can explain rollback, the agent should not write. If nobody can explain the security boundary, the tool should stay internal.&lt;/p&gt;
&lt;p&gt;This is not anti-automation. It is how automation earns trust. Automate the parts with clear contracts first, then widen the scope as evidence accumulates.&lt;/p&gt;
&lt;h2&gt;Source-of-truth ownership&lt;/h2&gt;
&lt;p&gt;Every production rollout needs one named source of truth for each layer. The table has an owner. The catalog has an owner. The semantic view has an owner. The agent tool has an owner. For composable analytics, those owners may sit on different teams, but the contract between them has to be explicit.&lt;/p&gt;
&lt;p&gt;Clear ownership across all layers keeps the architecture credible, whether the governed execution and semantic layer lives in one platform or across several independent services.&lt;/p&gt;
&lt;p&gt;Clear ownership prevents avoidable production confusion.&lt;/p&gt;
&lt;h2&gt;Review cadence&lt;/h2&gt;
&lt;p&gt;Set a review cadence before the first production launch. For composable analytics, I would review the contract after the first week, after the first month, and after the first engine or catalog upgrade. Most problems appear when a workflow that worked in a pilot meets a new version, a new identity, or a new business definition.&lt;/p&gt;
&lt;p&gt;That review should include both platform engineers and business owners. Engineers can verify the mechanics. Business owners can verify that the answers still mean what the company thinks they mean.&lt;/p&gt;
&lt;h2&gt;Launch criteria&lt;/h2&gt;
&lt;p&gt;The launch criteria should be binary. Either composable analytics has a named owner, passing validation checks, approved security boundaries, working rollback, and documented engine support, or it is not ready. Gray areas are acceptable in a research project. They are expensive in production.&lt;/p&gt;
&lt;p&gt;This keeps the article&apos;s recommendation practical: prove the contract first, then widen adoption.&lt;/p&gt;
&lt;h2&gt;Compliance evidence&lt;/h2&gt;
&lt;p&gt;Save the evidence. For composable analytics, keep validation output, approval records, denied-access tests, and rollback proof with the release notes. Future audits are easier when the team can show what it tested before launch.&lt;/p&gt;
&lt;h2&gt;Test cases that matter&lt;/h2&gt;
&lt;p&gt;Use test cases that reflect real business questions. For composable analytics, include at least one happy path, one denied-access path, one stale-data path, and one rollback path. Those tests reveal more than a generic demo query.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Goal-Directed Analytics Agents on Apache Iceberg</title><link>https://iceberglakehouse.com/posts/goal-directed-analytics-agents-apache-iceberg-action-loops/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/goal-directed-analytics-agents-apache-iceberg-action-loops/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-goal-directed-analytics-agents-a...</description><pubDate>Mon, 08 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-goal-directed-analytics-agents-apache-iceberg-action-loops/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The next step after text-to-SQL is a governed action loop with checks before every external effect. That is the useful lens for goal-directed analytics agents in June 2026. The market is not short on announcements. What matters is whether the new pattern changes ownership, performance, governance, and agent readiness in a way your team can operate.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/goal-directed-analytics-agents-apache-iceberg-action-loops-diagram-1.png&quot; alt=&quot;goal-directed analytics agents architecture diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The market signal behind goal-directed analytics agents&lt;/h2&gt;
&lt;p&gt;Chat interfaces are useful for exploration, but goal-directed agents need a loop: observe a signal, query the lakehouse, validate constraints, decide whether action is allowed, execute, and record what happened.&lt;/p&gt;
&lt;p&gt;I care about this topic because it sits at the boundary between open data architecture and AI execution. Most companies are not choosing one engine for every workload anymore. They have warehouses, lakehouse engines, streaming systems, catalogs, metadata platforms, and now agents that ask for data through tools. The shared contract between those systems matters more than any single feature checkbox.&lt;/p&gt;
&lt;p&gt;The vendor-neutral reading is straightforward. If the underlying table and catalog standards get stronger, buyers get more freedom to choose the right engine for each job. Snowflake, Microsoft, ClickHouse, Atlan, Dremio, and the open-source Iceberg ecosystem all point to the same market reality: data platforms are becoming multi-engine and agent-facing.&lt;/p&gt;
&lt;h2&gt;How the architecture works&lt;/h2&gt;
&lt;p&gt;Iceberg gives the agent a snapshot-based analytical record that can be queried and rolled back.&lt;/p&gt;
&lt;p&gt;The semantic layer gives the agent approved business definitions.&lt;/p&gt;
&lt;p&gt;The action layer constrains API calls, remediation jobs, notifications, and writebacks.&lt;/p&gt;
&lt;p&gt;The important architectural habit is to separate responsibilities. The table format manages files, snapshots, schema evolution, and table metadata. The catalog manages identity, namespaces, commits, and access patterns. The query engine plans and executes work. The semantic layer maps raw data into business meaning. The agent interface decides which safe tools a model can call.&lt;/p&gt;
&lt;p&gt;That separation keeps the system honest. If a vendor says a workload is open, ask which layer is open. If a feature supports Iceberg, ask which Iceberg version, which operations, and which engines. If an agent can query data, ask whether it is querying raw tables or certified semantic views.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/goal-directed-analytics-agents-apache-iceberg-action-loops-diagram-2.png&quot; alt=&quot;Operating model diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;A concrete operating example&lt;/h2&gt;
&lt;p&gt;A pipeline-healing agent may detect a freshness breach, query Iceberg table history, confirm that only one source partition is stale, trigger a backfill job, and write an incident note with the snapshot IDs it inspected.&lt;/p&gt;
&lt;p&gt;That example is intentionally operational. Architecture diagrams are useful, but the design only proves itself when a real workload runs through it. I want to know who owns the table, which catalog authorizes the operation, which engine writes, which engine reads, which semantic view users see, and how the team detects a bad result.&lt;/p&gt;
&lt;p&gt;For agentic analytics, the same example gets stricter. A human analyst can notice ambiguity and ask a teammate. An agent will often keep going unless the tool interface stops it. That means your architecture needs approved definitions, scoped access, query limits, logging, and a clean rollback path before it needs a flashy chat experience.&lt;/p&gt;
&lt;p&gt;This is why I do not treat open table formats as the whole story. Apache Iceberg gives the platform a strong storage contract. It does not, by itself, define customer lifetime value, revenue recognition rules, data owner approval, or what an AI agent may do after it finds an anomaly. Those rules belong in catalogs, semantic layers, governance systems, and agent tools.&lt;/p&gt;
&lt;h2&gt;What this means for the lakehouse&lt;/h2&gt;
&lt;p&gt;Fast governed SQL over open data, semantic context for business meaning, and MCP-style interfaces for external AI tools make the action loop auditable instead of improvised.&lt;/p&gt;
&lt;p&gt;A lakehouse platform needs five capabilities to serve agents reliably: query federation to reduce data movement; autonomous performance using Reflections, caching, and table optimization so interactive loops stay fast; an AI Semantic Layer that gives agents approved business context; agentic interfaces through the UI, Python, or MCP-connected tools; and AI SQL functions that bring model-assisted work into SQL without exporting data.&lt;/p&gt;
&lt;h2&gt;Implementation checklist&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;What to document&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Table contract&lt;/td&gt;
&lt;td&gt;Format version, schema rules, snapshot policy, and rollback plan&lt;/td&gt;
&lt;td&gt;Engines need the same understanding of the table.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalog authority&lt;/td&gt;
&lt;td&gt;Production catalog, namespaces, commit rules, and role model&lt;/td&gt;
&lt;td&gt;Multi-engine systems need one source of table truth.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine matrix&lt;/td&gt;
&lt;td&gt;Read, write, merge, delete, schema, and view support by engine&lt;/td&gt;
&lt;td&gt;A feature is not production-ready until the exact operation is tested.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic layer&lt;/td&gt;
&lt;td&gt;Certified views, metric definitions, owners, and labels&lt;/td&gt;
&lt;td&gt;Agents need business meaning, not raw schemas alone.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Credential model, token lifetime, row filters, column masks, and audit logs&lt;/td&gt;
&lt;td&gt;Open access still needs strict governance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations&lt;/td&gt;
&lt;td&gt;Compaction, vacuum, retries, alerting, and incident ownership&lt;/td&gt;
&lt;td&gt;The design must survive failed jobs and bad deploys.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;My practical checklist for this topic is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Start with read-only recommendations before automated writes.&lt;/li&gt;
&lt;li&gt;Require constraint checks before any webhook or writeback.&lt;/li&gt;
&lt;li&gt;Store snapshot IDs, query IDs, and tool-call IDs with every action.&lt;/li&gt;
&lt;li&gt;Run action loops in shadow mode before production automation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those items are not written down, the project is still in the demo stage. That does not mean the idea is weak. It means the operating model is not finished.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/goal-directed-analytics-agents-apache-iceberg-action-loops-diagram-3.png&quot; alt=&quot;Implementation checklist diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Failure modes worth respecting&lt;/h2&gt;
&lt;p&gt;Every action loop needs stop conditions. Without thresholds, approval points, and rollback paths, an agent can make a bad assumption operationally expensive.&lt;/p&gt;
&lt;p&gt;The other failure mode is semantic drift. A table can be technically valid while the business definition on top of it changes quietly. That is where many AI analytics projects fail. The model generates SQL against a table that exists, the query returns rows, and the answer looks plausible. The problem is that the answer used the wrong grain, the wrong filter, or the wrong metric definition.&lt;/p&gt;
&lt;p&gt;The fix is not a longer prompt. The fix is stronger data contracts. Certified semantic views should be easier for agents to use than raw tables. Sensitive columns should be masked or hidden before the model can ask for them. Write-capable tools should require intent, validation, and idempotency. Expensive queries should have limits. Every tool call should leave evidence.&lt;/p&gt;
&lt;p&gt;This is also where vendor-neutral thinking helps. Do not trust a platform because it has the best demo. Trust the platform when it gives you clear contracts between storage, catalog, semantic layer, engine, and agent. Trust it more when you can test those contracts with another engine or another client.&lt;/p&gt;
&lt;h2&gt;What I would do first&lt;/h2&gt;
&lt;p&gt;Start with one production-shaped workflow. Do not start with the easiest toy table, and do not start with the most politically sensitive workload. Pick a table or semantic view that matters, has an owner, has known correctness checks, and can tolerate a controlled pilot.&lt;/p&gt;
&lt;p&gt;For goal-directed analytics agents, I would write down five things before touching production: the owner, the accepted engines, the policy boundary, the rollback path, and the agent-facing interface. Then I would run the same workflow three ways: manually, through the intended query engine, and through the agent or automation layer. Differences between those paths are where the real work begins.&lt;/p&gt;
&lt;p&gt;Measure boring things. Count files. Count snapshots. Track query planning time. Track storage calls. Track failed commits. Track token issuance. Track denied access. Track whether a human can explain the result without reading tool logs for an hour. These metrics are not glamorous, but they tell you whether the architecture is ready.&lt;/p&gt;
&lt;h2&gt;Final recommendation&lt;/h2&gt;
&lt;p&gt;The right conclusion is not that every team should adopt every June 2026 feature immediately. The right conclusion is that the lakehouse is becoming an execution surface for humans and agents, and that changes the quality bar. Open storage is necessary. Governed catalogs are necessary. Semantic context is necessary. Fast SQL is necessary. Scoped agent tools are necessary.&lt;/p&gt;
&lt;p&gt;That combination is exactly why the Agentic Lakehouse is becoming the right framing. It describes the platform you need when AI agents stop answering isolated questions and start participating in analytical workflows.&lt;/p&gt;
&lt;p&gt;For more background on the lakehouse and AI side of this work, explore my books on data lakehouses and AI at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;. If you want to try this style of governed, open, agent-ready architecture in practice, start a free trial of Dremio&apos;s Agentic Lakehouse at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Field notes for teams evaluating this now&lt;/h2&gt;
&lt;p&gt;First, make compatibility visible. A table-format version, catalog endpoint, and engine release should appear in your runbook. If a production issue happens, nobody should have to guess which engine wrote the latest snapshot or which client introduced a metadata change.&lt;/p&gt;
&lt;p&gt;Second, keep the semantic layer close to the workflow. If the article topic affects analytics agents, customer-facing metrics, financial reporting, or regulated data, raw-table access should be the exception. Certified views should be the normal path.&lt;/p&gt;
&lt;p&gt;Third, separate experimentation from certification. Engineers need sandboxes where they can test new Iceberg features, catalog options, and agent tools. Business users and agents need certified surfaces where definitions, owners, and policies have already been reviewed.&lt;/p&gt;
&lt;p&gt;Fourth, keep the architecture open. Not every byte must move into one platform. An architecture that can query data in place, add semantic context, accelerate common workloads, and expose governed agent interfaces over open data creates more flexibility.&lt;/p&gt;
&lt;p&gt;Fifth, publish the limits. If a feature is read-only in one engine, say so. If write interoperability is approved only for append workloads, say so. If remote signing is required for regulated tables, say so. Clear limits create trust. Hidden limits create incidents.&lt;/p&gt;
&lt;h2&gt;Identity and access review&lt;/h2&gt;
&lt;p&gt;For goal-directed analytics agents, I would run one full dry run with production-like identities. Use an analyst identity, a service account, and the intended agent identity. Confirm that each identity sees only the expected semantic objects, receives predictable errors, and leaves useful audit records. That test catches policy gaps before they become production incidents.&lt;/p&gt;
&lt;p&gt;The agent identity matters most because it is easy to over-permission during a pilot. If the agent only needs a certified revenue view, do not give it namespace-wide table discovery. If the agent needs row-level access for one geography, test that a second geography returns a denial instead of silent leakage.&lt;/p&gt;
&lt;h2&gt;Documentation that actually helps&lt;/h2&gt;
&lt;p&gt;The documentation should fit on one page. Name the owner, the supported engines, the catalog authority, the accepted table operations, the security model, and the rollback path. If a new engineer cannot understand the contract for goal-directed analytics agents from that page, the architecture is still too implicit.&lt;/p&gt;
&lt;p&gt;Good documentation is not a wiki dump. It is an operating contract. It should say who can approve a schema change, which engine owns compaction, how long snapshots are retained, and what happens when an agent produces a suspicious result. That level of detail is what turns a promising pattern into a maintainable system.&lt;/p&gt;
&lt;h2&gt;How to keep agents in bounds&lt;/h2&gt;
&lt;p&gt;Agents should not receive broad table access just because a human can ask broad questions. For goal-directed analytics agents, expose narrow tools over certified views first. Add write-capable tools only after you have validation rules, idempotency keys, approval gates, and audit records that a reviewer can follow.&lt;/p&gt;
&lt;p&gt;The tool description should also be honest. If a tool returns estimated data, say estimated. If a tool excludes delayed transactions, say that. If a tool is read-only, make that clear in the name and policy. Agents work better when the interface gives them fewer chances to infer the wrong contract.&lt;/p&gt;
&lt;h2&gt;What to measure after launch&lt;/h2&gt;
&lt;p&gt;The first production month should be measurement-heavy. Track planning time, query latency, failed commits, denied access attempts, credential issuance, snapshot growth, and semantic-view usage. If goal-directed analytics agents is helping, the evidence should show up in fewer manual workarounds and clearer operational ownership.&lt;/p&gt;
&lt;p&gt;I would also track human trust signals. Are analysts using the certified view more often? Are engineers filing fewer tickets about unclear table ownership? Are agents producing answers that reviewers can trace back to approved definitions? Those signals tell you whether the architecture is improving daily work, not just passing a benchmark.&lt;/p&gt;
&lt;h2&gt;A buyer question worth asking&lt;/h2&gt;
&lt;p&gt;The buyer question is simple: does this pattern increase choice without weakening governance? For goal-directed analytics agents, the best answer is specific. It should name the table format, catalog contract, semantic surface, security controls, and engine support matrix. Anything less is a demo, not an operating model.&lt;/p&gt;
&lt;p&gt;This is where the architecture should stay disciplined. The point is not that open architecture is automatically better. The point is that open architecture gives you room to test engines, keep data in place, add semantic context, and still maintain control. That is a stronger argument than a generic platform claim.&lt;/p&gt;
&lt;h2&gt;A realistic rollout sequence&lt;/h2&gt;
&lt;p&gt;The rollout should start with read visibility, then move to operational automation, then consider action loops. For goal-directed analytics agents, the first milestone is a certified read path with approved semantics. The second milestone is repeatable validation through CI or scheduled checks. The third milestone is agent access with narrow tools and strict audit.&lt;/p&gt;
&lt;p&gt;Write paths should come later unless the topic itself is about write interoperability or table maintenance. Even then, begin with append-only or isolated writes. Updates, deletes, merges, and external actions need stronger controls because they change the state other people depend on.&lt;/p&gt;
&lt;h2&gt;How this should sound to executives&lt;/h2&gt;
&lt;p&gt;The executive version should avoid implementation trivia, but it should not become vague. Say that goal-directed analytics agents helps the company keep analytical data open, governed, and ready for AI-assisted work. Then say what the team will measure: cost, speed, correctness, access control, and operational effort.&lt;/p&gt;
&lt;p&gt;That framing is useful because executives do not need every catalog detail. They do need to know whether the architecture reduces lock-in, improves reliability, and gives agents a trustworthy data foundation. Those are business outcomes tied to technical choices.&lt;/p&gt;
&lt;h2&gt;How this should sound to engineers&lt;/h2&gt;
&lt;p&gt;The engineering version should be blunt. Which APIs are used? Which engine versions are approved? Which table operations are allowed? Which failures are retried? Which failures stop the workflow? Which logs prove that the right identity performed the right operation?&lt;/p&gt;
&lt;p&gt;For goal-directed analytics agents, those questions are more valuable than broad claims. They force the team to define the boundary between the open standard, the vendor implementation, the query engine, the semantic model, and the agent tool.&lt;/p&gt;
&lt;h2&gt;What not to automate yet&lt;/h2&gt;
&lt;p&gt;Do not automate the parts of goal-directed analytics agents that the team cannot explain manually. If nobody can explain the metric, the agent should not calculate it. If nobody can explain rollback, the agent should not write. If nobody can explain the security boundary, the tool should stay internal.&lt;/p&gt;
&lt;p&gt;This is not anti-automation. It is how automation earns trust. Automate the parts with clear contracts first, then widen the scope as evidence accumulates.&lt;/p&gt;
&lt;h2&gt;Source-of-truth ownership&lt;/h2&gt;
&lt;p&gt;Every production rollout needs one named source of truth for each layer. The table has an owner. The catalog has an owner. The semantic view has an owner. The agent tool has an owner. For goal-directed analytics agents, those owners may sit on different teams, but the contract between them has to be explicit.&lt;/p&gt;
&lt;p&gt;Clear ownership across all layers keeps the architecture credible, whether the governed execution and semantic layer lives in one platform or across several independent services.&lt;/p&gt;
&lt;p&gt;Clear ownership prevents avoidable production confusion.&lt;/p&gt;
&lt;h2&gt;Review cadence&lt;/h2&gt;
&lt;p&gt;Set a review cadence before the first production launch. For goal-directed analytics agents, I would review the contract after the first week, after the first month, and after the first engine or catalog upgrade. Most problems appear when a workflow that worked in a pilot meets a new version, a new identity, or a new business definition.&lt;/p&gt;
&lt;p&gt;That review should include both platform engineers and business owners. Engineers can verify the mechanics. Business owners can verify that the answers still mean what the company thinks they mean.&lt;/p&gt;
&lt;h2&gt;Launch criteria&lt;/h2&gt;
&lt;p&gt;The launch criteria should be binary. Either goal-directed analytics agents has a named owner, passing validation checks, approved security boundaries, working rollback, and documented engine support, or it is not ready. Gray areas are acceptable in a research project. They are expensive in production.&lt;/p&gt;
&lt;p&gt;This keeps the article&apos;s recommendation practical: prove the contract first, then widen adoption.&lt;/p&gt;
&lt;h2&gt;Compliance evidence&lt;/h2&gt;
&lt;p&gt;Save the evidence. For goal-directed analytics agents, keep validation output, approval records, denied-access tests, and rollback proof with the release notes. Future audits are easier when the team can show what it tested before launch.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Iceberg Remote Signing for Regulated Datasets</title><link>https://iceberglakehouse.com/posts/iceberg-remote-signing-regulated-datasets/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-remote-signing-regulated-datasets/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-iceberg-remote-signing-regulated...</description><pubDate>Mon, 08 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-iceberg-remote-signing-regulated-datasets/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Remote signing is the stricter pattern for lakehouse storage security because clients request signed file operations instead of receiving storage credentials. That is the useful lens for Iceberg remote signing in June 2026. The market is not short on announcements. What matters is whether the new pattern changes ownership, performance, governance, and agent readiness in a way your team can operate.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/iceberg-remote-signing-regulated-datasets-diagram-1.png&quot; alt=&quot;Iceberg remote signing architecture diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The market signal behind Iceberg remote signing&lt;/h2&gt;
&lt;p&gt;Credential vending is a strong improvement over permanent keys, but some environments need a tighter boundary. Regulated workloads may require that compute engines never receive cloud authorization tokens at all.&lt;/p&gt;
&lt;p&gt;I care about this topic because it sits at the boundary between open data architecture and AI execution. Most companies are not choosing one engine for every workload anymore. They have warehouses, lakehouse engines, streaming systems, catalogs, metadata platforms, and now agents that ask for data through tools. The shared contract between those systems matters more than any single feature checkbox.&lt;/p&gt;
&lt;p&gt;The vendor-neutral reading is straightforward. If the underlying table and catalog standards get stronger, buyers get more freedom to choose the right engine for each job. Snowflake, Microsoft, ClickHouse, Atlan, Dremio, and the open-source Iceberg ecosystem all point to the same market reality: data platforms are becoming multi-engine and agent-facing.&lt;/p&gt;
&lt;h2&gt;How the architecture works&lt;/h2&gt;
&lt;p&gt;A client asks the catalog or signing service to authorize a specific file operation.&lt;/p&gt;
&lt;p&gt;The signer evaluates identity, table policy, path, method, and expiration before creating a signed request.&lt;/p&gt;
&lt;p&gt;The client executes only the signed operation, which can be audited and constrained more tightly than a general credential.&lt;/p&gt;
&lt;p&gt;The important architectural habit is to separate responsibilities. The table format manages files, snapshots, schema evolution, and table metadata. The catalog manages identity, namespaces, commits, and access patterns. The query engine plans and executes work. The semantic layer maps raw data into business meaning. The agent interface decides which safe tools a model can call.&lt;/p&gt;
&lt;p&gt;That separation keeps the system honest. If a vendor says a workload is open, ask which layer is open. If a feature supports Iceberg, ask which Iceberg version, which operations, and which engines. If an agent can query data, ask whether it is querying raw tables or certified semantic views.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/iceberg-remote-signing-regulated-datasets-diagram-2.png&quot; alt=&quot;Operating model diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;A concrete operating example&lt;/h2&gt;
&lt;p&gt;For a table containing protected health information, an external engine may be allowed to read approved data files through signed GET operations but never receive a token that could list nearby objects or write to the bucket.&lt;/p&gt;
&lt;p&gt;That example is intentionally operational. Architecture diagrams are useful, but the design only proves itself when a real workload runs through it. I want to know who owns the table, which catalog authorizes the operation, which engine writes, which engine reads, which semantic view users see, and how the team detects a bad result.&lt;/p&gt;
&lt;p&gt;For agentic analytics, the same example gets stricter. A human analyst can notice ambiguity and ask a teammate. An agent will often keep going unless the tool interface stops it. That means your architecture needs approved definitions, scoped access, query limits, logging, and a clean rollback path before it needs a flashy chat experience.&lt;/p&gt;
&lt;p&gt;This is why I do not treat open table formats as the whole story. Apache Iceberg gives the platform a strong storage contract. It does not, by itself, define customer lifetime value, revenue recognition rules, data owner approval, or what an AI agent may do after it finds an anomaly. Those rules belong in catalogs, semantic layers, governance systems, and agent tools.&lt;/p&gt;
&lt;h2&gt;What this means for the lakehouse&lt;/h2&gt;
&lt;p&gt;A lakehouse platform needs five capabilities to serve agents reliably: query federation to reduce data movement; autonomous performance using Reflections, caching, and table optimization so interactive loops stay fast; an AI Semantic Layer that gives agents approved business context; agentic interfaces through the UI, Python, or MCP-connected tools; and AI SQL functions that bring model-assisted work into SQL without exporting data.&lt;/p&gt;
&lt;h2&gt;Implementation checklist&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;What to document&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Table contract&lt;/td&gt;
&lt;td&gt;Format version, schema rules, snapshot policy, and rollback plan&lt;/td&gt;
&lt;td&gt;Engines need the same understanding of the table.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalog authority&lt;/td&gt;
&lt;td&gt;Production catalog, namespaces, commit rules, and role model&lt;/td&gt;
&lt;td&gt;Multi-engine systems need one source of table truth.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine matrix&lt;/td&gt;
&lt;td&gt;Read, write, merge, delete, schema, and view support by engine&lt;/td&gt;
&lt;td&gt;A feature is not production-ready until the exact operation is tested.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic layer&lt;/td&gt;
&lt;td&gt;Certified views, metric definitions, owners, and labels&lt;/td&gt;
&lt;td&gt;Agents need business meaning, not raw schemas alone.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Credential model, token lifetime, row filters, column masks, and audit logs&lt;/td&gt;
&lt;td&gt;Open access still needs strict governance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations&lt;/td&gt;
&lt;td&gt;Compaction, vacuum, retries, alerting, and incident ownership&lt;/td&gt;
&lt;td&gt;The design must survive failed jobs and bad deploys.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;My practical checklist for this topic is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use remote signing for PII, PHI, financial records, and high-value intellectual property.&lt;/li&gt;
&lt;li&gt;Keep signer policies path-aware and table-aware.&lt;/li&gt;
&lt;li&gt;Record every signed operation with identity, table, path, method, and expiration.&lt;/li&gt;
&lt;li&gt;Load test signer latency before production rollout.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those items are not written down, the project is still in the demo stage. That does not mean the idea is weak. It means the operating model is not finished.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/iceberg-remote-signing-regulated-datasets-diagram-3.png&quot; alt=&quot;Implementation checklist diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Failure modes worth respecting&lt;/h2&gt;
&lt;p&gt;The signer becomes a critical service. If it is slow, queries suffer. If it is unavailable, engines may lose access. If it logs poorly, the security gain is harder to prove during an audit.&lt;/p&gt;
&lt;p&gt;The other failure mode is semantic drift. A table can be technically valid while the business definition on top of it changes quietly. That is where many AI analytics projects fail. The model generates SQL against a table that exists, the query returns rows, and the answer looks plausible. The problem is that the answer used the wrong grain, the wrong filter, or the wrong metric definition.&lt;/p&gt;
&lt;p&gt;The fix is not a longer prompt. The fix is stronger data contracts. Certified semantic views should be easier for agents to use than raw tables. Sensitive columns should be masked or hidden before the model can ask for them. Write-capable tools should require intent, validation, and idempotency. Expensive queries should have limits. Every tool call should leave evidence.&lt;/p&gt;
&lt;p&gt;This is also where vendor-neutral thinking helps. Do not trust a platform because it has the best demo. Trust the platform when it gives you clear contracts between storage, catalog, semantic layer, engine, and agent. Trust it more when you can test those contracts with another engine or another client.&lt;/p&gt;
&lt;h2&gt;What I would do first&lt;/h2&gt;
&lt;p&gt;Start with one production-shaped workflow. Do not start with the easiest toy table, and do not start with the most politically sensitive workload. Pick a table or semantic view that matters, has an owner, has known correctness checks, and can tolerate a controlled pilot.&lt;/p&gt;
&lt;p&gt;For Iceberg remote signing, I would write down five things before touching production: the owner, the accepted engines, the policy boundary, the rollback path, and the agent-facing interface. Then I would run the same workflow three ways: manually, through the intended query engine, and through the agent or automation layer. Differences between those paths are where the real work begins.&lt;/p&gt;
&lt;p&gt;Measure boring things. Count files. Count snapshots. Track query planning time. Track storage calls. Track failed commits. Track token issuance. Track denied access. Track whether a human can explain the result without reading tool logs for an hour. These metrics are not glamorous, but they tell you whether the architecture is ready.&lt;/p&gt;
&lt;h2&gt;Final recommendation&lt;/h2&gt;
&lt;p&gt;The right conclusion is not that every team should adopt every June 2026 feature immediately. The right conclusion is that the lakehouse is becoming an execution surface for humans and agents, and that changes the quality bar. Open storage is necessary. Governed catalogs are necessary. Semantic context is necessary. Fast SQL is necessary. Scoped agent tools are necessary.&lt;/p&gt;
&lt;p&gt;That combination is exactly why the Agentic Lakehouse is becoming the right framing. It describes the platform you need when AI agents stop answering isolated questions and start participating in analytical workflows.&lt;/p&gt;
&lt;p&gt;For more background on the lakehouse and AI side of this work, explore my books on data lakehouses and AI at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;. If you want to try this style of governed, open, agent-ready architecture in practice, start a free trial of Dremio&apos;s Agentic Lakehouse at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Field notes for teams evaluating this now&lt;/h2&gt;
&lt;p&gt;First, make compatibility visible. A table-format version, catalog endpoint, and engine release should appear in your runbook. If a production issue happens, nobody should have to guess which engine wrote the latest snapshot or which client introduced a metadata change.&lt;/p&gt;
&lt;p&gt;Second, keep the semantic layer close to the workflow. If the article topic affects analytics agents, customer-facing metrics, financial reporting, or regulated data, raw-table access should be the exception. Certified views should be the normal path.&lt;/p&gt;
&lt;p&gt;Third, separate experimentation from certification. Engineers need sandboxes where they can test new Iceberg features, catalog options, and agent tools. Business users and agents need certified surfaces where definitions, owners, and policies have already been reviewed.&lt;/p&gt;
&lt;p&gt;Fourth, keep the architecture open. Not every byte must move into one platform. An architecture that can query data in place, add semantic context, accelerate common workloads, and expose governed agent interfaces over open data creates more flexibility.&lt;/p&gt;
&lt;p&gt;Fifth, publish the limits. If a feature is read-only in one engine, say so. If write interoperability is approved only for append workloads, say so. If remote signing is required for regulated tables, say so. Clear limits create trust. Hidden limits create incidents.&lt;/p&gt;
&lt;h2&gt;Identity and access review&lt;/h2&gt;
&lt;p&gt;For Iceberg remote signing, I would run one full dry run with production-like identities. Use an analyst identity, a service account, and the intended agent identity. Confirm that each identity sees only the expected semantic objects, receives predictable errors, and leaves useful audit records. That test catches policy gaps before they become production incidents.&lt;/p&gt;
&lt;p&gt;The agent identity matters most because it is easy to over-permission during a pilot. If the agent only needs a certified revenue view, do not give it namespace-wide table discovery. If the agent needs row-level access for one geography, test that a second geography returns a denial instead of silent leakage.&lt;/p&gt;
&lt;h2&gt;Documentation that actually helps&lt;/h2&gt;
&lt;p&gt;The documentation should fit on one page. Name the owner, the supported engines, the catalog authority, the accepted table operations, the security model, and the rollback path. If a new engineer cannot understand the contract for Iceberg remote signing from that page, the architecture is still too implicit.&lt;/p&gt;
&lt;p&gt;Good documentation is not a wiki dump. It is an operating contract. It should say who can approve a schema change, which engine owns compaction, how long snapshots are retained, and what happens when an agent produces a suspicious result. That level of detail is what turns a promising pattern into a maintainable system.&lt;/p&gt;
&lt;h2&gt;How to keep agents in bounds&lt;/h2&gt;
&lt;p&gt;Agents should not receive broad table access just because a human can ask broad questions. For Iceberg remote signing, expose narrow tools over certified views first. Add write-capable tools only after you have validation rules, idempotency keys, approval gates, and audit records that a reviewer can follow.&lt;/p&gt;
&lt;p&gt;The tool description should also be honest. If a tool returns estimated data, say estimated. If a tool excludes delayed transactions, say that. If a tool is read-only, make that clear in the name and policy. Agents work better when the interface gives them fewer chances to infer the wrong contract.&lt;/p&gt;
&lt;h2&gt;What to measure after launch&lt;/h2&gt;
&lt;p&gt;The first production month should be measurement-heavy. Track planning time, query latency, failed commits, denied access attempts, credential issuance, snapshot growth, and semantic-view usage. If Iceberg remote signing is helping, the evidence should show up in fewer manual workarounds and clearer operational ownership.&lt;/p&gt;
&lt;p&gt;I would also track human trust signals. Are analysts using the certified view more often? Are engineers filing fewer tickets about unclear table ownership? Are agents producing answers that reviewers can trace back to approved definitions? Those signals tell you whether the architecture is improving daily work, not just passing a benchmark.&lt;/p&gt;
&lt;h2&gt;A buyer question worth asking&lt;/h2&gt;
&lt;p&gt;The buyer question is simple: does this pattern increase choice without weakening governance? For Iceberg remote signing, the best answer is specific. It should name the table format, catalog contract, semantic surface, security controls, and engine support matrix. Anything less is a demo, not an operating model.&lt;/p&gt;
&lt;p&gt;This is where the architecture should stay disciplined. The point is not that open architecture is automatically better. The point is that open architecture gives you room to test engines, keep data in place, add semantic context, and still maintain control. That is a stronger argument than a generic platform claim.&lt;/p&gt;
&lt;h2&gt;A realistic rollout sequence&lt;/h2&gt;
&lt;p&gt;The rollout should start with read visibility, then move to operational automation, then consider action loops. For Iceberg remote signing, the first milestone is a certified read path with approved semantics. The second milestone is repeatable validation through CI or scheduled checks. The third milestone is agent access with narrow tools and strict audit.&lt;/p&gt;
&lt;p&gt;Write paths should come later unless the topic itself is about write interoperability or table maintenance. Even then, begin with append-only or isolated writes. Updates, deletes, merges, and external actions need stronger controls because they change the state other people depend on.&lt;/p&gt;
&lt;h2&gt;How this should sound to executives&lt;/h2&gt;
&lt;p&gt;The executive version should avoid implementation trivia, but it should not become vague. Say that Iceberg remote signing helps the company keep analytical data open, governed, and ready for AI-assisted work. Then say what the team will measure: cost, speed, correctness, access control, and operational effort.&lt;/p&gt;
&lt;p&gt;That framing is useful because executives do not need every catalog detail. They do need to know whether the architecture reduces lock-in, improves reliability, and gives agents a trustworthy data foundation. Those are business outcomes tied to technical choices.&lt;/p&gt;
&lt;h2&gt;How this should sound to engineers&lt;/h2&gt;
&lt;p&gt;The engineering version should be blunt. Which APIs are used? Which engine versions are approved? Which table operations are allowed? Which failures are retried? Which failures stop the workflow? Which logs prove that the right identity performed the right operation?&lt;/p&gt;
&lt;p&gt;For Iceberg remote signing, those questions are more valuable than broad claims. They force the team to define the boundary between the open standard, the vendor implementation, the query engine, the semantic model, and the agent tool.&lt;/p&gt;
&lt;h2&gt;What not to automate yet&lt;/h2&gt;
&lt;p&gt;Do not automate the parts of Iceberg remote signing that the team cannot explain manually. If nobody can explain the metric, the agent should not calculate it. If nobody can explain rollback, the agent should not write. If nobody can explain the security boundary, the tool should stay internal.&lt;/p&gt;
&lt;p&gt;This is not anti-automation. It is how automation earns trust. Automate the parts with clear contracts first, then widen the scope as evidence accumulates.&lt;/p&gt;
&lt;h2&gt;Source-of-truth ownership&lt;/h2&gt;
&lt;p&gt;Every production rollout needs one named source of truth for each layer. The table has an owner. The catalog has an owner. The semantic view has an owner. The agent tool has an owner. For Iceberg remote signing, those owners may sit on different teams, but the contract between them has to be explicit.&lt;/p&gt;
&lt;p&gt;Clear ownership across all layers keeps the architecture credible, whether the governed execution and semantic layer lives in one platform or across several independent services.&lt;/p&gt;
&lt;p&gt;Clear ownership prevents avoidable production confusion.&lt;/p&gt;
&lt;h2&gt;Review cadence&lt;/h2&gt;
&lt;p&gt;Set a review cadence before the first production launch. For Iceberg remote signing, I would review the contract after the first week, after the first month, and after the first engine or catalog upgrade. Most problems appear when a workflow that worked in a pilot meets a new version, a new identity, or a new business definition.&lt;/p&gt;
&lt;p&gt;That review should include both platform engineers and business owners. Engineers can verify the mechanics. Business owners can verify that the answers still mean what the company thinks they mean.&lt;/p&gt;
&lt;h2&gt;Launch criteria&lt;/h2&gt;
&lt;p&gt;The launch criteria should be binary. Either Iceberg remote signing has a named owner, passing validation checks, approved security boundaries, working rollback, and documented engine support, or it is not ready. Gray areas are acceptable in a research project. They are expensive in production.&lt;/p&gt;
&lt;p&gt;This keeps the article&apos;s recommendation practical: prove the contract first, then widen adoption.&lt;/p&gt;
&lt;h2&gt;Compliance evidence&lt;/h2&gt;
&lt;p&gt;Save the evidence. For Iceberg remote signing, keep validation output, approval records, denied-access tests, and rollback proof with the release notes. Future audits are easier when the team can show what it tested before launch.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Apache Iceberg v3 Deletion Vectors on Snowflake</title><link>https://iceberglakehouse.com/posts/iceberg-v3-deletion-vectors-snowflake-dml/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-v3-deletion-vectors-snowflake-dml/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-iceberg-v3-deletion-vectors-snow...</description><pubDate>Mon, 08 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-iceberg-v3-deletion-vectors-snowflake-dml/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Deletion vectors matter because row-level changes should not require a full rewrite of every affected data file. That is the useful lens for Apache Iceberg v3 deletion vectors in June 2026. The market is not short on announcements. What matters is whether the new pattern changes ownership, performance, governance, and agent readiness in a way your team can operate.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/iceberg-v3-deletion-vectors-snowflake-dml-diagram-1.png&quot; alt=&quot;Apache Iceberg v3 deletion vectors architecture diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The market signal behind Apache Iceberg v3 deletion vectors&lt;/h2&gt;
&lt;p&gt;Snowflake&apos;s Iceberg v3 support puts a long-running table-format debate into the hands of warehouse users: should deletes, updates, and merges rewrite data immediately, or should the system record row positions and let readers apply those changes later?&lt;/p&gt;
&lt;p&gt;I care about this topic because it sits at the boundary between open data architecture and AI execution. Most companies are not choosing one engine for every workload anymore. They have warehouses, lakehouse engines, streaming systems, catalogs, metadata platforms, and now agents that ask for data through tools. The shared contract between those systems matters more than any single feature checkbox.&lt;/p&gt;
&lt;p&gt;The vendor-neutral reading is straightforward. If the underlying table and catalog standards get stronger, buyers get more freedom to choose the right engine for each job. Snowflake, Microsoft, ClickHouse, Atlan, Dremio, and the open-source Iceberg ecosystem all point to the same market reality: data platforms are becoming multi-engine and agent-facing.&lt;/p&gt;
&lt;h2&gt;How the architecture works&lt;/h2&gt;
&lt;p&gt;Copy-on-write is easy to reason about because each DML operation creates replacement data files. The cost is write amplification. A small delete against a large Parquet file can rewrite far more bytes than the change logically touched.&lt;/p&gt;
&lt;p&gt;Merge-on-read shifts some work to read time. The writer records what changed, often as positional delete information, and the reader combines base files with delete metadata when planning and scanning the table.&lt;/p&gt;
&lt;p&gt;Iceberg v3 deletion vectors are a more compact way to represent row-level removals. Instead of materializing a long list of deleted positions, the table can carry bitmap-style delete information that points to rows inside a data file.&lt;/p&gt;
&lt;p&gt;The important architectural habit is to separate responsibilities. The table format manages files, snapshots, schema evolution, and table metadata. The catalog manages identity, namespaces, commits, and access patterns. The query engine plans and executes work. The semantic layer maps raw data into business meaning. The agent interface decides which safe tools a model can call.&lt;/p&gt;
&lt;p&gt;That separation keeps the system honest. If a vendor says a workload is open, ask which layer is open. If a feature supports Iceberg, ask which Iceberg version, which operations, and which engines. If an agent can query data, ask whether it is querying raw tables or certified semantic views.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/iceberg-v3-deletion-vectors-snowflake-dml-diagram-2.png&quot; alt=&quot;Operating model diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;A concrete operating example&lt;/h2&gt;
&lt;p&gt;Take a 2 TB fact table partitioned by event date. A compliance job needs to delete 0.2 percent of rows across 400 files. With copy-on-write, the engine may rewrite hundreds of large files. With deletion vectors, the write path can record deleted positions and finish much faster, then let compaction clean the layout later.&lt;/p&gt;
&lt;p&gt;That example is intentionally operational. Architecture diagrams are useful, but the design only proves itself when a real workload runs through it. I want to know who owns the table, which catalog authorizes the operation, which engine writes, which engine reads, which semantic view users see, and how the team detects a bad result.&lt;/p&gt;
&lt;p&gt;For agentic analytics, the same example gets stricter. A human analyst can notice ambiguity and ask a teammate. An agent will often keep going unless the tool interface stops it. That means your architecture needs approved definitions, scoped access, query limits, logging, and a clean rollback path before it needs a flashy chat experience.&lt;/p&gt;
&lt;p&gt;This is why I do not treat open table formats as the whole story. Apache Iceberg gives the platform a strong storage contract. It does not, by itself, define customer lifetime value, revenue recognition rules, data owner approval, or what an AI agent may do after it finds an anomaly. Those rules belong in catalogs, semantic layers, governance systems, and agent tools.&lt;/p&gt;
&lt;h2&gt;What this means for the lakehouse&lt;/h2&gt;
&lt;p&gt;The lakehouse should keep data in open formats while engines compete on planning, acceleration, and governance. Iceberg carrying richer table semantics makes the agentic lakehouse stronger: fast governed access through the semantic layer, Reflections, and automatic table optimization.&lt;/p&gt;
&lt;p&gt;A lakehouse platform needs five capabilities to serve agents reliably: query federation to reduce data movement; autonomous performance using Reflections, caching, and table optimization so interactive loops stay fast; an AI Semantic Layer that gives agents approved business context; agentic interfaces through the UI, Python, or MCP-connected tools; and AI SQL functions that bring model-assisted work into SQL without exporting data.&lt;/p&gt;
&lt;h2&gt;Implementation checklist&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;What to document&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Table contract&lt;/td&gt;
&lt;td&gt;Format version, schema rules, snapshot policy, and rollback plan&lt;/td&gt;
&lt;td&gt;Engines need the same understanding of the table.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalog authority&lt;/td&gt;
&lt;td&gt;Production catalog, namespaces, commit rules, and role model&lt;/td&gt;
&lt;td&gt;Multi-engine systems need one source of table truth.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine matrix&lt;/td&gt;
&lt;td&gt;Read, write, merge, delete, schema, and view support by engine&lt;/td&gt;
&lt;td&gt;A feature is not production-ready until the exact operation is tested.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic layer&lt;/td&gt;
&lt;td&gt;Certified views, metric definitions, owners, and labels&lt;/td&gt;
&lt;td&gt;Agents need business meaning, not raw schemas alone.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Credential model, token lifetime, row filters, column masks, and audit logs&lt;/td&gt;
&lt;td&gt;Open access still needs strict governance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations&lt;/td&gt;
&lt;td&gt;Compaction, vacuum, retries, alerting, and incident ownership&lt;/td&gt;
&lt;td&gt;The design must survive failed jobs and bad deploys.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;My practical checklist for this topic is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Benchmark update, delete, and merge separately instead of reporting one blended DML number.&lt;/li&gt;
&lt;li&gt;Measure planning time, write time, read time after deletes, and compaction cost.&lt;/li&gt;
&lt;li&gt;Keep one compatibility matrix for every engine that reads or writes the table.&lt;/li&gt;
&lt;li&gt;Test rollback to a snapshot before and after deletion-vector-heavy operations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those items are not written down, the project is still in the demo stage. That does not mean the idea is weak. It means the operating model is not finished.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/iceberg-v3-deletion-vectors-snowflake-dml-diagram-3.png&quot; alt=&quot;Implementation checklist diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Failure modes worth respecting&lt;/h2&gt;
&lt;p&gt;The tradeoff is reader complexity. A query engine must understand the table version, apply delete metadata correctly, and avoid treating a table with deletion vectors like a plain Parquet directory. Compaction and vacuum policies also matter more because delete metadata can accumulate.&lt;/p&gt;
&lt;p&gt;The other failure mode is semantic drift. A table can be technically valid while the business definition on top of it changes quietly. That is where many AI analytics projects fail. The model generates SQL against a table that exists, the query returns rows, and the answer looks plausible. The problem is that the answer used the wrong grain, the wrong filter, or the wrong metric definition.&lt;/p&gt;
&lt;p&gt;The fix is not a longer prompt. The fix is stronger data contracts. Certified semantic views should be easier for agents to use than raw tables. Sensitive columns should be masked or hidden before the model can ask for them. Write-capable tools should require intent, validation, and idempotency. Expensive queries should have limits. Every tool call should leave evidence.&lt;/p&gt;
&lt;p&gt;This is also where vendor-neutral thinking helps. Do not trust a platform because it has the best demo. Trust the platform when it gives you clear contracts between storage, catalog, semantic layer, engine, and agent. Trust it more when you can test those contracts with another engine or another client.&lt;/p&gt;
&lt;h2&gt;What I would do first&lt;/h2&gt;
&lt;p&gt;Start with one production-shaped workflow. Do not start with the easiest toy table, and do not start with the most politically sensitive workload. Pick a table or semantic view that matters, has an owner, has known correctness checks, and can tolerate a controlled pilot.&lt;/p&gt;
&lt;p&gt;For Apache Iceberg v3 deletion vectors, I would write down five things before touching production: the owner, the accepted engines, the policy boundary, the rollback path, and the agent-facing interface. Then I would run the same workflow three ways: manually, through the intended query engine, and through the agent or automation layer. Differences between those paths are where the real work begins.&lt;/p&gt;
&lt;p&gt;Measure boring things. Count files. Count snapshots. Track query planning time. Track storage calls. Track failed commits. Track token issuance. Track denied access. Track whether a human can explain the result without reading tool logs for an hour. These metrics are not glamorous, but they tell you whether the architecture is ready.&lt;/p&gt;
&lt;h2&gt;Final recommendation&lt;/h2&gt;
&lt;p&gt;The right conclusion is not that every team should adopt every June 2026 feature immediately. The right conclusion is that the lakehouse is becoming an execution surface for humans and agents, and that changes the quality bar. Open storage is necessary. Governed catalogs are necessary. Semantic context is necessary. Fast SQL is necessary. Scoped agent tools are necessary.&lt;/p&gt;
&lt;p&gt;That combination is exactly why the Agentic Lakehouse is becoming the right framing. It describes the platform you need when AI agents stop answering isolated questions and start participating in analytical workflows.&lt;/p&gt;
&lt;p&gt;For more background on the lakehouse and AI side of this work, explore my books on data lakehouses and AI at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;. If you want to try this style of governed, open, agent-ready architecture in practice, start a free trial of Dremio&apos;s Agentic Lakehouse at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Field notes for teams evaluating this now&lt;/h2&gt;
&lt;p&gt;First, make compatibility visible. A table-format version, catalog endpoint, and engine release should appear in your runbook. If a production issue happens, nobody should have to guess which engine wrote the latest snapshot or which client introduced a metadata change.&lt;/p&gt;
&lt;p&gt;Second, keep the semantic layer close to the workflow. If the article topic affects analytics agents, customer-facing metrics, financial reporting, or regulated data, raw-table access should be the exception. Certified views should be the normal path.&lt;/p&gt;
&lt;p&gt;Third, separate experimentation from certification. Engineers need sandboxes where they can test new Iceberg features, catalog options, and agent tools. Business users and agents need certified surfaces where definitions, owners, and policies have already been reviewed.&lt;/p&gt;
&lt;p&gt;Fourth, keep the architecture open. Not every byte must move into one platform. An architecture that can query data in place, add semantic context, accelerate common workloads, and expose governed agent interfaces over open data creates more flexibility.&lt;/p&gt;
&lt;p&gt;Fifth, publish the limits. If a feature is read-only in one engine, say so. If write interoperability is approved only for append workloads, say so. If remote signing is required for regulated tables, say so. Clear limits create trust. Hidden limits create incidents.&lt;/p&gt;
&lt;h2&gt;Identity and access review&lt;/h2&gt;
&lt;p&gt;For Apache Iceberg v3 deletion vectors, I would run one full dry run with production-like identities. Use an analyst identity, a service account, and the intended agent identity. Confirm that each identity sees only the expected semantic objects, receives predictable errors, and leaves useful audit records. That test catches policy gaps before they become production incidents.&lt;/p&gt;
&lt;p&gt;The agent identity matters most because it is easy to over-permission during a pilot. If the agent only needs a certified revenue view, do not give it namespace-wide table discovery. If the agent needs row-level access for one geography, test that a second geography returns a denial instead of silent leakage.&lt;/p&gt;
&lt;h2&gt;Documentation that actually helps&lt;/h2&gt;
&lt;p&gt;The documentation should fit on one page. Name the owner, the supported engines, the catalog authority, the accepted table operations, the security model, and the rollback path. If a new engineer cannot understand the contract for Apache Iceberg v3 deletion vectors from that page, the architecture is still too implicit.&lt;/p&gt;
&lt;p&gt;Good documentation is not a wiki dump. It is an operating contract. It should say who can approve a schema change, which engine owns compaction, how long snapshots are retained, and what happens when an agent produces a suspicious result. That level of detail is what turns a promising pattern into a maintainable system.&lt;/p&gt;
&lt;h2&gt;How to keep agents in bounds&lt;/h2&gt;
&lt;p&gt;Agents should not receive broad table access just because a human can ask broad questions. For Apache Iceberg v3 deletion vectors, expose narrow tools over certified views first. Add write-capable tools only after you have validation rules, idempotency keys, approval gates, and audit records that a reviewer can follow.&lt;/p&gt;
&lt;p&gt;The tool description should also be honest. If a tool returns estimated data, say estimated. If a tool excludes delayed transactions, say that. If a tool is read-only, make that clear in the name and policy. Agents work better when the interface gives them fewer chances to infer the wrong contract.&lt;/p&gt;
&lt;h2&gt;What to measure after launch&lt;/h2&gt;
&lt;p&gt;The first production month should be measurement-heavy. Track planning time, query latency, failed commits, denied access attempts, credential issuance, snapshot growth, and semantic-view usage. If Apache Iceberg v3 deletion vectors is helping, the evidence should show up in fewer manual workarounds and clearer operational ownership.&lt;/p&gt;
&lt;p&gt;I would also track human trust signals. Are analysts using the certified view more often? Are engineers filing fewer tickets about unclear table ownership? Are agents producing answers that reviewers can trace back to approved definitions? Those signals tell you whether the architecture is improving daily work, not just passing a benchmark.&lt;/p&gt;
&lt;h2&gt;A buyer question worth asking&lt;/h2&gt;
&lt;p&gt;The buyer question is simple: does this pattern increase choice without weakening governance? For Apache Iceberg v3 deletion vectors, the best answer is specific. It should name the table format, catalog contract, semantic surface, security controls, and engine support matrix. Anything less is a demo, not an operating model.&lt;/p&gt;
&lt;p&gt;This is where the architecture should stay disciplined. The point is not that open architecture is automatically better. The point is that open architecture gives you room to test engines, keep data in place, add semantic context, and still maintain control. That is a stronger argument than a generic platform claim.&lt;/p&gt;
&lt;h2&gt;A realistic rollout sequence&lt;/h2&gt;
&lt;p&gt;The rollout should start with read visibility, then move to operational automation, then consider action loops. For Apache Iceberg v3 deletion vectors, the first milestone is a certified read path with approved semantics. The second milestone is repeatable validation through CI or scheduled checks. The third milestone is agent access with narrow tools and strict audit.&lt;/p&gt;
&lt;p&gt;Write paths should come later unless the topic itself is about write interoperability or table maintenance. Even then, begin with append-only or isolated writes. Updates, deletes, merges, and external actions need stronger controls because they change the state other people depend on.&lt;/p&gt;
&lt;h2&gt;How this should sound to executives&lt;/h2&gt;
&lt;p&gt;The executive version should avoid implementation trivia, but it should not become vague. Say that Apache Iceberg v3 deletion vectors helps the company keep analytical data open, governed, and ready for AI-assisted work. Then say what the team will measure: cost, speed, correctness, access control, and operational effort.&lt;/p&gt;
&lt;p&gt;That framing is useful because executives do not need every catalog detail. They do need to know whether the architecture reduces lock-in, improves reliability, and gives agents a trustworthy data foundation. Those are business outcomes tied to technical choices.&lt;/p&gt;
&lt;h2&gt;How this should sound to engineers&lt;/h2&gt;
&lt;p&gt;The engineering version should be blunt. Which APIs are used? Which engine versions are approved? Which table operations are allowed? Which failures are retried? Which failures stop the workflow? Which logs prove that the right identity performed the right operation?&lt;/p&gt;
&lt;p&gt;For Apache Iceberg v3 deletion vectors, those questions are more valuable than broad claims. They force the team to define the boundary between the open standard, the vendor implementation, the query engine, the semantic model, and the agent tool.&lt;/p&gt;
&lt;h2&gt;What not to automate yet&lt;/h2&gt;
&lt;p&gt;Do not automate the parts of Apache Iceberg v3 deletion vectors that the team cannot explain manually. If nobody can explain the metric, the agent should not calculate it. If nobody can explain rollback, the agent should not write. If nobody can explain the security boundary, the tool should stay internal.&lt;/p&gt;
&lt;p&gt;This is not anti-automation. It is how automation earns trust. Automate the parts with clear contracts first, then widen the scope as evidence accumulates.&lt;/p&gt;
&lt;h2&gt;Source-of-truth ownership&lt;/h2&gt;
&lt;p&gt;Every production rollout needs one named source of truth for each layer. The table has an owner. The catalog has an owner. The semantic view has an owner. The agent tool has an owner. For Apache Iceberg v3 deletion vectors, those owners may sit on different teams, but the contract between them has to be explicit.&lt;/p&gt;
&lt;p&gt;Clear ownership across all layers keeps the architecture credible, whether the governed execution and semantic layer lives in one platform or across several independent services.&lt;/p&gt;
&lt;p&gt;Clear ownership prevents avoidable production confusion.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>CDC Without Complexity Using Iceberg v3 Row Lineage</title><link>https://iceberglakehouse.com/posts/iceberg-v3-row-lineage-cdc/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-v3-row-lineage-cdc/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-iceberg-v3-row-lineage-cdc/).

R...</description><pubDate>Mon, 08 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-iceberg-v3-row-lineage-cdc/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Row lineage gives Iceberg a native way to tell incremental consumers which rows changed and when they changed. That is the useful lens for Iceberg v3 row lineage in June 2026. The market is not short on announcements. What matters is whether the new pattern changes ownership, performance, governance, and agent readiness in a way your team can operate.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/iceberg-v3-row-lineage-cdc-diagram-1.png&quot; alt=&quot;Iceberg v3 row lineage architecture diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The market signal behind Iceberg v3 row lineage&lt;/h2&gt;
&lt;p&gt;Change data capture often starts as a simple requirement and becomes a stack of log readers, replay jobs, deduplication rules, and late-arriving event logic. Iceberg v3 does not remove every CDC problem, but the new lineage fields give table consumers better metadata than a blind table scan.&lt;/p&gt;
&lt;p&gt;I care about this topic because it sits at the boundary between open data architecture and AI execution. Most companies are not choosing one engine for every workload anymore. They have warehouses, lakehouse engines, streaming systems, catalogs, metadata platforms, and now agents that ask for data through tools. The shared contract between those systems matters more than any single feature checkbox.&lt;/p&gt;
&lt;p&gt;The vendor-neutral reading is straightforward. If the underlying table and catalog standards get stronger, buyers get more freedom to choose the right engine for each job. Snowflake, Microsoft, ClickHouse, Atlan, Dremio, and the open-source Iceberg ecosystem all point to the same market reality: data platforms are becoming multi-engine and agent-facing.&lt;/p&gt;
&lt;h2&gt;How the architecture works&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;_row_id&lt;/code&gt; gives newly written rows a stable identity inside the table. It lets consumers track rows across snapshots without inventing identity from business columns that may not be stable.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;_last_updated_sequence_number&lt;/code&gt; tells a consumer the sequence number associated with the most recent update for that row. That gives downstream jobs a native filter for incremental processing.&lt;/p&gt;
&lt;p&gt;The design fits Iceberg&apos;s snapshot model. Instead of pretending the table is a message log, it exposes lineage through table metadata and row-level fields that engines can read with normal scan planning.&lt;/p&gt;
&lt;p&gt;The important architectural habit is to separate responsibilities. The table format manages files, snapshots, schema evolution, and table metadata. The catalog manages identity, namespaces, commits, and access patterns. The query engine plans and executes work. The semantic layer maps raw data into business meaning. The agent interface decides which safe tools a model can call.&lt;/p&gt;
&lt;p&gt;That separation keeps the system honest. If a vendor says a workload is open, ask which layer is open. If a feature supports Iceberg, ask which Iceberg version, which operations, and which engines. If an agent can query data, ask whether it is querying raw tables or certified semantic views.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/iceberg-v3-row-lineage-cdc-diagram-2.png&quot; alt=&quot;Operating model diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;A concrete operating example&lt;/h2&gt;
&lt;p&gt;A customer health model may need every row that changed since the last scoring run. Without row lineage, the team may compare snapshots, scan update timestamps, or read source logs. With v3 lineage, the incremental reader can work against Iceberg metadata and sequence numbers directly.&lt;/p&gt;
&lt;p&gt;That example is intentionally operational. Architecture diagrams are useful, but the design only proves itself when a real workload runs through it. I want to know who owns the table, which catalog authorizes the operation, which engine writes, which engine reads, which semantic view users see, and how the team detects a bad result.&lt;/p&gt;
&lt;p&gt;For agentic analytics, the same example gets stricter. A human analyst can notice ambiguity and ask a teammate. An agent will often keep going unless the tool interface stops it. That means your architecture needs approved definitions, scoped access, query limits, logging, and a clean rollback path before it needs a flashy chat experience.&lt;/p&gt;
&lt;p&gt;This is why I do not treat open table formats as the whole story. Apache Iceberg gives the platform a strong storage contract. It does not, by itself, define customer lifetime value, revenue recognition rules, data owner approval, or what an AI agent may do after it finds an anomaly. Those rules belong in catalogs, semantic layers, governance systems, and agent tools.&lt;/p&gt;
&lt;h2&gt;What this means for the lakehouse&lt;/h2&gt;
&lt;p&gt;Agentic analytics depends on trusted incremental context. A semantic layer can present approved business views while the underlying Iceberg table carries row-level lineage. The agent asks a business question, the platform maps it to governed SQL, and the table format supplies the change-aware foundation.&lt;/p&gt;
&lt;p&gt;A lakehouse platform needs five capabilities to serve agents reliably: query federation to reduce data movement; autonomous performance using Reflections, caching, and table optimization so interactive loops stay fast; an AI Semantic Layer that gives agents approved business context; agentic interfaces through the UI, Python, or MCP-connected tools; and AI SQL functions that bring model-assisted work into SQL without exporting data.&lt;/p&gt;
&lt;h2&gt;Implementation checklist&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;What to document&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Table contract&lt;/td&gt;
&lt;td&gt;Format version, schema rules, snapshot policy, and rollback plan&lt;/td&gt;
&lt;td&gt;Engines need the same understanding of the table.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalog authority&lt;/td&gt;
&lt;td&gt;Production catalog, namespaces, commit rules, and role model&lt;/td&gt;
&lt;td&gt;Multi-engine systems need one source of table truth.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine matrix&lt;/td&gt;
&lt;td&gt;Read, write, merge, delete, schema, and view support by engine&lt;/td&gt;
&lt;td&gt;A feature is not production-ready until the exact operation is tested.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic layer&lt;/td&gt;
&lt;td&gt;Certified views, metric definitions, owners, and labels&lt;/td&gt;
&lt;td&gt;Agents need business meaning, not raw schemas alone.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Credential model, token lifetime, row filters, column masks, and audit logs&lt;/td&gt;
&lt;td&gt;Open access still needs strict governance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations&lt;/td&gt;
&lt;td&gt;Compaction, vacuum, retries, alerting, and incident ownership&lt;/td&gt;
&lt;td&gt;The design must survive failed jobs and bad deploys.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;My practical checklist for this topic is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Define the checkpoint as an Iceberg sequence number, not as wall-clock time.&lt;/li&gt;
&lt;li&gt;Record each consumer&apos;s last processed snapshot and last processed sequence number.&lt;/li&gt;
&lt;li&gt;Expose incremental views through a governed query layer instead of handing agents raw table internals.&lt;/li&gt;
&lt;li&gt;Backfill one model from scratch and compare it to the lineage-based incremental result.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those items are not written down, the project is still in the demo stage. That does not mean the idea is weak. It means the operating model is not finished.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/iceberg-v3-row-lineage-cdc-diagram-3.png&quot; alt=&quot;Implementation checklist diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Failure modes worth respecting&lt;/h2&gt;
&lt;p&gt;Row lineage describes what landed in the Iceberg table. It does not replace every upstream event log, and it does not explain business intent. If an upstream system writes a correction, lineage can show that a row changed, but your semantic layer still has to explain whether the correction should change revenue, churn, inventory, or compliance reporting.&lt;/p&gt;
&lt;p&gt;The other failure mode is semantic drift. A table can be technically valid while the business definition on top of it changes quietly. That is where many AI analytics projects fail. The model generates SQL against a table that exists, the query returns rows, and the answer looks plausible. The problem is that the answer used the wrong grain, the wrong filter, or the wrong metric definition.&lt;/p&gt;
&lt;p&gt;The fix is not a longer prompt. The fix is stronger data contracts. Certified semantic views should be easier for agents to use than raw tables. Sensitive columns should be masked or hidden before the model can ask for them. Write-capable tools should require intent, validation, and idempotency. Expensive queries should have limits. Every tool call should leave evidence.&lt;/p&gt;
&lt;p&gt;This is also where vendor-neutral thinking helps. Do not trust a platform because it has the best demo. Trust the platform when it gives you clear contracts between storage, catalog, semantic layer, engine, and agent. Trust it more when you can test those contracts with another engine or another client.&lt;/p&gt;
&lt;h2&gt;What I would do first&lt;/h2&gt;
&lt;p&gt;Start with one production-shaped workflow. Do not start with the easiest toy table, and do not start with the most politically sensitive workload. Pick a table or semantic view that matters, has an owner, has known correctness checks, and can tolerate a controlled pilot.&lt;/p&gt;
&lt;p&gt;For Iceberg v3 row lineage, I would write down five things before touching production: the owner, the accepted engines, the policy boundary, the rollback path, and the agent-facing interface. Then I would run the same workflow three ways: manually, through the intended query engine, and through the agent or automation layer. Differences between those paths are where the real work begins.&lt;/p&gt;
&lt;p&gt;Measure boring things. Count files. Count snapshots. Track query planning time. Track storage calls. Track failed commits. Track token issuance. Track denied access. Track whether a human can explain the result without reading tool logs for an hour. These metrics are not glamorous, but they tell you whether the architecture is ready.&lt;/p&gt;
&lt;h2&gt;Final recommendation&lt;/h2&gt;
&lt;p&gt;The right conclusion is not that every team should adopt every June 2026 feature immediately. The right conclusion is that the lakehouse is becoming an execution surface for humans and agents, and that changes the quality bar. Open storage is necessary. Governed catalogs are necessary. Semantic context is necessary. Fast SQL is necessary. Scoped agent tools are necessary.&lt;/p&gt;
&lt;p&gt;That combination is exactly why the Agentic Lakehouse is becoming the right framing. It describes the platform you need when AI agents stop answering isolated questions and start participating in analytical workflows.&lt;/p&gt;
&lt;p&gt;For more background on the lakehouse and AI side of this work, explore my books on data lakehouses and AI at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;. If you want to try this style of governed, open, agent-ready architecture in practice, start a free trial of Dremio&apos;s Agentic Lakehouse at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Field notes for teams evaluating this now&lt;/h2&gt;
&lt;p&gt;First, make compatibility visible. A table-format version, catalog endpoint, and engine release should appear in your runbook. If a production issue happens, nobody should have to guess which engine wrote the latest snapshot or which client introduced a metadata change.&lt;/p&gt;
&lt;p&gt;Second, keep the semantic layer close to the workflow. If the article topic affects analytics agents, customer-facing metrics, financial reporting, or regulated data, raw-table access should be the exception. Certified views should be the normal path.&lt;/p&gt;
&lt;p&gt;Third, separate experimentation from certification. Engineers need sandboxes where they can test new Iceberg features, catalog options, and agent tools. Business users and agents need certified surfaces where definitions, owners, and policies have already been reviewed.&lt;/p&gt;
&lt;p&gt;Fourth, keep the architecture open. Not every byte must move into one platform. An architecture that can query data in place, add semantic context, accelerate common workloads, and expose governed agent interfaces over open data creates more flexibility.&lt;/p&gt;
&lt;p&gt;Fifth, publish the limits. If a feature is read-only in one engine, say so. If write interoperability is approved only for append workloads, say so. If remote signing is required for regulated tables, say so. Clear limits create trust. Hidden limits create incidents.&lt;/p&gt;
&lt;h2&gt;Identity and access review&lt;/h2&gt;
&lt;p&gt;For Iceberg v3 row lineage, I would run one full dry run with production-like identities. Use an analyst identity, a service account, and the intended agent identity. Confirm that each identity sees only the expected semantic objects, receives predictable errors, and leaves useful audit records. That test catches policy gaps before they become production incidents.&lt;/p&gt;
&lt;p&gt;The agent identity matters most because it is easy to over-permission during a pilot. If the agent only needs a certified revenue view, do not give it namespace-wide table discovery. If the agent needs row-level access for one geography, test that a second geography returns a denial instead of silent leakage.&lt;/p&gt;
&lt;h2&gt;Documentation that actually helps&lt;/h2&gt;
&lt;p&gt;The documentation should fit on one page. Name the owner, the supported engines, the catalog authority, the accepted table operations, the security model, and the rollback path. If a new engineer cannot understand the contract for Iceberg v3 row lineage from that page, the architecture is still too implicit.&lt;/p&gt;
&lt;p&gt;Good documentation is not a wiki dump. It is an operating contract. It should say who can approve a schema change, which engine owns compaction, how long snapshots are retained, and what happens when an agent produces a suspicious result. That level of detail is what turns a promising pattern into a maintainable system.&lt;/p&gt;
&lt;h2&gt;How to keep agents in bounds&lt;/h2&gt;
&lt;p&gt;Agents should not receive broad table access just because a human can ask broad questions. For Iceberg v3 row lineage, expose narrow tools over certified views first. Add write-capable tools only after you have validation rules, idempotency keys, approval gates, and audit records that a reviewer can follow.&lt;/p&gt;
&lt;p&gt;The tool description should also be honest. If a tool returns estimated data, say estimated. If a tool excludes delayed transactions, say that. If a tool is read-only, make that clear in the name and policy. Agents work better when the interface gives them fewer chances to infer the wrong contract.&lt;/p&gt;
&lt;h2&gt;What to measure after launch&lt;/h2&gt;
&lt;p&gt;The first production month should be measurement-heavy. Track planning time, query latency, failed commits, denied access attempts, credential issuance, snapshot growth, and semantic-view usage. If Iceberg v3 row lineage is helping, the evidence should show up in fewer manual workarounds and clearer operational ownership.&lt;/p&gt;
&lt;p&gt;I would also track human trust signals. Are analysts using the certified view more often? Are engineers filing fewer tickets about unclear table ownership? Are agents producing answers that reviewers can trace back to approved definitions? Those signals tell you whether the architecture is improving daily work, not just passing a benchmark.&lt;/p&gt;
&lt;h2&gt;A buyer question worth asking&lt;/h2&gt;
&lt;p&gt;The buyer question is simple: does this pattern increase choice without weakening governance? For Iceberg v3 row lineage, the best answer is specific. It should name the table format, catalog contract, semantic surface, security controls, and engine support matrix. Anything less is a demo, not an operating model.&lt;/p&gt;
&lt;p&gt;This is where the architecture should stay disciplined. The point is not that open architecture is automatically better. The point is that open architecture gives you room to test engines, keep data in place, add semantic context, and still maintain control. That is a stronger argument than a generic platform claim.&lt;/p&gt;
&lt;h2&gt;A realistic rollout sequence&lt;/h2&gt;
&lt;p&gt;The rollout should start with read visibility, then move to operational automation, then consider action loops. For Iceberg v3 row lineage, the first milestone is a certified read path with approved semantics. The second milestone is repeatable validation through CI or scheduled checks. The third milestone is agent access with narrow tools and strict audit.&lt;/p&gt;
&lt;p&gt;Write paths should come later unless the topic itself is about write interoperability or table maintenance. Even then, begin with append-only or isolated writes. Updates, deletes, merges, and external actions need stronger controls because they change the state other people depend on.&lt;/p&gt;
&lt;h2&gt;How this should sound to executives&lt;/h2&gt;
&lt;p&gt;The executive version should avoid implementation trivia, but it should not become vague. Say that Iceberg v3 row lineage helps the company keep analytical data open, governed, and ready for AI-assisted work. Then say what the team will measure: cost, speed, correctness, access control, and operational effort.&lt;/p&gt;
&lt;p&gt;That framing is useful because executives do not need every catalog detail. They do need to know whether the architecture reduces lock-in, improves reliability, and gives agents a trustworthy data foundation. Those are business outcomes tied to technical choices.&lt;/p&gt;
&lt;h2&gt;How this should sound to engineers&lt;/h2&gt;
&lt;p&gt;The engineering version should be blunt. Which APIs are used? Which engine versions are approved? Which table operations are allowed? Which failures are retried? Which failures stop the workflow? Which logs prove that the right identity performed the right operation?&lt;/p&gt;
&lt;p&gt;For Iceberg v3 row lineage, those questions are more valuable than broad claims. They force the team to define the boundary between the open standard, the vendor implementation, the query engine, the semantic model, and the agent tool.&lt;/p&gt;
&lt;h2&gt;What not to automate yet&lt;/h2&gt;
&lt;p&gt;Do not automate the parts of Iceberg v3 row lineage that the team cannot explain manually. If nobody can explain the metric, the agent should not calculate it. If nobody can explain rollback, the agent should not write. If nobody can explain the security boundary, the tool should stay internal.&lt;/p&gt;
&lt;p&gt;This is not anti-automation. It is how automation earns trust. Automate the parts with clear contracts first, then widen the scope as evidence accumulates.&lt;/p&gt;
&lt;h2&gt;Source-of-truth ownership&lt;/h2&gt;
&lt;p&gt;Every production rollout needs one named source of truth for each layer. The table has an owner. The catalog has an owner. The semantic view has an owner. The agent tool has an owner. For Iceberg v3 row lineage, those owners may sit on different teams, but the contract between them has to be explicit.&lt;/p&gt;
&lt;p&gt;Clear ownership across all layers keeps the architecture credible, whether the governed execution and semantic layer lives in one platform or across several independent services.&lt;/p&gt;
&lt;p&gt;Clear ownership prevents avoidable production confusion.&lt;/p&gt;
&lt;h2&gt;Review cadence&lt;/h2&gt;
&lt;p&gt;Set a review cadence before the first production launch. For Iceberg v3 row lineage, I would review the contract after the first week, after the first month, and after the first engine or catalog upgrade. Most problems appear when a workflow that worked in a pilot meets a new version, a new identity, or a new business definition.&lt;/p&gt;
&lt;p&gt;That review should include both platform engineers and business owners. Engineers can verify the mechanics. Business owners can verify that the answers still mean what the company thinks they mean.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The 2026 Guide to Iceberg View Federation</title><link>https://iceberglakehouse.com/posts/iceberg-view-federation-portable-sql-2026/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-view-federation-portable-sql-2026/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-iceberg-view-federation-portable...</description><pubDate>Mon, 08 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-iceberg-view-federation-portable-sql-2026/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Portable views are the missing logic layer between open tables and multi-engine analytics. That is the useful lens for Iceberg view federation in June 2026. The market is not short on announcements. What matters is whether the new pattern changes ownership, performance, governance, and agent readiness in a way your team can operate.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/iceberg-view-federation-portable-sql-2026-diagram-1.png&quot; alt=&quot;Iceberg view federation architecture diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The market signal behind Iceberg view federation&lt;/h2&gt;
&lt;p&gt;Open table formats solved a storage problem, not a business-logic problem. If Spark, Snowflake, Trino, and Dremio each define revenue or active customer differently, the lakehouse still produces inconsistent answers.&lt;/p&gt;
&lt;p&gt;I care about this topic because it sits at the boundary between open data architecture and AI execution. Most companies are not choosing one engine for every workload anymore. They have warehouses, lakehouse engines, streaming systems, catalogs, metadata platforms, and now agents that ask for data through tools. The shared contract between those systems matters more than any single feature checkbox.&lt;/p&gt;
&lt;p&gt;The vendor-neutral reading is straightforward. If the underlying table and catalog standards get stronger, buyers get more freedom to choose the right engine for each job. Snowflake, Microsoft, ClickHouse, Atlan, Dremio, and the open-source Iceberg ecosystem all point to the same market reality: data platforms are becoming multi-engine and agent-facing.&lt;/p&gt;
&lt;h2&gt;How the architecture works&lt;/h2&gt;
&lt;p&gt;Portable views aim to store logical definitions in a form multiple engines can understand.&lt;/p&gt;
&lt;p&gt;UDF proposals matter because business logic often lives in functions, not just plain SELECT statements.&lt;/p&gt;
&lt;p&gt;View federation requires both syntax compatibility and policy compatibility. A view that runs everywhere but leaks restricted columns is not portable in any useful sense.&lt;/p&gt;
&lt;p&gt;The important architectural habit is to separate responsibilities. The table format manages files, snapshots, schema evolution, and table metadata. The catalog manages identity, namespaces, commits, and access patterns. The query engine plans and executes work. The semantic layer maps raw data into business meaning. The agent interface decides which safe tools a model can call.&lt;/p&gt;
&lt;p&gt;That separation keeps the system honest. If a vendor says a workload is open, ask which layer is open. If a feature supports Iceberg, ask which Iceberg version, which operations, and which engines. If an agent can query data, ask whether it is querying raw tables or certified semantic views.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/iceberg-view-federation-portable-sql-2026-diagram-2.png&quot; alt=&quot;Operating model diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;A concrete operating example&lt;/h2&gt;
&lt;p&gt;A customer 360 view may join CRM, billing, product usage, and support data. The business wants that view to mean the same thing whether a human queries it in BI or an agent queries it through a tool.&lt;/p&gt;
&lt;p&gt;That example is intentionally operational. Architecture diagrams are useful, but the design only proves itself when a real workload runs through it. I want to know who owns the table, which catalog authorizes the operation, which engine writes, which engine reads, which semantic view users see, and how the team detects a bad result.&lt;/p&gt;
&lt;p&gt;For agentic analytics, the same example gets stricter. A human analyst can notice ambiguity and ask a teammate. An agent will often keep going unless the tool interface stops it. That means your architecture needs approved definitions, scoped access, query limits, logging, and a clean rollback path before it needs a flashy chat experience.&lt;/p&gt;
&lt;p&gt;This is why I do not treat open table formats as the whole story. Apache Iceberg gives the platform a strong storage contract. It does not, by itself, define customer lifetime value, revenue recognition rules, data owner approval, or what an AI agent may do after it finds an anomaly. Those rules belong in catalogs, semantic layers, governance systems, and agent tools.&lt;/p&gt;
&lt;h2&gt;What this means for the lakehouse&lt;/h2&gt;
&lt;p&gt;A semantic layer treats views, wikis, labels, and governed datasets as the AI context layer. Iceberg view federation makes the market more receptive to this: table formats are the base, but shared business logic is where trustworthy agentic analytics starts.&lt;/p&gt;
&lt;p&gt;A lakehouse platform needs five capabilities to serve agents reliably: query federation to reduce data movement; autonomous performance using Reflections, caching, and table optimization so interactive loops stay fast; an AI Semantic Layer that gives agents approved business context; agentic interfaces through the UI, Python, or MCP-connected tools; and AI SQL functions that bring model-assisted work into SQL without exporting data.&lt;/p&gt;
&lt;h2&gt;Implementation checklist&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;What to document&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Table contract&lt;/td&gt;
&lt;td&gt;Format version, schema rules, snapshot policy, and rollback plan&lt;/td&gt;
&lt;td&gt;Engines need the same understanding of the table.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalog authority&lt;/td&gt;
&lt;td&gt;Production catalog, namespaces, commit rules, and role model&lt;/td&gt;
&lt;td&gt;Multi-engine systems need one source of table truth.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine matrix&lt;/td&gt;
&lt;td&gt;Read, write, merge, delete, schema, and view support by engine&lt;/td&gt;
&lt;td&gt;A feature is not production-ready until the exact operation is tested.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic layer&lt;/td&gt;
&lt;td&gt;Certified views, metric definitions, owners, and labels&lt;/td&gt;
&lt;td&gt;Agents need business meaning, not raw schemas alone.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Credential model, token lifetime, row filters, column masks, and audit logs&lt;/td&gt;
&lt;td&gt;Open access still needs strict governance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations&lt;/td&gt;
&lt;td&gt;Compaction, vacuum, retries, alerting, and incident ownership&lt;/td&gt;
&lt;td&gt;The design must survive failed jobs and bad deploys.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;My practical checklist for this topic is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Start with views that are important and relatively simple.&lt;/li&gt;
&lt;li&gt;Build a cross-engine test suite for core metrics.&lt;/li&gt;
&lt;li&gt;Document unsupported functions instead of hiding them.&lt;/li&gt;
&lt;li&gt;Expose only certified views to AI agents.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those items are not written down, the project is still in the demo stage. That does not mean the idea is weak. It means the operating model is not finished.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/iceberg-view-federation-portable-sql-2026-diagram-3.png&quot; alt=&quot;Implementation checklist diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Failure modes worth respecting&lt;/h2&gt;
&lt;p&gt;SQL dialects do not disappear. Date functions, null rules, case sensitivity, and permissions can still differ across engines. Portable views need compatibility tests, not blind faith.&lt;/p&gt;
&lt;p&gt;The other failure mode is semantic drift. A table can be technically valid while the business definition on top of it changes quietly. That is where many AI analytics projects fail. The model generates SQL against a table that exists, the query returns rows, and the answer looks plausible. The problem is that the answer used the wrong grain, the wrong filter, or the wrong metric definition.&lt;/p&gt;
&lt;p&gt;The fix is not a longer prompt. The fix is stronger data contracts. Certified semantic views should be easier for agents to use than raw tables. Sensitive columns should be masked or hidden before the model can ask for them. Write-capable tools should require intent, validation, and idempotency. Expensive queries should have limits. Every tool call should leave evidence.&lt;/p&gt;
&lt;p&gt;This is also where vendor-neutral thinking helps. Do not trust a platform because it has the best demo. Trust the platform when it gives you clear contracts between storage, catalog, semantic layer, engine, and agent. Trust it more when you can test those contracts with another engine or another client.&lt;/p&gt;
&lt;h2&gt;What I would do first&lt;/h2&gt;
&lt;p&gt;Start with one production-shaped workflow. Do not start with the easiest toy table, and do not start with the most politically sensitive workload. Pick a table or semantic view that matters, has an owner, has known correctness checks, and can tolerate a controlled pilot.&lt;/p&gt;
&lt;p&gt;For Iceberg view federation, I would write down five things before touching production: the owner, the accepted engines, the policy boundary, the rollback path, and the agent-facing interface. Then I would run the same workflow three ways: manually, through the intended query engine, and through the agent or automation layer. Differences between those paths are where the real work begins.&lt;/p&gt;
&lt;p&gt;Measure boring things. Count files. Count snapshots. Track query planning time. Track storage calls. Track failed commits. Track token issuance. Track denied access. Track whether a human can explain the result without reading tool logs for an hour. These metrics are not glamorous, but they tell you whether the architecture is ready.&lt;/p&gt;
&lt;h2&gt;Final recommendation&lt;/h2&gt;
&lt;p&gt;The right conclusion is not that every team should adopt every June 2026 feature immediately. The right conclusion is that the lakehouse is becoming an execution surface for humans and agents, and that changes the quality bar. Open storage is necessary. Governed catalogs are necessary. Semantic context is necessary. Fast SQL is necessary. Scoped agent tools are necessary.&lt;/p&gt;
&lt;p&gt;That combination is exactly why the Agentic Lakehouse is becoming the right framing. It describes the platform you need when AI agents stop answering isolated questions and start participating in analytical workflows.&lt;/p&gt;
&lt;p&gt;For more background on the lakehouse and AI side of this work, explore my books on data lakehouses and AI at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;. If you want to try this style of governed, open, agent-ready architecture in practice, start a free trial of Dremio&apos;s Agentic Lakehouse at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Field notes for teams evaluating this now&lt;/h2&gt;
&lt;p&gt;First, make compatibility visible. A table-format version, catalog endpoint, and engine release should appear in your runbook. If a production issue happens, nobody should have to guess which engine wrote the latest snapshot or which client introduced a metadata change.&lt;/p&gt;
&lt;p&gt;Second, keep the semantic layer close to the workflow. If the article topic affects analytics agents, customer-facing metrics, financial reporting, or regulated data, raw-table access should be the exception. Certified views should be the normal path.&lt;/p&gt;
&lt;p&gt;Third, separate experimentation from certification. Engineers need sandboxes where they can test new Iceberg features, catalog options, and agent tools. Business users and agents need certified surfaces where definitions, owners, and policies have already been reviewed.&lt;/p&gt;
&lt;p&gt;Fourth, keep the architecture open. Not every byte must move into one platform. An architecture that can query data in place, add semantic context, accelerate common workloads, and expose governed agent interfaces over open data creates more flexibility.&lt;/p&gt;
&lt;p&gt;Fifth, publish the limits. If a feature is read-only in one engine, say so. If write interoperability is approved only for append workloads, say so. If remote signing is required for regulated tables, say so. Clear limits create trust. Hidden limits create incidents.&lt;/p&gt;
&lt;h2&gt;Identity and access review&lt;/h2&gt;
&lt;p&gt;For Iceberg view federation, I would run one full dry run with production-like identities. Use an analyst identity, a service account, and the intended agent identity. Confirm that each identity sees only the expected semantic objects, receives predictable errors, and leaves useful audit records. That test catches policy gaps before they become production incidents.&lt;/p&gt;
&lt;p&gt;The agent identity matters most because it is easy to over-permission during a pilot. If the agent only needs a certified revenue view, do not give it namespace-wide table discovery. If the agent needs row-level access for one geography, test that a second geography returns a denial instead of silent leakage.&lt;/p&gt;
&lt;h2&gt;Documentation that actually helps&lt;/h2&gt;
&lt;p&gt;The documentation should fit on one page. Name the owner, the supported engines, the catalog authority, the accepted table operations, the security model, and the rollback path. If a new engineer cannot understand the contract for Iceberg view federation from that page, the architecture is still too implicit.&lt;/p&gt;
&lt;p&gt;Good documentation is not a wiki dump. It is an operating contract. It should say who can approve a schema change, which engine owns compaction, how long snapshots are retained, and what happens when an agent produces a suspicious result. That level of detail is what turns a promising pattern into a maintainable system.&lt;/p&gt;
&lt;h2&gt;How to keep agents in bounds&lt;/h2&gt;
&lt;p&gt;Agents should not receive broad table access just because a human can ask broad questions. For Iceberg view federation, expose narrow tools over certified views first. Add write-capable tools only after you have validation rules, idempotency keys, approval gates, and audit records that a reviewer can follow.&lt;/p&gt;
&lt;p&gt;The tool description should also be honest. If a tool returns estimated data, say estimated. If a tool excludes delayed transactions, say that. If a tool is read-only, make that clear in the name and policy. Agents work better when the interface gives them fewer chances to infer the wrong contract.&lt;/p&gt;
&lt;h2&gt;What to measure after launch&lt;/h2&gt;
&lt;p&gt;The first production month should be measurement-heavy. Track planning time, query latency, failed commits, denied access attempts, credential issuance, snapshot growth, and semantic-view usage. If Iceberg view federation is helping, the evidence should show up in fewer manual workarounds and clearer operational ownership.&lt;/p&gt;
&lt;p&gt;I would also track human trust signals. Are analysts using the certified view more often? Are engineers filing fewer tickets about unclear table ownership? Are agents producing answers that reviewers can trace back to approved definitions? Those signals tell you whether the architecture is improving daily work, not just passing a benchmark.&lt;/p&gt;
&lt;h2&gt;A buyer question worth asking&lt;/h2&gt;
&lt;p&gt;The buyer question is simple: does this pattern increase choice without weakening governance? For Iceberg view federation, the best answer is specific. It should name the table format, catalog contract, semantic surface, security controls, and engine support matrix. Anything less is a demo, not an operating model.&lt;/p&gt;
&lt;p&gt;This is where the architecture should stay disciplined. The point is not that open architecture is automatically better. The point is that open architecture gives you room to test engines, keep data in place, add semantic context, and still maintain control. That is a stronger argument than a generic platform claim.&lt;/p&gt;
&lt;h2&gt;A realistic rollout sequence&lt;/h2&gt;
&lt;p&gt;The rollout should start with read visibility, then move to operational automation, then consider action loops. For Iceberg view federation, the first milestone is a certified read path with approved semantics. The second milestone is repeatable validation through CI or scheduled checks. The third milestone is agent access with narrow tools and strict audit.&lt;/p&gt;
&lt;p&gt;Write paths should come later unless the topic itself is about write interoperability or table maintenance. Even then, begin with append-only or isolated writes. Updates, deletes, merges, and external actions need stronger controls because they change the state other people depend on.&lt;/p&gt;
&lt;h2&gt;How this should sound to executives&lt;/h2&gt;
&lt;p&gt;The executive version should avoid implementation trivia, but it should not become vague. Say that Iceberg view federation helps the company keep analytical data open, governed, and ready for AI-assisted work. Then say what the team will measure: cost, speed, correctness, access control, and operational effort.&lt;/p&gt;
&lt;p&gt;That framing is useful because executives do not need every catalog detail. They do need to know whether the architecture reduces lock-in, improves reliability, and gives agents a trustworthy data foundation. Those are business outcomes tied to technical choices.&lt;/p&gt;
&lt;h2&gt;How this should sound to engineers&lt;/h2&gt;
&lt;p&gt;The engineering version should be blunt. Which APIs are used? Which engine versions are approved? Which table operations are allowed? Which failures are retried? Which failures stop the workflow? Which logs prove that the right identity performed the right operation?&lt;/p&gt;
&lt;p&gt;For Iceberg view federation, those questions are more valuable than broad claims. They force the team to define the boundary between the open standard, the vendor implementation, the query engine, the semantic model, and the agent tool.&lt;/p&gt;
&lt;h2&gt;What not to automate yet&lt;/h2&gt;
&lt;p&gt;Do not automate the parts of Iceberg view federation that the team cannot explain manually. If nobody can explain the metric, the agent should not calculate it. If nobody can explain rollback, the agent should not write. If nobody can explain the security boundary, the tool should stay internal.&lt;/p&gt;
&lt;p&gt;This is not anti-automation. It is how automation earns trust. Automate the parts with clear contracts first, then widen the scope as evidence accumulates.&lt;/p&gt;
&lt;h2&gt;Source-of-truth ownership&lt;/h2&gt;
&lt;p&gt;Every production rollout needs one named source of truth for each layer. The table has an owner. The catalog has an owner. The semantic view has an owner. The agent tool has an owner. For Iceberg view federation, those owners may sit on different teams, but the contract between them has to be explicit.&lt;/p&gt;
&lt;p&gt;Clear ownership across all layers keeps the architecture credible, whether the governed execution and semantic layer lives in one platform or across several independent services.&lt;/p&gt;
&lt;p&gt;Clear ownership prevents avoidable production confusion.&lt;/p&gt;
&lt;h2&gt;Review cadence&lt;/h2&gt;
&lt;p&gt;Set a review cadence before the first production launch. For Iceberg view federation, I would review the contract after the first week, after the first month, and after the first engine or catalog upgrade. Most problems appear when a workflow that worked in a pilot meets a new version, a new identity, or a new business definition.&lt;/p&gt;
&lt;p&gt;That review should include both platform engineers and business owners. Engineers can verify the mechanics. Business owners can verify that the answers still mean what the company thinks they mean.&lt;/p&gt;
&lt;h2&gt;Launch criteria&lt;/h2&gt;
&lt;p&gt;The launch criteria should be binary. Either Iceberg view federation has a named owner, passing validation checks, approved security boundaries, working rollback, and documented engine support, or it is not ready. Gray areas are acceptable in a research project. They are expensive in production.&lt;/p&gt;
&lt;p&gt;This keeps the article&apos;s recommendation practical: prove the contract first, then widen adoption.&lt;/p&gt;
&lt;h2&gt;Compliance evidence&lt;/h2&gt;
&lt;p&gt;Save the evidence. For Iceberg view federation, keep validation output, approval records, denied-access tests, and rollback proof with the release notes. Future audits are easier when the team can show what it tested before launch.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Implementing MCP in the Lakehouse</title><link>https://iceberglakehouse.com/posts/mcp-lakehouse-semantic-data-layer-python/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/mcp-lakehouse-semantic-data-layer-python/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-mcp-lakehouse-semantic-data-laye...</description><pubDate>Mon, 08 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-mcp-lakehouse-semantic-data-layer-python/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;MCP gives AI clients a standard way to call governed lakehouse tools instead of guessing how to query your data. That is the useful lens for MCP lakehouse in June 2026. The market is not short on announcements. What matters is whether the new pattern changes ownership, performance, governance, and agent readiness in a way your team can operate.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/mcp-lakehouse-semantic-data-layer-python-diagram-1.png&quot; alt=&quot;MCP lakehouse architecture diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The market signal behind MCP lakehouse&lt;/h2&gt;
&lt;p&gt;Model Context Protocol matters because agent tools need contracts. A model should not improvise credentials, SQL targets, or table names. It should call a narrow tool that validates input and returns governed results.&lt;/p&gt;
&lt;p&gt;I care about this topic because it sits at the boundary between open data architecture and AI execution. Most companies are not choosing one engine for every workload anymore. They have warehouses, lakehouse engines, streaming systems, catalogs, metadata platforms, and now agents that ask for data through tools. The shared contract between those systems matters more than any single feature checkbox.&lt;/p&gt;
&lt;p&gt;The vendor-neutral reading is straightforward. If the underlying table and catalog standards get stronger, buyers get more freedom to choose the right engine for each job. Snowflake, Microsoft, ClickHouse, Atlan, Dremio, and the open-source Iceberg ecosystem all point to the same market reality: data platforms are becoming multi-engine and agent-facing.&lt;/p&gt;
&lt;h2&gt;How the architecture works&lt;/h2&gt;
&lt;p&gt;An MCP server exposes tools with names, descriptions, parameters, and return values.&lt;/p&gt;
&lt;p&gt;The server can route requests to Dremio, a REST catalog, or a semantic service.&lt;/p&gt;
&lt;p&gt;Tool code must validate arguments, enforce identity, and log every call.&lt;/p&gt;
&lt;p&gt;The important architectural habit is to separate responsibilities. The table format manages files, snapshots, schema evolution, and table metadata. The catalog manages identity, namespaces, commits, and access patterns. The query engine plans and executes work. The semantic layer maps raw data into business meaning. The agent interface decides which safe tools a model can call.&lt;/p&gt;
&lt;p&gt;That separation keeps the system honest. If a vendor says a workload is open, ask which layer is open. If a feature supports Iceberg, ask which Iceberg version, which operations, and which engines. If an agent can query data, ask whether it is querying raw tables or certified semantic views.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/mcp-lakehouse-semantic-data-layer-python-diagram-2.png&quot; alt=&quot;Operating model diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;A concrete operating example&lt;/h2&gt;
&lt;p&gt;A &lt;code&gt;get_revenue_by_region&lt;/code&gt; tool can accept region, start date, and end date, then query a Dremio certified view. The agent never needs direct access to raw tables or storage credentials.&lt;/p&gt;
&lt;p&gt;That example is intentionally operational. Architecture diagrams are useful, but the design only proves itself when a real workload runs through it. I want to know who owns the table, which catalog authorizes the operation, which engine writes, which engine reads, which semantic view users see, and how the team detects a bad result.&lt;/p&gt;
&lt;p&gt;For agentic analytics, the same example gets stricter. A human analyst can notice ambiguity and ask a teammate. An agent will often keep going unless the tool interface stops it. That means your architecture needs approved definitions, scoped access, query limits, logging, and a clean rollback path before it needs a flashy chat experience.&lt;/p&gt;
&lt;p&gt;This is why I do not treat open table formats as the whole story. Apache Iceberg gives the platform a strong storage contract. It does not, by itself, define customer lifetime value, revenue recognition rules, data owner approval, or what an AI agent may do after it finds an anomaly. Those rules belong in catalogs, semantic layers, governance systems, and agent tools.&lt;/p&gt;
&lt;h2&gt;What this means for the lakehouse&lt;/h2&gt;
&lt;p&gt;An MCP server and SQL engine fit naturally as the governed execution layer for external AI clients such as Claude, Cursor, or custom agents while preserving catalog permissions and semantic definitions.&lt;/p&gt;
&lt;p&gt;A lakehouse platform needs five capabilities to serve agents reliably: query federation to reduce data movement; autonomous performance using Reflections, caching, and table optimization so interactive loops stay fast; an AI Semantic Layer that gives agents approved business context; agentic interfaces through the UI, Python, or MCP-connected tools; and AI SQL functions that bring model-assisted work into SQL without exporting data.&lt;/p&gt;
&lt;h2&gt;Implementation checklist&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;What to document&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Table contract&lt;/td&gt;
&lt;td&gt;Format version, schema rules, snapshot policy, and rollback plan&lt;/td&gt;
&lt;td&gt;Engines need the same understanding of the table.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalog authority&lt;/td&gt;
&lt;td&gt;Production catalog, namespaces, commit rules, and role model&lt;/td&gt;
&lt;td&gt;Multi-engine systems need one source of table truth.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine matrix&lt;/td&gt;
&lt;td&gt;Read, write, merge, delete, schema, and view support by engine&lt;/td&gt;
&lt;td&gt;A feature is not production-ready until the exact operation is tested.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic layer&lt;/td&gt;
&lt;td&gt;Certified views, metric definitions, owners, and labels&lt;/td&gt;
&lt;td&gt;Agents need business meaning, not raw schemas alone.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Credential model, token lifetime, row filters, column masks, and audit logs&lt;/td&gt;
&lt;td&gt;Open access still needs strict governance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations&lt;/td&gt;
&lt;td&gt;Compaction, vacuum, retries, alerting, and incident ownership&lt;/td&gt;
&lt;td&gt;The design must survive failed jobs and bad deploys.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;My practical checklist for this topic is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Expose semantic views and metrics first, not raw tables.&lt;/li&gt;
&lt;li&gt;Use allow lists for tool parameters.&lt;/li&gt;
&lt;li&gt;Add row limits, timeout limits, and query-cost guards.&lt;/li&gt;
&lt;li&gt;Log prompt context, tool arguments, user identity, and query ID.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those items are not written down, the project is still in the demo stage. That does not mean the idea is weak. It means the operating model is not finished.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/mcp-lakehouse-semantic-data-layer-python-diagram-3.png&quot; alt=&quot;Implementation checklist diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Failure modes worth respecting&lt;/h2&gt;
&lt;p&gt;MCP is not security by itself. A poorly written tool can leak data, run expensive queries, or expose unauthorized columns.&lt;/p&gt;
&lt;p&gt;The other failure mode is semantic drift. A table can be technically valid while the business definition on top of it changes quietly. That is where many AI analytics projects fail. The model generates SQL against a table that exists, the query returns rows, and the answer looks plausible. The problem is that the answer used the wrong grain, the wrong filter, or the wrong metric definition.&lt;/p&gt;
&lt;p&gt;The fix is not a longer prompt. The fix is stronger data contracts. Certified semantic views should be easier for agents to use than raw tables. Sensitive columns should be masked or hidden before the model can ask for them. Write-capable tools should require intent, validation, and idempotency. Expensive queries should have limits. Every tool call should leave evidence.&lt;/p&gt;
&lt;p&gt;This is also where vendor-neutral thinking helps. Do not trust a platform because it has the best demo. Trust the platform when it gives you clear contracts between storage, catalog, semantic layer, engine, and agent. Trust it more when you can test those contracts with another engine or another client.&lt;/p&gt;
&lt;h2&gt;What I would do first&lt;/h2&gt;
&lt;p&gt;Start with one production-shaped workflow. Do not start with the easiest toy table, and do not start with the most politically sensitive workload. Pick a table or semantic view that matters, has an owner, has known correctness checks, and can tolerate a controlled pilot.&lt;/p&gt;
&lt;p&gt;For MCP lakehouse, I would write down five things before touching production: the owner, the accepted engines, the policy boundary, the rollback path, and the agent-facing interface. Then I would run the same workflow three ways: manually, through the intended query engine, and through the agent or automation layer. Differences between those paths are where the real work begins.&lt;/p&gt;
&lt;p&gt;Measure boring things. Count files. Count snapshots. Track query planning time. Track storage calls. Track failed commits. Track token issuance. Track denied access. Track whether a human can explain the result without reading tool logs for an hour. These metrics are not glamorous, but they tell you whether the architecture is ready.&lt;/p&gt;
&lt;h2&gt;Final recommendation&lt;/h2&gt;
&lt;p&gt;The right conclusion is not that every team should adopt every June 2026 feature immediately. The right conclusion is that the lakehouse is becoming an execution surface for humans and agents, and that changes the quality bar. Open storage is necessary. Governed catalogs are necessary. Semantic context is necessary. Fast SQL is necessary. Scoped agent tools are necessary.&lt;/p&gt;
&lt;p&gt;That combination is exactly why the Agentic Lakehouse is becoming the right framing. It describes the platform you need when AI agents stop answering isolated questions and start participating in analytical workflows.&lt;/p&gt;
&lt;p&gt;For more background on the lakehouse and AI side of this work, explore my books on data lakehouses and AI at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;. If you want to try this style of governed, open, agent-ready architecture in practice, start a free trial of Dremio&apos;s Agentic Lakehouse at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Field notes for teams evaluating this now&lt;/h2&gt;
&lt;p&gt;First, make compatibility visible. A table-format version, catalog endpoint, and engine release should appear in your runbook. If a production issue happens, nobody should have to guess which engine wrote the latest snapshot or which client introduced a metadata change.&lt;/p&gt;
&lt;p&gt;Second, keep the semantic layer close to the workflow. If the article topic affects analytics agents, customer-facing metrics, financial reporting, or regulated data, raw-table access should be the exception. Certified views should be the normal path.&lt;/p&gt;
&lt;p&gt;Third, separate experimentation from certification. Engineers need sandboxes where they can test new Iceberg features, catalog options, and agent tools. Business users and agents need certified surfaces where definitions, owners, and policies have already been reviewed.&lt;/p&gt;
&lt;p&gt;Fourth, keep the architecture open. Not every byte must move into one platform. An architecture that can query data in place, add semantic context, accelerate common workloads, and expose governed agent interfaces over open data creates more flexibility.&lt;/p&gt;
&lt;p&gt;Fifth, publish the limits. If a feature is read-only in one engine, say so. If write interoperability is approved only for append workloads, say so. If remote signing is required for regulated tables, say so. Clear limits create trust. Hidden limits create incidents.&lt;/p&gt;
&lt;h2&gt;Identity and access review&lt;/h2&gt;
&lt;p&gt;For MCP lakehouse, I would run one full dry run with production-like identities. Use an analyst identity, a service account, and the intended agent identity. Confirm that each identity sees only the expected semantic objects, receives predictable errors, and leaves useful audit records. That test catches policy gaps before they become production incidents.&lt;/p&gt;
&lt;p&gt;The agent identity matters most because it is easy to over-permission during a pilot. If the agent only needs a certified revenue view, do not give it namespace-wide table discovery. If the agent needs row-level access for one geography, test that a second geography returns a denial instead of silent leakage.&lt;/p&gt;
&lt;h2&gt;Documentation that actually helps&lt;/h2&gt;
&lt;p&gt;The documentation should fit on one page. Name the owner, the supported engines, the catalog authority, the accepted table operations, the security model, and the rollback path. If a new engineer cannot understand the contract for MCP lakehouse from that page, the architecture is still too implicit.&lt;/p&gt;
&lt;p&gt;Good documentation is not a wiki dump. It is an operating contract. It should say who can approve a schema change, which engine owns compaction, how long snapshots are retained, and what happens when an agent produces a suspicious result. That level of detail is what turns a promising pattern into a maintainable system.&lt;/p&gt;
&lt;h2&gt;How to keep agents in bounds&lt;/h2&gt;
&lt;p&gt;Agents should not receive broad table access just because a human can ask broad questions. For MCP lakehouse, expose narrow tools over certified views first. Add write-capable tools only after you have validation rules, idempotency keys, approval gates, and audit records that a reviewer can follow.&lt;/p&gt;
&lt;p&gt;The tool description should also be honest. If a tool returns estimated data, say estimated. If a tool excludes delayed transactions, say that. If a tool is read-only, make that clear in the name and policy. Agents work better when the interface gives them fewer chances to infer the wrong contract.&lt;/p&gt;
&lt;h2&gt;What to measure after launch&lt;/h2&gt;
&lt;p&gt;The first production month should be measurement-heavy. Track planning time, query latency, failed commits, denied access attempts, credential issuance, snapshot growth, and semantic-view usage. If MCP lakehouse is helping, the evidence should show up in fewer manual workarounds and clearer operational ownership.&lt;/p&gt;
&lt;p&gt;I would also track human trust signals. Are analysts using the certified view more often? Are engineers filing fewer tickets about unclear table ownership? Are agents producing answers that reviewers can trace back to approved definitions? Those signals tell you whether the architecture is improving daily work, not just passing a benchmark.&lt;/p&gt;
&lt;h2&gt;A buyer question worth asking&lt;/h2&gt;
&lt;p&gt;The buyer question is simple: does this pattern increase choice without weakening governance? For MCP lakehouse, the best answer is specific. It should name the table format, catalog contract, semantic surface, security controls, and engine support matrix. Anything less is a demo, not an operating model.&lt;/p&gt;
&lt;p&gt;This is where the architecture should stay disciplined. The point is not that open architecture is automatically better. The point is that open architecture gives you room to test engines, keep data in place, add semantic context, and still maintain control. That is a stronger argument than a generic platform claim.&lt;/p&gt;
&lt;h2&gt;A realistic rollout sequence&lt;/h2&gt;
&lt;p&gt;The rollout should start with read visibility, then move to operational automation, then consider action loops. For MCP lakehouse, the first milestone is a certified read path with approved semantics. The second milestone is repeatable validation through CI or scheduled checks. The third milestone is agent access with narrow tools and strict audit.&lt;/p&gt;
&lt;p&gt;Write paths should come later unless the topic itself is about write interoperability or table maintenance. Even then, begin with append-only or isolated writes. Updates, deletes, merges, and external actions need stronger controls because they change the state other people depend on.&lt;/p&gt;
&lt;h2&gt;How this should sound to executives&lt;/h2&gt;
&lt;p&gt;The executive version should avoid implementation trivia, but it should not become vague. Say that MCP lakehouse helps the company keep analytical data open, governed, and ready for AI-assisted work. Then say what the team will measure: cost, speed, correctness, access control, and operational effort.&lt;/p&gt;
&lt;p&gt;That framing is useful because executives do not need every catalog detail. They do need to know whether the architecture reduces lock-in, improves reliability, and gives agents a trustworthy data foundation. Those are business outcomes tied to technical choices.&lt;/p&gt;
&lt;h2&gt;How this should sound to engineers&lt;/h2&gt;
&lt;p&gt;The engineering version should be blunt. Which APIs are used? Which engine versions are approved? Which table operations are allowed? Which failures are retried? Which failures stop the workflow? Which logs prove that the right identity performed the right operation?&lt;/p&gt;
&lt;p&gt;For MCP lakehouse, those questions are more valuable than broad claims. They force the team to define the boundary between the open standard, the vendor implementation, the query engine, the semantic model, and the agent tool.&lt;/p&gt;
&lt;h2&gt;What not to automate yet&lt;/h2&gt;
&lt;p&gt;Do not automate the parts of MCP lakehouse that the team cannot explain manually. If nobody can explain the metric, the agent should not calculate it. If nobody can explain rollback, the agent should not write. If nobody can explain the security boundary, the tool should stay internal.&lt;/p&gt;
&lt;p&gt;This is not anti-automation. It is how automation earns trust. Automate the parts with clear contracts first, then widen the scope as evidence accumulates.&lt;/p&gt;
&lt;h2&gt;Source-of-truth ownership&lt;/h2&gt;
&lt;p&gt;Every production rollout needs one named source of truth for each layer. The table has an owner. The catalog has an owner. The semantic view has an owner. The agent tool has an owner. For MCP lakehouse, those owners may sit on different teams, but the contract between them has to be explicit.&lt;/p&gt;
&lt;p&gt;Clear ownership across all layers keeps the architecture credible, whether the governed execution and semantic layer lives in one platform or across several independent services.&lt;/p&gt;
&lt;p&gt;Clear ownership prevents avoidable production confusion.&lt;/p&gt;
&lt;h2&gt;Review cadence&lt;/h2&gt;
&lt;p&gt;Set a review cadence before the first production launch. For MCP lakehouse, I would review the contract after the first week, after the first month, and after the first engine or catalog upgrade. Most problems appear when a workflow that worked in a pilot meets a new version, a new identity, or a new business definition.&lt;/p&gt;
&lt;p&gt;That review should include both platform engineers and business owners. Engineers can verify the mechanics. Business owners can verify that the answers still mean what the company thinks they mean.&lt;/p&gt;
&lt;h2&gt;Launch criteria&lt;/h2&gt;
&lt;p&gt;The launch criteria should be binary. Either MCP lakehouse has a named owner, passing validation checks, approved security boundaries, working rollback, and documented engine support, or it is not ready. Gray areas are acceptable in a research project. They are expensive in production.&lt;/p&gt;
&lt;p&gt;This keeps the article&apos;s recommendation practical: prove the contract first, then widen adoption.&lt;/p&gt;
&lt;h2&gt;Compliance evidence&lt;/h2&gt;
&lt;p&gt;Save the evidence. For MCP lakehouse, keep validation output, approval records, denied-access tests, and rollback proof with the release notes. Future audits are easier when the team can show what it tested before launch.&lt;/p&gt;
&lt;h2&gt;Test cases that matter&lt;/h2&gt;
&lt;p&gt;Use test cases that reflect real business questions. For MCP lakehouse, include at least one happy path, one denied-access path, one stale-data path, and one rollback path. Those tests reveal more than a generic demo query.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Microsoft Fabric Build 2026 Agentic Analytics Stack</title><link>https://iceberglakehouse.com/posts/microsoft-fabric-build-2026-agentic-analytics-stack/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/microsoft-fabric-build-2026-agentic-analytics-stack/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-microsoft-fabric-build-2026-agen...</description><pubDate>Mon, 08 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-microsoft-fabric-build-2026-agentic-analytics-stack/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Microsoft&apos;s Fabric direction shows that agentic analytics is becoming a platform architecture, not a chat feature. That is the useful lens for Microsoft Fabric agentic analytics in June 2026. The market is not short on announcements. What matters is whether the new pattern changes ownership, performance, governance, and agent readiness in a way your team can operate.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/microsoft-fabric-build-2026-agentic-analytics-stack-diagram-1.png&quot; alt=&quot;Microsoft Fabric agentic analytics architecture diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The market signal behind Microsoft Fabric agentic analytics&lt;/h2&gt;
&lt;p&gt;Build 2026 put more weight behind agents, context, Copilot, and Fabric IQ. The important part is not that Microsoft added more AI branding. The important part is the architecture: agents need governed data, shared semantics, and a storage layer they can trust.&lt;/p&gt;
&lt;p&gt;I care about this topic because it sits at the boundary between open data architecture and AI execution. Most companies are not choosing one engine for every workload anymore. They have warehouses, lakehouse engines, streaming systems, catalogs, metadata platforms, and now agents that ask for data through tools. The shared contract between those systems matters more than any single feature checkbox.&lt;/p&gt;
&lt;p&gt;The vendor-neutral reading is straightforward. If the underlying table and catalog standards get stronger, buyers get more freedom to choose the right engine for each job. Snowflake, Microsoft, ClickHouse, Atlan, Dremio, and the open-source Iceberg ecosystem all point to the same market reality: data platforms are becoming multi-engine and agent-facing.&lt;/p&gt;
&lt;h2&gt;How the architecture works&lt;/h2&gt;
&lt;p&gt;OneLake gives Fabric a shared storage substrate for analytical data.&lt;/p&gt;
&lt;p&gt;Semantic models and Fabric IQ provide context that agents can use when planning queries and explaining results.&lt;/p&gt;
&lt;p&gt;Copilot-style interfaces become more useful when they operate over governed business objects instead of raw tables alone.&lt;/p&gt;
&lt;p&gt;The important architectural habit is to separate responsibilities. The table format manages files, snapshots, schema evolution, and table metadata. The catalog manages identity, namespaces, commits, and access patterns. The query engine plans and executes work. The semantic layer maps raw data into business meaning. The agent interface decides which safe tools a model can call.&lt;/p&gt;
&lt;p&gt;That separation keeps the system honest. If a vendor says a workload is open, ask which layer is open. If a feature supports Iceberg, ask which Iceberg version, which operations, and which engines. If an agent can query data, ask whether it is querying raw tables or certified semantic views.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/microsoft-fabric-build-2026-agentic-analytics-stack-diagram-2.png&quot; alt=&quot;Operating model diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;A concrete operating example&lt;/h2&gt;
&lt;p&gt;An agent that answers why revenue fell in the Northeast has to know the approved revenue metric, the sales territory hierarchy, the time comparison rule, and which datasets are certified. That is a semantic problem before it is a model problem.&lt;/p&gt;
&lt;p&gt;That example is intentionally operational. Architecture diagrams are useful, but the design only proves itself when a real workload runs through it. I want to know who owns the table, which catalog authorizes the operation, which engine writes, which engine reads, which semantic view users see, and how the team detects a bad result.&lt;/p&gt;
&lt;p&gt;For agentic analytics, the same example gets stricter. A human analyst can notice ambiguity and ask a teammate. An agent will often keep going unless the tool interface stops it. That means your architecture needs approved definitions, scoped access, query limits, logging, and a clean rollback path before it needs a flashy chat experience.&lt;/p&gt;
&lt;p&gt;This is why I do not treat open table formats as the whole story. Apache Iceberg gives the platform a strong storage contract. It does not, by itself, define customer lifetime value, revenue recognition rules, data owner approval, or what an AI agent may do after it finds an anomaly. Those rules belong in catalogs, semantic layers, governance systems, and agent tools.&lt;/p&gt;
&lt;h2&gt;What this means for the lakehouse&lt;/h2&gt;
&lt;p&gt;A lakehouse platform needs five capabilities to serve agents reliably: query federation to reduce data movement; autonomous performance using Reflections, caching, and table optimization so interactive loops stay fast; an AI Semantic Layer that gives agents approved business context; agentic interfaces through the UI, Python, or MCP-connected tools; and AI SQL functions that bring model-assisted work into SQL without exporting data.&lt;/p&gt;
&lt;h2&gt;Implementation checklist&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;What to document&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Table contract&lt;/td&gt;
&lt;td&gt;Format version, schema rules, snapshot policy, and rollback plan&lt;/td&gt;
&lt;td&gt;Engines need the same understanding of the table.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalog authority&lt;/td&gt;
&lt;td&gt;Production catalog, namespaces, commit rules, and role model&lt;/td&gt;
&lt;td&gt;Multi-engine systems need one source of table truth.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine matrix&lt;/td&gt;
&lt;td&gt;Read, write, merge, delete, schema, and view support by engine&lt;/td&gt;
&lt;td&gt;A feature is not production-ready until the exact operation is tested.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic layer&lt;/td&gt;
&lt;td&gt;Certified views, metric definitions, owners, and labels&lt;/td&gt;
&lt;td&gt;Agents need business meaning, not raw schemas alone.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Credential model, token lifetime, row filters, column masks, and audit logs&lt;/td&gt;
&lt;td&gt;Open access still needs strict governance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations&lt;/td&gt;
&lt;td&gt;Compaction, vacuum, retries, alerting, and incident ownership&lt;/td&gt;
&lt;td&gt;The design must survive failed jobs and bad deploys.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;My practical checklist for this topic is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Inventory certified semantic models before exposing agents.&lt;/li&gt;
&lt;li&gt;Measure agent answers against known BI reports.&lt;/li&gt;
&lt;li&gt;Separate model evaluation from data-contract evaluation.&lt;/li&gt;
&lt;li&gt;Keep an open exit path for core data and metadata.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those items are not written down, the project is still in the demo stage. That does not mean the idea is weak. It means the operating model is not finished.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/microsoft-fabric-build-2026-agentic-analytics-stack-diagram-3.png&quot; alt=&quot;Implementation checklist diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Failure modes worth respecting&lt;/h2&gt;
&lt;p&gt;A platform-owned agent stack can become too closed if customers cannot bring open engines, catalogs, and tools. Buyers should ask how far the semantics travel outside one vendor&apos;s workspace.&lt;/p&gt;
&lt;p&gt;The other failure mode is semantic drift. A table can be technically valid while the business definition on top of it changes quietly. That is where many AI analytics projects fail. The model generates SQL against a table that exists, the query returns rows, and the answer looks plausible. The problem is that the answer used the wrong grain, the wrong filter, or the wrong metric definition.&lt;/p&gt;
&lt;p&gt;The fix is not a longer prompt. The fix is stronger data contracts. Certified semantic views should be easier for agents to use than raw tables. Sensitive columns should be masked or hidden before the model can ask for them. Write-capable tools should require intent, validation, and idempotency. Expensive queries should have limits. Every tool call should leave evidence.&lt;/p&gt;
&lt;p&gt;This is also where vendor-neutral thinking helps. Do not trust a platform because it has the best demo. Trust the platform when it gives you clear contracts between storage, catalog, semantic layer, engine, and agent. Trust it more when you can test those contracts with another engine or another client.&lt;/p&gt;
&lt;h2&gt;What I would do first&lt;/h2&gt;
&lt;p&gt;Start with one production-shaped workflow. Do not start with the easiest toy table, and do not start with the most politically sensitive workload. Pick a table or semantic view that matters, has an owner, has known correctness checks, and can tolerate a controlled pilot.&lt;/p&gt;
&lt;p&gt;For Microsoft Fabric agentic analytics, I would write down five things before touching production: the owner, the accepted engines, the policy boundary, the rollback path, and the agent-facing interface. Then I would run the same workflow three ways: manually, through the intended query engine, and through the agent or automation layer. Differences between those paths are where the real work begins.&lt;/p&gt;
&lt;p&gt;Measure boring things. Count files. Count snapshots. Track query planning time. Track storage calls. Track failed commits. Track token issuance. Track denied access. Track whether a human can explain the result without reading tool logs for an hour. These metrics are not glamorous, but they tell you whether the architecture is ready.&lt;/p&gt;
&lt;h2&gt;Final recommendation&lt;/h2&gt;
&lt;p&gt;The right conclusion is not that every team should adopt every June 2026 feature immediately. The right conclusion is that the lakehouse is becoming an execution surface for humans and agents, and that changes the quality bar. Open storage is necessary. Governed catalogs are necessary. Semantic context is necessary. Fast SQL is necessary. Scoped agent tools are necessary.&lt;/p&gt;
&lt;p&gt;That combination is exactly why the Agentic Lakehouse is becoming the right framing. It describes the platform you need when AI agents stop answering isolated questions and start participating in analytical workflows.&lt;/p&gt;
&lt;p&gt;For more background on the lakehouse and AI side of this work, explore my books on data lakehouses and AI at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;. If you want to try this style of governed, open, agent-ready architecture in practice, start a free trial of Dremio&apos;s Agentic Lakehouse at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Field notes for teams evaluating this now&lt;/h2&gt;
&lt;p&gt;First, make compatibility visible. A table-format version, catalog endpoint, and engine release should appear in your runbook. If a production issue happens, nobody should have to guess which engine wrote the latest snapshot or which client introduced a metadata change.&lt;/p&gt;
&lt;p&gt;Second, keep the semantic layer close to the workflow. If the article topic affects analytics agents, customer-facing metrics, financial reporting, or regulated data, raw-table access should be the exception. Certified views should be the normal path.&lt;/p&gt;
&lt;p&gt;Third, separate experimentation from certification. Engineers need sandboxes where they can test new Iceberg features, catalog options, and agent tools. Business users and agents need certified surfaces where definitions, owners, and policies have already been reviewed.&lt;/p&gt;
&lt;p&gt;Fourth, keep the architecture open. Not every byte must move into one platform. An architecture that can query data in place, add semantic context, accelerate common workloads, and expose governed agent interfaces over open data creates more flexibility.&lt;/p&gt;
&lt;p&gt;Fifth, publish the limits. If a feature is read-only in one engine, say so. If write interoperability is approved only for append workloads, say so. If remote signing is required for regulated tables, say so. Clear limits create trust. Hidden limits create incidents.&lt;/p&gt;
&lt;h2&gt;Identity and access review&lt;/h2&gt;
&lt;p&gt;For Microsoft Fabric agentic analytics, I would run one full dry run with production-like identities. Use an analyst identity, a service account, and the intended agent identity. Confirm that each identity sees only the expected semantic objects, receives predictable errors, and leaves useful audit records. That test catches policy gaps before they become production incidents.&lt;/p&gt;
&lt;p&gt;The agent identity matters most because it is easy to over-permission during a pilot. If the agent only needs a certified revenue view, do not give it namespace-wide table discovery. If the agent needs row-level access for one geography, test that a second geography returns a denial instead of silent leakage.&lt;/p&gt;
&lt;h2&gt;Documentation that actually helps&lt;/h2&gt;
&lt;p&gt;The documentation should fit on one page. Name the owner, the supported engines, the catalog authority, the accepted table operations, the security model, and the rollback path. If a new engineer cannot understand the contract for Microsoft Fabric agentic analytics from that page, the architecture is still too implicit.&lt;/p&gt;
&lt;p&gt;Good documentation is not a wiki dump. It is an operating contract. It should say who can approve a schema change, which engine owns compaction, how long snapshots are retained, and what happens when an agent produces a suspicious result. That level of detail is what turns a promising pattern into a maintainable system.&lt;/p&gt;
&lt;h2&gt;How to keep agents in bounds&lt;/h2&gt;
&lt;p&gt;Agents should not receive broad table access just because a human can ask broad questions. For Microsoft Fabric agentic analytics, expose narrow tools over certified views first. Add write-capable tools only after you have validation rules, idempotency keys, approval gates, and audit records that a reviewer can follow.&lt;/p&gt;
&lt;p&gt;The tool description should also be honest. If a tool returns estimated data, say estimated. If a tool excludes delayed transactions, say that. If a tool is read-only, make that clear in the name and policy. Agents work better when the interface gives them fewer chances to infer the wrong contract.&lt;/p&gt;
&lt;h2&gt;What to measure after launch&lt;/h2&gt;
&lt;p&gt;The first production month should be measurement-heavy. Track planning time, query latency, failed commits, denied access attempts, credential issuance, snapshot growth, and semantic-view usage. If Microsoft Fabric agentic analytics is helping, the evidence should show up in fewer manual workarounds and clearer operational ownership.&lt;/p&gt;
&lt;p&gt;I would also track human trust signals. Are analysts using the certified view more often? Are engineers filing fewer tickets about unclear table ownership? Are agents producing answers that reviewers can trace back to approved definitions? Those signals tell you whether the architecture is improving daily work, not just passing a benchmark.&lt;/p&gt;
&lt;h2&gt;A buyer question worth asking&lt;/h2&gt;
&lt;p&gt;The buyer question is simple: does this pattern increase choice without weakening governance? For Microsoft Fabric agentic analytics, the best answer is specific. It should name the table format, catalog contract, semantic surface, security controls, and engine support matrix. Anything less is a demo, not an operating model.&lt;/p&gt;
&lt;p&gt;This is where the architecture should stay disciplined. The point is not that open architecture is automatically better. The point is that open architecture gives you room to test engines, keep data in place, add semantic context, and still maintain control. That is a stronger argument than a generic platform claim.&lt;/p&gt;
&lt;h2&gt;A realistic rollout sequence&lt;/h2&gt;
&lt;p&gt;The rollout should start with read visibility, then move to operational automation, then consider action loops. For Microsoft Fabric agentic analytics, the first milestone is a certified read path with approved semantics. The second milestone is repeatable validation through CI or scheduled checks. The third milestone is agent access with narrow tools and strict audit.&lt;/p&gt;
&lt;p&gt;Write paths should come later unless the topic itself is about write interoperability or table maintenance. Even then, begin with append-only or isolated writes. Updates, deletes, merges, and external actions need stronger controls because they change the state other people depend on.&lt;/p&gt;
&lt;h2&gt;How this should sound to executives&lt;/h2&gt;
&lt;p&gt;The executive version should avoid implementation trivia, but it should not become vague. Say that Microsoft Fabric agentic analytics helps the company keep analytical data open, governed, and ready for AI-assisted work. Then say what the team will measure: cost, speed, correctness, access control, and operational effort.&lt;/p&gt;
&lt;p&gt;That framing is useful because executives do not need every catalog detail. They do need to know whether the architecture reduces lock-in, improves reliability, and gives agents a trustworthy data foundation. Those are business outcomes tied to technical choices.&lt;/p&gt;
&lt;h2&gt;How this should sound to engineers&lt;/h2&gt;
&lt;p&gt;The engineering version should be blunt. Which APIs are used? Which engine versions are approved? Which table operations are allowed? Which failures are retried? Which failures stop the workflow? Which logs prove that the right identity performed the right operation?&lt;/p&gt;
&lt;p&gt;For Microsoft Fabric agentic analytics, those questions are more valuable than broad claims. They force the team to define the boundary between the open standard, the vendor implementation, the query engine, the semantic model, and the agent tool.&lt;/p&gt;
&lt;h2&gt;What not to automate yet&lt;/h2&gt;
&lt;p&gt;Do not automate the parts of Microsoft Fabric agentic analytics that the team cannot explain manually. If nobody can explain the metric, the agent should not calculate it. If nobody can explain rollback, the agent should not write. If nobody can explain the security boundary, the tool should stay internal.&lt;/p&gt;
&lt;p&gt;This is not anti-automation. It is how automation earns trust. Automate the parts with clear contracts first, then widen the scope as evidence accumulates.&lt;/p&gt;
&lt;h2&gt;Source-of-truth ownership&lt;/h2&gt;
&lt;p&gt;Every production rollout needs one named source of truth for each layer. The table has an owner. The catalog has an owner. The semantic view has an owner. The agent tool has an owner. For Microsoft Fabric agentic analytics, those owners may sit on different teams, but the contract between them has to be explicit.&lt;/p&gt;
&lt;p&gt;Clear ownership across all layers keeps the architecture credible, whether the governed execution and semantic layer lives in one platform or across several independent services.&lt;/p&gt;
&lt;p&gt;Clear ownership prevents avoidable production confusion.&lt;/p&gt;
&lt;h2&gt;Review cadence&lt;/h2&gt;
&lt;p&gt;Set a review cadence before the first production launch. For Microsoft Fabric agentic analytics, I would review the contract after the first week, after the first month, and after the first engine or catalog upgrade. Most problems appear when a workflow that worked in a pilot meets a new version, a new identity, or a new business definition.&lt;/p&gt;
&lt;p&gt;That review should include both platform engineers and business owners. Engineers can verify the mechanics. Business owners can verify that the answers still mean what the company thinks they mean.&lt;/p&gt;
&lt;h2&gt;Launch criteria&lt;/h2&gt;
&lt;p&gt;The launch criteria should be binary. Either Microsoft Fabric agentic analytics has a named owner, passing validation checks, approved security boundaries, working rollback, and documented engine support, or it is not ready. Gray areas are acceptable in a research project. They are expensive in production.&lt;/p&gt;
&lt;p&gt;This keeps the article&apos;s recommendation practical: prove the contract first, then widen adoption.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Modern Python Tooling for Apache Iceberg</title><link>https://iceberglakehouse.com/posts/python-tooling-apache-iceberg-pyiceberg-iceframe-iceberg-cli/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/python-tooling-apache-iceberg-pyiceberg-iceframe-iceberg-cli/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-python-tooling-apache-iceberg-py...</description><pubDate>Mon, 08 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-python-tooling-apache-iceberg-pyiceberg-iceframe-iceberg-cli/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Python has become a practical Iceberg control plane for metadata work, catalog automation, and smaller operational workflows. That is the useful lens for Python Apache Iceberg tooling in June 2026. The market is not short on announcements. What matters is whether the new pattern changes ownership, performance, governance, and agent readiness in a way your team can operate.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/python-tooling-apache-iceberg-pyiceberg-iceframe-iceberg-cli-diagram-1.png&quot; alt=&quot;Python Apache Iceberg tooling architecture diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The market signal behind Python Apache Iceberg tooling&lt;/h2&gt;
&lt;p&gt;Spark is still important for distributed processing, but not every Iceberg task deserves a Spark cluster. Python tools can inspect schemas, snapshots, manifests, catalogs, and table properties quickly enough for developer workflows and agent tools.&lt;/p&gt;
&lt;p&gt;I care about this topic because it sits at the boundary between open data architecture and AI execution. Most companies are not choosing one engine for every workload anymore. They have warehouses, lakehouse engines, streaming systems, catalogs, metadata platforms, and now agents that ask for data through tools. The shared contract between those systems matters more than any single feature checkbox.&lt;/p&gt;
&lt;p&gt;The vendor-neutral reading is straightforward. If the underlying table and catalog standards get stronger, buyers get more freedom to choose the right engine for each job. Snowflake, Microsoft, ClickHouse, Atlan, Dremio, and the open-source Iceberg ecosystem all point to the same market reality: data platforms are becoming multi-engine and agent-facing.&lt;/p&gt;
&lt;h2&gt;How the architecture works&lt;/h2&gt;
&lt;p&gt;PyIceberg gives Python applications a native way to load catalogs, inspect tables, and work with Iceberg metadata.&lt;/p&gt;
&lt;p&gt;CLI tooling is useful for operators who need repeatable checks in CI, release scripts, or support runbooks.&lt;/p&gt;
&lt;p&gt;Higher-level Python helpers can wrap common tasks such as schema review, snapshot inspection, table health checks, and metadata audits.&lt;/p&gt;
&lt;p&gt;The important architectural habit is to separate responsibilities. The table format manages files, snapshots, schema evolution, and table metadata. The catalog manages identity, namespaces, commits, and access patterns. The query engine plans and executes work. The semantic layer maps raw data into business meaning. The agent interface decides which safe tools a model can call.&lt;/p&gt;
&lt;p&gt;That separation keeps the system honest. If a vendor says a workload is open, ask which layer is open. If a feature supports Iceberg, ask which Iceberg version, which operations, and which engines. If an agent can query data, ask whether it is querying raw tables or certified semantic views.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/python-tooling-apache-iceberg-pyiceberg-iceframe-iceberg-cli-diagram-2.png&quot; alt=&quot;Operating model diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;A concrete operating example&lt;/h2&gt;
&lt;p&gt;A data platform team can run a CI check that loads a table through the REST catalog, validates required properties, confirms the current format version, and fails the pull request if a schema change removes an approved business column.&lt;/p&gt;
&lt;p&gt;That example is intentionally operational. Architecture diagrams are useful, but the design only proves itself when a real workload runs through it. I want to know who owns the table, which catalog authorizes the operation, which engine writes, which engine reads, which semantic view users see, and how the team detects a bad result.&lt;/p&gt;
&lt;p&gt;For agentic analytics, the same example gets stricter. A human analyst can notice ambiguity and ask a teammate. An agent will often keep going unless the tool interface stops it. That means your architecture needs approved definitions, scoped access, query limits, logging, and a clean rollback path before it needs a flashy chat experience.&lt;/p&gt;
&lt;p&gt;This is why I do not treat open table formats as the whole story. Apache Iceberg gives the platform a strong storage contract. It does not, by itself, define customer lifetime value, revenue recognition rules, data owner approval, or what an AI agent may do after it finds an anomaly. Those rules belong in catalogs, semantic layers, governance systems, and agent tools.&lt;/p&gt;
&lt;h2&gt;What this means for the lakehouse&lt;/h2&gt;
&lt;p&gt;Agentic analytics needs programmable, governed interfaces. SQL, Python libraries, and MCP-oriented patterns give teams the tools they need. Python Iceberg tooling provides a lower-level inspection layer that pairs well with a higher-level semantic and query layer.&lt;/p&gt;
&lt;p&gt;A lakehouse platform needs five capabilities to serve agents reliably: query federation to reduce data movement; autonomous performance using Reflections, caching, and table optimization so interactive loops stay fast; an AI Semantic Layer that gives agents approved business context; agentic interfaces through the UI, Python, or MCP-connected tools; and AI SQL functions that bring model-assisted work into SQL without exporting data.&lt;/p&gt;
&lt;h2&gt;Implementation checklist&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;What to document&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Table contract&lt;/td&gt;
&lt;td&gt;Format version, schema rules, snapshot policy, and rollback plan&lt;/td&gt;
&lt;td&gt;Engines need the same understanding of the table.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalog authority&lt;/td&gt;
&lt;td&gt;Production catalog, namespaces, commit rules, and role model&lt;/td&gt;
&lt;td&gt;Multi-engine systems need one source of table truth.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine matrix&lt;/td&gt;
&lt;td&gt;Read, write, merge, delete, schema, and view support by engine&lt;/td&gt;
&lt;td&gt;A feature is not production-ready until the exact operation is tested.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic layer&lt;/td&gt;
&lt;td&gt;Certified views, metric definitions, owners, and labels&lt;/td&gt;
&lt;td&gt;Agents need business meaning, not raw schemas alone.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Credential model, token lifetime, row filters, column masks, and audit logs&lt;/td&gt;
&lt;td&gt;Open access still needs strict governance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations&lt;/td&gt;
&lt;td&gt;Compaction, vacuum, retries, alerting, and incident ownership&lt;/td&gt;
&lt;td&gt;The design must survive failed jobs and bad deploys.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;My practical checklist for this topic is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use Python for catalog inspection, schema checks, and snapshot audits.&lt;/li&gt;
&lt;li&gt;Keep large scans on Dremio, Spark, Flink, Trino, or another execution engine built for that work.&lt;/li&gt;
&lt;li&gt;Store catalog configuration outside code and rotate credentials like any other production secret.&lt;/li&gt;
&lt;li&gt;Wrap agent-facing Python tools with validation and explicit allow lists.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those items are not written down, the project is still in the demo stage. That does not mean the idea is weak. It means the operating model is not finished.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/python-tooling-apache-iceberg-pyiceberg-iceframe-iceberg-cli-diagram-3.png&quot; alt=&quot;Implementation checklist diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Failure modes worth respecting&lt;/h2&gt;
&lt;p&gt;Python tooling is not a replacement for a distributed engine on heavy scans. Treat it as metadata automation and targeted operations unless you have measured the workload.&lt;/p&gt;
&lt;p&gt;The other failure mode is semantic drift. A table can be technically valid while the business definition on top of it changes quietly. That is where many AI analytics projects fail. The model generates SQL against a table that exists, the query returns rows, and the answer looks plausible. The problem is that the answer used the wrong grain, the wrong filter, or the wrong metric definition.&lt;/p&gt;
&lt;p&gt;The fix is not a longer prompt. The fix is stronger data contracts. Certified semantic views should be easier for agents to use than raw tables. Sensitive columns should be masked or hidden before the model can ask for them. Write-capable tools should require intent, validation, and idempotency. Expensive queries should have limits. Every tool call should leave evidence.&lt;/p&gt;
&lt;p&gt;This is also where vendor-neutral thinking helps. Do not trust a platform because it has the best demo. Trust the platform when it gives you clear contracts between storage, catalog, semantic layer, engine, and agent. Trust it more when you can test those contracts with another engine or another client.&lt;/p&gt;
&lt;h2&gt;What I would do first&lt;/h2&gt;
&lt;p&gt;Start with one production-shaped workflow. Do not start with the easiest toy table, and do not start with the most politically sensitive workload. Pick a table or semantic view that matters, has an owner, has known correctness checks, and can tolerate a controlled pilot.&lt;/p&gt;
&lt;p&gt;For Python Apache Iceberg tooling, I would write down five things before touching production: the owner, the accepted engines, the policy boundary, the rollback path, and the agent-facing interface. Then I would run the same workflow three ways: manually, through the intended query engine, and through the agent or automation layer. Differences between those paths are where the real work begins.&lt;/p&gt;
&lt;p&gt;Measure boring things. Count files. Count snapshots. Track query planning time. Track storage calls. Track failed commits. Track token issuance. Track denied access. Track whether a human can explain the result without reading tool logs for an hour. These metrics are not glamorous, but they tell you whether the architecture is ready.&lt;/p&gt;
&lt;h2&gt;Final recommendation&lt;/h2&gt;
&lt;p&gt;The right conclusion is not that every team should adopt every June 2026 feature immediately. The right conclusion is that the lakehouse is becoming an execution surface for humans and agents, and that changes the quality bar. Open storage is necessary. Governed catalogs are necessary. Semantic context is necessary. Fast SQL is necessary. Scoped agent tools are necessary.&lt;/p&gt;
&lt;p&gt;That combination is exactly why the Agentic Lakehouse is becoming the right framing. It describes the platform you need when AI agents stop answering isolated questions and start participating in analytical workflows.&lt;/p&gt;
&lt;p&gt;For more background on the lakehouse and AI side of this work, explore my books on data lakehouses and AI at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;. If you want to try this style of governed, open, agent-ready architecture in practice, start a free trial of Dremio&apos;s Agentic Lakehouse at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Field notes for teams evaluating this now&lt;/h2&gt;
&lt;p&gt;First, make compatibility visible. A table-format version, catalog endpoint, and engine release should appear in your runbook. If a production issue happens, nobody should have to guess which engine wrote the latest snapshot or which client introduced a metadata change.&lt;/p&gt;
&lt;p&gt;Second, keep the semantic layer close to the workflow. If the article topic affects analytics agents, customer-facing metrics, financial reporting, or regulated data, raw-table access should be the exception. Certified views should be the normal path.&lt;/p&gt;
&lt;p&gt;Third, separate experimentation from certification. Engineers need sandboxes where they can test new Iceberg features, catalog options, and agent tools. Business users and agents need certified surfaces where definitions, owners, and policies have already been reviewed.&lt;/p&gt;
&lt;p&gt;Fourth, keep the architecture open. Not every byte must move into one platform. An architecture that can query data in place, add semantic context, accelerate common workloads, and expose governed agent interfaces over open data creates more flexibility.&lt;/p&gt;
&lt;p&gt;Fifth, publish the limits. If a feature is read-only in one engine, say so. If write interoperability is approved only for append workloads, say so. If remote signing is required for regulated tables, say so. Clear limits create trust. Hidden limits create incidents.&lt;/p&gt;
&lt;h2&gt;Identity and access review&lt;/h2&gt;
&lt;p&gt;For Python Apache Iceberg tooling, I would run one full dry run with production-like identities. Use an analyst identity, a service account, and the intended agent identity. Confirm that each identity sees only the expected semantic objects, receives predictable errors, and leaves useful audit records. That test catches policy gaps before they become production incidents.&lt;/p&gt;
&lt;p&gt;The agent identity matters most because it is easy to over-permission during a pilot. If the agent only needs a certified revenue view, do not give it namespace-wide table discovery. If the agent needs row-level access for one geography, test that a second geography returns a denial instead of silent leakage.&lt;/p&gt;
&lt;h2&gt;Documentation that actually helps&lt;/h2&gt;
&lt;p&gt;The documentation should fit on one page. Name the owner, the supported engines, the catalog authority, the accepted table operations, the security model, and the rollback path. If a new engineer cannot understand the contract for Python Apache Iceberg tooling from that page, the architecture is still too implicit.&lt;/p&gt;
&lt;p&gt;Good documentation is not a wiki dump. It is an operating contract. It should say who can approve a schema change, which engine owns compaction, how long snapshots are retained, and what happens when an agent produces a suspicious result. That level of detail is what turns a promising pattern into a maintainable system.&lt;/p&gt;
&lt;h2&gt;How to keep agents in bounds&lt;/h2&gt;
&lt;p&gt;Agents should not receive broad table access just because a human can ask broad questions. For Python Apache Iceberg tooling, expose narrow tools over certified views first. Add write-capable tools only after you have validation rules, idempotency keys, approval gates, and audit records that a reviewer can follow.&lt;/p&gt;
&lt;p&gt;The tool description should also be honest. If a tool returns estimated data, say estimated. If a tool excludes delayed transactions, say that. If a tool is read-only, make that clear in the name and policy. Agents work better when the interface gives them fewer chances to infer the wrong contract.&lt;/p&gt;
&lt;h2&gt;What to measure after launch&lt;/h2&gt;
&lt;p&gt;The first production month should be measurement-heavy. Track planning time, query latency, failed commits, denied access attempts, credential issuance, snapshot growth, and semantic-view usage. If Python Apache Iceberg tooling is helping, the evidence should show up in fewer manual workarounds and clearer operational ownership.&lt;/p&gt;
&lt;p&gt;I would also track human trust signals. Are analysts using the certified view more often? Are engineers filing fewer tickets about unclear table ownership? Are agents producing answers that reviewers can trace back to approved definitions? Those signals tell you whether the architecture is improving daily work, not just passing a benchmark.&lt;/p&gt;
&lt;h2&gt;A buyer question worth asking&lt;/h2&gt;
&lt;p&gt;The buyer question is simple: does this pattern increase choice without weakening governance? For Python Apache Iceberg tooling, the best answer is specific. It should name the table format, catalog contract, semantic surface, security controls, and engine support matrix. Anything less is a demo, not an operating model.&lt;/p&gt;
&lt;p&gt;This is where the architecture should stay disciplined. The point is not that open architecture is automatically better. The point is that open architecture gives you room to test engines, keep data in place, add semantic context, and still maintain control. That is a stronger argument than a generic platform claim.&lt;/p&gt;
&lt;h2&gt;A realistic rollout sequence&lt;/h2&gt;
&lt;p&gt;The rollout should start with read visibility, then move to operational automation, then consider action loops. For Python Apache Iceberg tooling, the first milestone is a certified read path with approved semantics. The second milestone is repeatable validation through CI or scheduled checks. The third milestone is agent access with narrow tools and strict audit.&lt;/p&gt;
&lt;p&gt;Write paths should come later unless the topic itself is about write interoperability or table maintenance. Even then, begin with append-only or isolated writes. Updates, deletes, merges, and external actions need stronger controls because they change the state other people depend on.&lt;/p&gt;
&lt;h2&gt;How this should sound to executives&lt;/h2&gt;
&lt;p&gt;The executive version should avoid implementation trivia, but it should not become vague. Say that Python Apache Iceberg tooling helps the company keep analytical data open, governed, and ready for AI-assisted work. Then say what the team will measure: cost, speed, correctness, access control, and operational effort.&lt;/p&gt;
&lt;p&gt;That framing is useful because executives do not need every catalog detail. They do need to know whether the architecture reduces lock-in, improves reliability, and gives agents a trustworthy data foundation. Those are business outcomes tied to technical choices.&lt;/p&gt;
&lt;h2&gt;How this should sound to engineers&lt;/h2&gt;
&lt;p&gt;The engineering version should be blunt. Which APIs are used? Which engine versions are approved? Which table operations are allowed? Which failures are retried? Which failures stop the workflow? Which logs prove that the right identity performed the right operation?&lt;/p&gt;
&lt;p&gt;For Python Apache Iceberg tooling, those questions are more valuable than broad claims. They force the team to define the boundary between the open standard, the vendor implementation, the query engine, the semantic model, and the agent tool.&lt;/p&gt;
&lt;h2&gt;What not to automate yet&lt;/h2&gt;
&lt;p&gt;Do not automate the parts of Python Apache Iceberg tooling that the team cannot explain manually. If nobody can explain the metric, the agent should not calculate it. If nobody can explain rollback, the agent should not write. If nobody can explain the security boundary, the tool should stay internal.&lt;/p&gt;
&lt;p&gt;This is not anti-automation. It is how automation earns trust. Automate the parts with clear contracts first, then widen the scope as evidence accumulates.&lt;/p&gt;
&lt;h2&gt;Source-of-truth ownership&lt;/h2&gt;
&lt;p&gt;Every production rollout needs one named source of truth for each layer. The table has an owner. The catalog has an owner. The semantic view has an owner. The agent tool has an owner. For Python Apache Iceberg tooling, those owners may sit on different teams, but the contract between them has to be explicit.&lt;/p&gt;
&lt;p&gt;Clear ownership across all layers keeps the architecture credible, whether the governed execution and semantic layer lives in one platform or across several independent services.&lt;/p&gt;
&lt;p&gt;Clear ownership prevents avoidable production confusion.&lt;/p&gt;
&lt;h2&gt;Review cadence&lt;/h2&gt;
&lt;p&gt;Set a review cadence before the first production launch. For Python Apache Iceberg tooling, I would review the contract after the first week, after the first month, and after the first engine or catalog upgrade. Most problems appear when a workflow that worked in a pilot meets a new version, a new identity, or a new business definition.&lt;/p&gt;
&lt;p&gt;That review should include both platform engineers and business owners. Engineers can verify the mechanics. Business owners can verify that the answers still mean what the company thinks they mean.&lt;/p&gt;
&lt;h2&gt;Launch criteria&lt;/h2&gt;
&lt;p&gt;The launch criteria should be binary. Either Python Apache Iceberg tooling has a named owner, passing validation checks, approved security boundaries, working rollback, and documented engine support, or it is not ready. Gray areas are acceptable in a research project. They are expensive in production.&lt;/p&gt;
&lt;p&gt;This keeps the article&apos;s recommendation practical: prove the contract first, then widen adoption.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>REST Catalog Credential Vending for Lakehouse Security</title><link>https://iceberglakehouse.com/posts/rest-catalog-credential-vending-secure-lakehouse-storage/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/rest-catalog-credential-vending-secure-lakehouse-storage/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-rest-catalog-credential-vending-...</description><pubDate>Mon, 08 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-rest-catalog-credential-vending-secure-lakehouse-storage/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Credential vending lets the catalog issue short-lived storage access instead of spreading permanent cloud keys across every engine. That is the useful lens for REST catalog credential vending in June 2026. The market is not short on announcements. What matters is whether the new pattern changes ownership, performance, governance, and agent readiness in a way your team can operate.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/rest-catalog-credential-vending-secure-lakehouse-storage-diagram-1.png&quot; alt=&quot;REST catalog credential vending architecture diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The market signal behind REST catalog credential vending&lt;/h2&gt;
&lt;p&gt;Open lakehouses often fail at the security boundary between catalogs and object storage. Engines need to read files, but giving every engine long-lived S3, ADLS, or GCS credentials creates a wide blast radius.&lt;/p&gt;
&lt;p&gt;I care about this topic because it sits at the boundary between open data architecture and AI execution. Most companies are not choosing one engine for every workload anymore. They have warehouses, lakehouse engines, streaming systems, catalogs, metadata platforms, and now agents that ask for data through tools. The shared contract between those systems matters more than any single feature checkbox.&lt;/p&gt;
&lt;p&gt;The vendor-neutral reading is straightforward. If the underlying table and catalog standards get stronger, buyers get more freedom to choose the right engine for each job. Snowflake, Microsoft, ClickHouse, Atlan, Dremio, and the open-source Iceberg ecosystem all point to the same market reality: data platforms are becoming multi-engine and agent-facing.&lt;/p&gt;
&lt;h2&gt;How the architecture works&lt;/h2&gt;
&lt;p&gt;The REST catalog receives a request from an authenticated client and evaluates table identity, user identity, and policy.&lt;/p&gt;
&lt;p&gt;If the request is allowed, the catalog returns scoped credentials with a short lifetime and a narrow storage scope.&lt;/p&gt;
&lt;p&gt;The engine uses those credentials to read or write only the paths needed for the approved table operation.&lt;/p&gt;
&lt;p&gt;The important architectural habit is to separate responsibilities. The table format manages files, snapshots, schema evolution, and table metadata. The catalog manages identity, namespaces, commits, and access patterns. The query engine plans and executes work. The semantic layer maps raw data into business meaning. The agent interface decides which safe tools a model can call.&lt;/p&gt;
&lt;p&gt;That separation keeps the system honest. If a vendor says a workload is open, ask which layer is open. If a feature supports Iceberg, ask which Iceberg version, which operations, and which engines. If an agent can query data, ask whether it is querying raw tables or certified semantic views.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/rest-catalog-credential-vending-secure-lakehouse-storage-diagram-2.png&quot; alt=&quot;Operating model diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;A concrete operating example&lt;/h2&gt;
&lt;p&gt;A Spark job that appends to one Iceberg table does not need account-wide object store access. With credential vending, it receives temporary access scoped to that table&apos;s storage locations.&lt;/p&gt;
&lt;p&gt;That example is intentionally operational. Architecture diagrams are useful, but the design only proves itself when a real workload runs through it. I want to know who owns the table, which catalog authorizes the operation, which engine writes, which engine reads, which semantic view users see, and how the team detects a bad result.&lt;/p&gt;
&lt;p&gt;For agentic analytics, the same example gets stricter. A human analyst can notice ambiguity and ask a teammate. An agent will often keep going unless the tool interface stops it. That means your architecture needs approved definitions, scoped access, query limits, logging, and a clean rollback path before it needs a flashy chat experience.&lt;/p&gt;
&lt;p&gt;This is why I do not treat open table formats as the whole story. Apache Iceberg gives the platform a strong storage contract. It does not, by itself, define customer lifetime value, revenue recognition rules, data owner approval, or what an AI agent may do after it finds an anomaly. Those rules belong in catalogs, semantic layers, governance systems, and agent tools.&lt;/p&gt;
&lt;h2&gt;What this means for the lakehouse&lt;/h2&gt;
&lt;p&gt;A lakehouse platform needs five capabilities to serve agents reliably: query federation to reduce data movement; autonomous performance using Reflections, caching, and table optimization so interactive loops stay fast; an AI Semantic Layer that gives agents approved business context; agentic interfaces through the UI, Python, or MCP-connected tools; and AI SQL functions that bring model-assisted work into SQL without exporting data.&lt;/p&gt;
&lt;h2&gt;Implementation checklist&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;What to document&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Table contract&lt;/td&gt;
&lt;td&gt;Format version, schema rules, snapshot policy, and rollback plan&lt;/td&gt;
&lt;td&gt;Engines need the same understanding of the table.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalog authority&lt;/td&gt;
&lt;td&gt;Production catalog, namespaces, commit rules, and role model&lt;/td&gt;
&lt;td&gt;Multi-engine systems need one source of table truth.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine matrix&lt;/td&gt;
&lt;td&gt;Read, write, merge, delete, schema, and view support by engine&lt;/td&gt;
&lt;td&gt;A feature is not production-ready until the exact operation is tested.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic layer&lt;/td&gt;
&lt;td&gt;Certified views, metric definitions, owners, and labels&lt;/td&gt;
&lt;td&gt;Agents need business meaning, not raw schemas alone.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Credential model, token lifetime, row filters, column masks, and audit logs&lt;/td&gt;
&lt;td&gt;Open access still needs strict governance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations&lt;/td&gt;
&lt;td&gt;Compaction, vacuum, retries, alerting, and incident ownership&lt;/td&gt;
&lt;td&gt;The design must survive failed jobs and bad deploys.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;My practical checklist for this topic is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Set short token lifetimes and avoid broad bucket scopes.&lt;/li&gt;
&lt;li&gt;Tie issued credentials to catalog identity and table identity.&lt;/li&gt;
&lt;li&gt;Log credential issuance and storage operation intent.&lt;/li&gt;
&lt;li&gt;Use remote signing for datasets where clients should never see storage tokens.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those items are not written down, the project is still in the demo stage. That does not mean the idea is weak. It means the operating model is not finished.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/rest-catalog-credential-vending-secure-lakehouse-storage-diagram-3.png&quot; alt=&quot;Implementation checklist diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Failure modes worth respecting&lt;/h2&gt;
&lt;p&gt;Vended credentials are still credentials. A compromised client can use them until they expire. For highly regulated datasets, remote signing may be stricter because the client never receives storage tokens.&lt;/p&gt;
&lt;p&gt;The other failure mode is semantic drift. A table can be technically valid while the business definition on top of it changes quietly. That is where many AI analytics projects fail. The model generates SQL against a table that exists, the query returns rows, and the answer looks plausible. The problem is that the answer used the wrong grain, the wrong filter, or the wrong metric definition.&lt;/p&gt;
&lt;p&gt;The fix is not a longer prompt. The fix is stronger data contracts. Certified semantic views should be easier for agents to use than raw tables. Sensitive columns should be masked or hidden before the model can ask for them. Write-capable tools should require intent, validation, and idempotency. Expensive queries should have limits. Every tool call should leave evidence.&lt;/p&gt;
&lt;p&gt;This is also where vendor-neutral thinking helps. Do not trust a platform because it has the best demo. Trust the platform when it gives you clear contracts between storage, catalog, semantic layer, engine, and agent. Trust it more when you can test those contracts with another engine or another client.&lt;/p&gt;
&lt;h2&gt;What I would do first&lt;/h2&gt;
&lt;p&gt;Start with one production-shaped workflow. Do not start with the easiest toy table, and do not start with the most politically sensitive workload. Pick a table or semantic view that matters, has an owner, has known correctness checks, and can tolerate a controlled pilot.&lt;/p&gt;
&lt;p&gt;For REST catalog credential vending, I would write down five things before touching production: the owner, the accepted engines, the policy boundary, the rollback path, and the agent-facing interface. Then I would run the same workflow three ways: manually, through the intended query engine, and through the agent or automation layer. Differences between those paths are where the real work begins.&lt;/p&gt;
&lt;p&gt;Measure boring things. Count files. Count snapshots. Track query planning time. Track storage calls. Track failed commits. Track token issuance. Track denied access. Track whether a human can explain the result without reading tool logs for an hour. These metrics are not glamorous, but they tell you whether the architecture is ready.&lt;/p&gt;
&lt;h2&gt;Final recommendation&lt;/h2&gt;
&lt;p&gt;The right conclusion is not that every team should adopt every June 2026 feature immediately. The right conclusion is that the lakehouse is becoming an execution surface for humans and agents, and that changes the quality bar. Open storage is necessary. Governed catalogs are necessary. Semantic context is necessary. Fast SQL is necessary. Scoped agent tools are necessary.&lt;/p&gt;
&lt;p&gt;That combination is exactly why the Agentic Lakehouse is becoming the right framing. It describes the platform you need when AI agents stop answering isolated questions and start participating in analytical workflows.&lt;/p&gt;
&lt;p&gt;For more background on the lakehouse and AI side of this work, explore my books on data lakehouses and AI at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;. If you want to try this style of governed, open, agent-ready architecture in practice, start a free trial of Dremio&apos;s Agentic Lakehouse at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Field notes for teams evaluating this now&lt;/h2&gt;
&lt;p&gt;First, make compatibility visible. A table-format version, catalog endpoint, and engine release should appear in your runbook. If a production issue happens, nobody should have to guess which engine wrote the latest snapshot or which client introduced a metadata change.&lt;/p&gt;
&lt;p&gt;Second, keep the semantic layer close to the workflow. If the article topic affects analytics agents, customer-facing metrics, financial reporting, or regulated data, raw-table access should be the exception. Certified views should be the normal path.&lt;/p&gt;
&lt;p&gt;Third, separate experimentation from certification. Engineers need sandboxes where they can test new Iceberg features, catalog options, and agent tools. Business users and agents need certified surfaces where definitions, owners, and policies have already been reviewed.&lt;/p&gt;
&lt;p&gt;Fourth, keep the architecture open. Not every byte must move into one platform. An architecture that can query data in place, add semantic context, accelerate common workloads, and expose governed agent interfaces over open data creates more flexibility.&lt;/p&gt;
&lt;p&gt;Fifth, publish the limits. If a feature is read-only in one engine, say so. If write interoperability is approved only for append workloads, say so. If remote signing is required for regulated tables, say so. Clear limits create trust. Hidden limits create incidents.&lt;/p&gt;
&lt;h2&gt;Identity and access review&lt;/h2&gt;
&lt;p&gt;For REST catalog credential vending, I would run one full dry run with production-like identities. Use an analyst identity, a service account, and the intended agent identity. Confirm that each identity sees only the expected semantic objects, receives predictable errors, and leaves useful audit records. That test catches policy gaps before they become production incidents.&lt;/p&gt;
&lt;p&gt;The agent identity matters most because it is easy to over-permission during a pilot. If the agent only needs a certified revenue view, do not give it namespace-wide table discovery. If the agent needs row-level access for one geography, test that a second geography returns a denial instead of silent leakage.&lt;/p&gt;
&lt;h2&gt;Documentation that actually helps&lt;/h2&gt;
&lt;p&gt;The documentation should fit on one page. Name the owner, the supported engines, the catalog authority, the accepted table operations, the security model, and the rollback path. If a new engineer cannot understand the contract for REST catalog credential vending from that page, the architecture is still too implicit.&lt;/p&gt;
&lt;p&gt;Good documentation is not a wiki dump. It is an operating contract. It should say who can approve a schema change, which engine owns compaction, how long snapshots are retained, and what happens when an agent produces a suspicious result. That level of detail is what turns a promising pattern into a maintainable system.&lt;/p&gt;
&lt;h2&gt;How to keep agents in bounds&lt;/h2&gt;
&lt;p&gt;Agents should not receive broad table access just because a human can ask broad questions. For REST catalog credential vending, expose narrow tools over certified views first. Add write-capable tools only after you have validation rules, idempotency keys, approval gates, and audit records that a reviewer can follow.&lt;/p&gt;
&lt;p&gt;The tool description should also be honest. If a tool returns estimated data, say estimated. If a tool excludes delayed transactions, say that. If a tool is read-only, make that clear in the name and policy. Agents work better when the interface gives them fewer chances to infer the wrong contract.&lt;/p&gt;
&lt;h2&gt;What to measure after launch&lt;/h2&gt;
&lt;p&gt;The first production month should be measurement-heavy. Track planning time, query latency, failed commits, denied access attempts, credential issuance, snapshot growth, and semantic-view usage. If REST catalog credential vending is helping, the evidence should show up in fewer manual workarounds and clearer operational ownership.&lt;/p&gt;
&lt;p&gt;I would also track human trust signals. Are analysts using the certified view more often? Are engineers filing fewer tickets about unclear table ownership? Are agents producing answers that reviewers can trace back to approved definitions? Those signals tell you whether the architecture is improving daily work, not just passing a benchmark.&lt;/p&gt;
&lt;h2&gt;A buyer question worth asking&lt;/h2&gt;
&lt;p&gt;The buyer question is simple: does this pattern increase choice without weakening governance? For REST catalog credential vending, the best answer is specific. It should name the table format, catalog contract, semantic surface, security controls, and engine support matrix. Anything less is a demo, not an operating model.&lt;/p&gt;
&lt;p&gt;This is where the architecture should stay disciplined. The point is not that open architecture is automatically better. The point is that open architecture gives you room to test engines, keep data in place, add semantic context, and still maintain control. That is a stronger argument than a generic platform claim.&lt;/p&gt;
&lt;h2&gt;A realistic rollout sequence&lt;/h2&gt;
&lt;p&gt;The rollout should start with read visibility, then move to operational automation, then consider action loops. For REST catalog credential vending, the first milestone is a certified read path with approved semantics. The second milestone is repeatable validation through CI or scheduled checks. The third milestone is agent access with narrow tools and strict audit.&lt;/p&gt;
&lt;p&gt;Write paths should come later unless the topic itself is about write interoperability or table maintenance. Even then, begin with append-only or isolated writes. Updates, deletes, merges, and external actions need stronger controls because they change the state other people depend on.&lt;/p&gt;
&lt;h2&gt;How this should sound to executives&lt;/h2&gt;
&lt;p&gt;The executive version should avoid implementation trivia, but it should not become vague. Say that REST catalog credential vending helps the company keep analytical data open, governed, and ready for AI-assisted work. Then say what the team will measure: cost, speed, correctness, access control, and operational effort.&lt;/p&gt;
&lt;p&gt;That framing is useful because executives do not need every catalog detail. They do need to know whether the architecture reduces lock-in, improves reliability, and gives agents a trustworthy data foundation. Those are business outcomes tied to technical choices.&lt;/p&gt;
&lt;h2&gt;How this should sound to engineers&lt;/h2&gt;
&lt;p&gt;The engineering version should be blunt. Which APIs are used? Which engine versions are approved? Which table operations are allowed? Which failures are retried? Which failures stop the workflow? Which logs prove that the right identity performed the right operation?&lt;/p&gt;
&lt;p&gt;For REST catalog credential vending, those questions are more valuable than broad claims. They force the team to define the boundary between the open standard, the vendor implementation, the query engine, the semantic model, and the agent tool.&lt;/p&gt;
&lt;h2&gt;What not to automate yet&lt;/h2&gt;
&lt;p&gt;Do not automate the parts of REST catalog credential vending that the team cannot explain manually. If nobody can explain the metric, the agent should not calculate it. If nobody can explain rollback, the agent should not write. If nobody can explain the security boundary, the tool should stay internal.&lt;/p&gt;
&lt;p&gt;This is not anti-automation. It is how automation earns trust. Automate the parts with clear contracts first, then widen the scope as evidence accumulates.&lt;/p&gt;
&lt;h2&gt;Source-of-truth ownership&lt;/h2&gt;
&lt;p&gt;Every production rollout needs one named source of truth for each layer. The table has an owner. The catalog has an owner. The semantic view has an owner. The agent tool has an owner. For REST catalog credential vending, those owners may sit on different teams, but the contract between them has to be explicit.&lt;/p&gt;
&lt;p&gt;Clear ownership across all layers keeps the architecture credible, whether the governed execution and semantic layer lives in one platform or across several independent services.&lt;/p&gt;
&lt;p&gt;Clear ownership prevents avoidable production confusion.&lt;/p&gt;
&lt;h2&gt;Review cadence&lt;/h2&gt;
&lt;p&gt;Set a review cadence before the first production launch. For REST catalog credential vending, I would review the contract after the first week, after the first month, and after the first engine or catalog upgrade. Most problems appear when a workflow that worked in a pilot meets a new version, a new identity, or a new business definition.&lt;/p&gt;
&lt;p&gt;That review should include both platform engineers and business owners. Engineers can verify the mechanics. Business owners can verify that the answers still mean what the company thinks they mean.&lt;/p&gt;
&lt;h2&gt;Launch criteria&lt;/h2&gt;
&lt;p&gt;The launch criteria should be binary. Either REST catalog credential vending has a named owner, passing validation checks, approved security boundaries, working rollback, and documented engine support, or it is not ready. Gray areas are acceptable in a research project. They are expensive in production.&lt;/p&gt;
&lt;p&gt;This keeps the article&apos;s recommendation practical: prove the contract first, then widen adoption.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>SaaS Buyers Now Inspect Your Semantic Layer</title><link>https://iceberglakehouse.com/posts/saas-procurement-semantic-layer-over-dashboards/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/saas-procurement-semantic-layer-over-dashboards/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-saas-procurement-semantic-layer-...</description><pubDate>Mon, 08 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-saas-procurement-semantic-layer-over-dashboards/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Enterprise SaaS buyers increasingly want machine-readable data contracts, not only dashboards. That is the useful lens for SaaS semantic layer procurement in June 2026. The market is not short on announcements. What matters is whether the new pattern changes ownership, performance, governance, and agent readiness in a way your team can operate.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/saas-procurement-semantic-layer-over-dashboards-diagram-1.png&quot; alt=&quot;SaaS semantic layer procurement architecture diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The market signal behind SaaS semantic layer procurement&lt;/h2&gt;
&lt;p&gt;Dashboards still matter, but buyers now have internal agents that need governed access to product data. A vendor that can expose a clean semantic layer is easier to integrate into that buyer&apos;s AI workflow.&lt;/p&gt;
&lt;p&gt;I care about this topic because it sits at the boundary between open data architecture and AI execution. Most companies are not choosing one engine for every workload anymore. They have warehouses, lakehouse engines, streaming systems, catalogs, metadata platforms, and now agents that ask for data through tools. The shared contract between those systems matters more than any single feature checkbox.&lt;/p&gt;
&lt;p&gt;The vendor-neutral reading is straightforward. If the underlying table and catalog standards get stronger, buyers get more freedom to choose the right engine for each job. Snowflake, Microsoft, ClickHouse, Atlan, Dremio, and the open-source Iceberg ecosystem all point to the same market reality: data platforms are becoming multi-engine and agent-facing.&lt;/p&gt;
&lt;h2&gt;How the architecture works&lt;/h2&gt;
&lt;p&gt;Machine-readable schemas let customer-side agents discover available objects.&lt;/p&gt;
&lt;p&gt;MCP-style tool interfaces can expose safe actions and data retrieval paths.&lt;/p&gt;
&lt;p&gt;Semantic contracts explain metrics, dimensions, grain, freshness, and permission boundaries.&lt;/p&gt;
&lt;p&gt;The important architectural habit is to separate responsibilities. The table format manages files, snapshots, schema evolution, and table metadata. The catalog manages identity, namespaces, commits, and access patterns. The query engine plans and executes work. The semantic layer maps raw data into business meaning. The agent interface decides which safe tools a model can call.&lt;/p&gt;
&lt;p&gt;That separation keeps the system honest. If a vendor says a workload is open, ask which layer is open. If a feature supports Iceberg, ask which Iceberg version, which operations, and which engines. If an agent can query data, ask whether it is querying raw tables or certified semantic views.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/saas-procurement-semantic-layer-over-dashboards-diagram-2.png&quot; alt=&quot;Operating model diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;A concrete operating example&lt;/h2&gt;
&lt;p&gt;A procurement team evaluating a customer-success platform may ask whether their internal agent can query renewal risk by segment, account owner, and product usage without exporting CSVs or scraping dashboards.&lt;/p&gt;
&lt;p&gt;That example is intentionally operational. Architecture diagrams are useful, but the design only proves itself when a real workload runs through it. I want to know who owns the table, which catalog authorizes the operation, which engine writes, which engine reads, which semantic view users see, and how the team detects a bad result.&lt;/p&gt;
&lt;p&gt;For agentic analytics, the same example gets stricter. A human analyst can notice ambiguity and ask a teammate. An agent will often keep going unless the tool interface stops it. That means your architecture needs approved definitions, scoped access, query limits, logging, and a clean rollback path before it needs a flashy chat experience.&lt;/p&gt;
&lt;p&gt;This is why I do not treat open table formats as the whole story. Apache Iceberg gives the platform a strong storage contract. It does not, by itself, define customer lifetime value, revenue recognition rules, data owner approval, or what an AI agent may do after it finds an anomaly. Those rules belong in catalogs, semantic layers, governance systems, and agent tools.&lt;/p&gt;
&lt;h2&gt;What this means for the lakehouse&lt;/h2&gt;
&lt;p&gt;The market is moving value from static dashboards toward governed data access. The Agentic Lakehouse, semantic layer, and MCP-style interfaces align with what buyers are starting to ask vendors to provide.&lt;/p&gt;
&lt;p&gt;A lakehouse platform needs five capabilities to serve agents reliably: query federation to reduce data movement; autonomous performance using Reflections, caching, and table optimization so interactive loops stay fast; an AI Semantic Layer that gives agents approved business context; agentic interfaces through the UI, Python, or MCP-connected tools; and AI SQL functions that bring model-assisted work into SQL without exporting data.&lt;/p&gt;
&lt;h2&gt;Implementation checklist&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;What to document&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Table contract&lt;/td&gt;
&lt;td&gt;Format version, schema rules, snapshot policy, and rollback plan&lt;/td&gt;
&lt;td&gt;Engines need the same understanding of the table.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalog authority&lt;/td&gt;
&lt;td&gt;Production catalog, namespaces, commit rules, and role model&lt;/td&gt;
&lt;td&gt;Multi-engine systems need one source of table truth.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine matrix&lt;/td&gt;
&lt;td&gt;Read, write, merge, delete, schema, and view support by engine&lt;/td&gt;
&lt;td&gt;A feature is not production-ready until the exact operation is tested.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic layer&lt;/td&gt;
&lt;td&gt;Certified views, metric definitions, owners, and labels&lt;/td&gt;
&lt;td&gt;Agents need business meaning, not raw schemas alone.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Credential model, token lifetime, row filters, column masks, and audit logs&lt;/td&gt;
&lt;td&gt;Open access still needs strict governance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations&lt;/td&gt;
&lt;td&gt;Compaction, vacuum, retries, alerting, and incident ownership&lt;/td&gt;
&lt;td&gt;The design must survive failed jobs and bad deploys.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;My practical checklist for this topic is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Publish a semantic contract for your highest-value product data.&lt;/li&gt;
&lt;li&gt;Expose read-only tools before write-capable tools.&lt;/li&gt;
&lt;li&gt;Give each customer tenant scoped credentials and audit trails.&lt;/li&gt;
&lt;li&gt;Treat semantic correctness as part of the product SLA.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those items are not written down, the project is still in the demo stage. That does not mean the idea is weak. It means the operating model is not finished.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/saas-procurement-semantic-layer-over-dashboards-diagram-3.png&quot; alt=&quot;Implementation checklist diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Failure modes worth respecting&lt;/h2&gt;
&lt;p&gt;Opening semantic access creates support obligations. A vendor must define rate limits, scopes, audit logs, data retention, and escalation paths.&lt;/p&gt;
&lt;p&gt;The other failure mode is semantic drift. A table can be technically valid while the business definition on top of it changes quietly. That is where many AI analytics projects fail. The model generates SQL against a table that exists, the query returns rows, and the answer looks plausible. The problem is that the answer used the wrong grain, the wrong filter, or the wrong metric definition.&lt;/p&gt;
&lt;p&gt;The fix is not a longer prompt. The fix is stronger data contracts. Certified semantic views should be easier for agents to use than raw tables. Sensitive columns should be masked or hidden before the model can ask for them. Write-capable tools should require intent, validation, and idempotency. Expensive queries should have limits. Every tool call should leave evidence.&lt;/p&gt;
&lt;p&gt;This is also where vendor-neutral thinking helps. Do not trust a platform because it has the best demo. Trust the platform when it gives you clear contracts between storage, catalog, semantic layer, engine, and agent. Trust it more when you can test those contracts with another engine or another client.&lt;/p&gt;
&lt;h2&gt;What I would do first&lt;/h2&gt;
&lt;p&gt;Start with one production-shaped workflow. Do not start with the easiest toy table, and do not start with the most politically sensitive workload. Pick a table or semantic view that matters, has an owner, has known correctness checks, and can tolerate a controlled pilot.&lt;/p&gt;
&lt;p&gt;For SaaS semantic layer procurement, I would write down five things before touching production: the owner, the accepted engines, the policy boundary, the rollback path, and the agent-facing interface. Then I would run the same workflow three ways: manually, through the intended query engine, and through the agent or automation layer. Differences between those paths are where the real work begins.&lt;/p&gt;
&lt;p&gt;Measure boring things. Count files. Count snapshots. Track query planning time. Track storage calls. Track failed commits. Track token issuance. Track denied access. Track whether a human can explain the result without reading tool logs for an hour. These metrics are not glamorous, but they tell you whether the architecture is ready.&lt;/p&gt;
&lt;h2&gt;Final recommendation&lt;/h2&gt;
&lt;p&gt;The right conclusion is not that every team should adopt every June 2026 feature immediately. The right conclusion is that the lakehouse is becoming an execution surface for humans and agents, and that changes the quality bar. Open storage is necessary. Governed catalogs are necessary. Semantic context is necessary. Fast SQL is necessary. Scoped agent tools are necessary.&lt;/p&gt;
&lt;p&gt;That combination is exactly why the Agentic Lakehouse is becoming the right framing. It describes the platform you need when AI agents stop answering isolated questions and start participating in analytical workflows.&lt;/p&gt;
&lt;p&gt;For more background on the lakehouse and AI side of this work, explore my books on data lakehouses and AI at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;. If you want to try this style of governed, open, agent-ready architecture in practice, start a free trial of Dremio&apos;s Agentic Lakehouse at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Field notes for teams evaluating this now&lt;/h2&gt;
&lt;p&gt;First, make compatibility visible. A table-format version, catalog endpoint, and engine release should appear in your runbook. If a production issue happens, nobody should have to guess which engine wrote the latest snapshot or which client introduced a metadata change.&lt;/p&gt;
&lt;p&gt;Second, keep the semantic layer close to the workflow. If the article topic affects analytics agents, customer-facing metrics, financial reporting, or regulated data, raw-table access should be the exception. Certified views should be the normal path.&lt;/p&gt;
&lt;p&gt;Third, separate experimentation from certification. Engineers need sandboxes where they can test new Iceberg features, catalog options, and agent tools. Business users and agents need certified surfaces where definitions, owners, and policies have already been reviewed.&lt;/p&gt;
&lt;p&gt;Fourth, keep the architecture open. Not every byte must move into one platform. An architecture that can query data in place, add semantic context, accelerate common workloads, and expose governed agent interfaces over open data creates more flexibility.&lt;/p&gt;
&lt;p&gt;Fifth, publish the limits. If a feature is read-only in one engine, say so. If write interoperability is approved only for append workloads, say so. If remote signing is required for regulated tables, say so. Clear limits create trust. Hidden limits create incidents.&lt;/p&gt;
&lt;h2&gt;Identity and access review&lt;/h2&gt;
&lt;p&gt;For SaaS semantic layer procurement, I would run one full dry run with production-like identities. Use an analyst identity, a service account, and the intended agent identity. Confirm that each identity sees only the expected semantic objects, receives predictable errors, and leaves useful audit records. That test catches policy gaps before they become production incidents.&lt;/p&gt;
&lt;p&gt;The agent identity matters most because it is easy to over-permission during a pilot. If the agent only needs a certified revenue view, do not give it namespace-wide table discovery. If the agent needs row-level access for one geography, test that a second geography returns a denial instead of silent leakage.&lt;/p&gt;
&lt;h2&gt;Documentation that actually helps&lt;/h2&gt;
&lt;p&gt;The documentation should fit on one page. Name the owner, the supported engines, the catalog authority, the accepted table operations, the security model, and the rollback path. If a new engineer cannot understand the contract for SaaS semantic layer procurement from that page, the architecture is still too implicit.&lt;/p&gt;
&lt;p&gt;Good documentation is not a wiki dump. It is an operating contract. It should say who can approve a schema change, which engine owns compaction, how long snapshots are retained, and what happens when an agent produces a suspicious result. That level of detail is what turns a promising pattern into a maintainable system.&lt;/p&gt;
&lt;h2&gt;How to keep agents in bounds&lt;/h2&gt;
&lt;p&gt;Agents should not receive broad table access just because a human can ask broad questions. For SaaS semantic layer procurement, expose narrow tools over certified views first. Add write-capable tools only after you have validation rules, idempotency keys, approval gates, and audit records that a reviewer can follow.&lt;/p&gt;
&lt;p&gt;The tool description should also be honest. If a tool returns estimated data, say estimated. If a tool excludes delayed transactions, say that. If a tool is read-only, make that clear in the name and policy. Agents work better when the interface gives them fewer chances to infer the wrong contract.&lt;/p&gt;
&lt;h2&gt;What to measure after launch&lt;/h2&gt;
&lt;p&gt;The first production month should be measurement-heavy. Track planning time, query latency, failed commits, denied access attempts, credential issuance, snapshot growth, and semantic-view usage. If SaaS semantic layer procurement is helping, the evidence should show up in fewer manual workarounds and clearer operational ownership.&lt;/p&gt;
&lt;p&gt;I would also track human trust signals. Are analysts using the certified view more often? Are engineers filing fewer tickets about unclear table ownership? Are agents producing answers that reviewers can trace back to approved definitions? Those signals tell you whether the architecture is improving daily work, not just passing a benchmark.&lt;/p&gt;
&lt;h2&gt;A buyer question worth asking&lt;/h2&gt;
&lt;p&gt;The buyer question is simple: does this pattern increase choice without weakening governance? For SaaS semantic layer procurement, the best answer is specific. It should name the table format, catalog contract, semantic surface, security controls, and engine support matrix. Anything less is a demo, not an operating model.&lt;/p&gt;
&lt;p&gt;This is where the architecture should stay disciplined. The point is not that open architecture is automatically better. The point is that open architecture gives you room to test engines, keep data in place, add semantic context, and still maintain control. That is a stronger argument than a generic platform claim.&lt;/p&gt;
&lt;h2&gt;A realistic rollout sequence&lt;/h2&gt;
&lt;p&gt;The rollout should start with read visibility, then move to operational automation, then consider action loops. For SaaS semantic layer procurement, the first milestone is a certified read path with approved semantics. The second milestone is repeatable validation through CI or scheduled checks. The third milestone is agent access with narrow tools and strict audit.&lt;/p&gt;
&lt;p&gt;Write paths should come later unless the topic itself is about write interoperability or table maintenance. Even then, begin with append-only or isolated writes. Updates, deletes, merges, and external actions need stronger controls because they change the state other people depend on.&lt;/p&gt;
&lt;h2&gt;How this should sound to executives&lt;/h2&gt;
&lt;p&gt;The executive version should avoid implementation trivia, but it should not become vague. Say that SaaS semantic layer procurement helps the company keep analytical data open, governed, and ready for AI-assisted work. Then say what the team will measure: cost, speed, correctness, access control, and operational effort.&lt;/p&gt;
&lt;p&gt;That framing is useful because executives do not need every catalog detail. They do need to know whether the architecture reduces lock-in, improves reliability, and gives agents a trustworthy data foundation. Those are business outcomes tied to technical choices.&lt;/p&gt;
&lt;h2&gt;How this should sound to engineers&lt;/h2&gt;
&lt;p&gt;The engineering version should be blunt. Which APIs are used? Which engine versions are approved? Which table operations are allowed? Which failures are retried? Which failures stop the workflow? Which logs prove that the right identity performed the right operation?&lt;/p&gt;
&lt;p&gt;For SaaS semantic layer procurement, those questions are more valuable than broad claims. They force the team to define the boundary between the open standard, the vendor implementation, the query engine, the semantic model, and the agent tool.&lt;/p&gt;
&lt;h2&gt;What not to automate yet&lt;/h2&gt;
&lt;p&gt;Do not automate the parts of SaaS semantic layer procurement that the team cannot explain manually. If nobody can explain the metric, the agent should not calculate it. If nobody can explain rollback, the agent should not write. If nobody can explain the security boundary, the tool should stay internal.&lt;/p&gt;
&lt;p&gt;This is not anti-automation. It is how automation earns trust. Automate the parts with clear contracts first, then widen the scope as evidence accumulates.&lt;/p&gt;
&lt;h2&gt;Source-of-truth ownership&lt;/h2&gt;
&lt;p&gt;Every production rollout needs one named source of truth for each layer. The table has an owner. The catalog has an owner. The semantic view has an owner. The agent tool has an owner. For SaaS semantic layer procurement, those owners may sit on different teams, but the contract between them has to be explicit.&lt;/p&gt;
&lt;p&gt;Clear ownership across all layers keeps the architecture credible, whether the governed execution and semantic layer lives in one platform or across several independent services.&lt;/p&gt;
&lt;p&gt;Clear ownership prevents avoidable production confusion.&lt;/p&gt;
&lt;h2&gt;Review cadence&lt;/h2&gt;
&lt;p&gt;Set a review cadence before the first production launch. For SaaS semantic layer procurement, I would review the contract after the first week, after the first month, and after the first engine or catalog upgrade. Most problems appear when a workflow that worked in a pilot meets a new version, a new identity, or a new business definition.&lt;/p&gt;
&lt;p&gt;That review should include both platform engineers and business owners. Engineers can verify the mechanics. Business owners can verify that the answers still mean what the company thinks they mean.&lt;/p&gt;
&lt;h2&gt;Launch criteria&lt;/h2&gt;
&lt;p&gt;The launch criteria should be binary. Either SaaS semantic layer procurement has a named owner, passing validation checks, approved security boundaries, working rollback, and documented engine support, or it is not ready. Gray areas are acceptable in a research project. They are expensive in production.&lt;/p&gt;
&lt;p&gt;This keeps the article&apos;s recommendation practical: prove the contract first, then widen adoption.&lt;/p&gt;
&lt;h2&gt;Compliance evidence&lt;/h2&gt;
&lt;p&gt;Save the evidence. For SaaS semantic layer procurement, keep validation output, approval records, denied-access tests, and rollback proof with the release notes. Future audits are easier when the team can show what it tested before launch.&lt;/p&gt;
&lt;h2&gt;Test cases that matter&lt;/h2&gt;
&lt;p&gt;Use test cases that reflect real business questions. For SaaS semantic layer procurement, include at least one happy path, one denied-access path, one stale-data path, and one rollback path. Those tests reveal more than a generic demo query.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Securing Agent Identities in the Lakehouse</title><link>https://iceberglakehouse.com/posts/securing-agent-identities-lakehouse-token-exchange/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/securing-agent-identities-lakehouse-token-exchange/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-securing-agent-identities-lakeho...</description><pubDate>Mon, 08 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-securing-agent-identities-lakehouse-token-exchange/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Every lakehouse agent needs its own identity, scope, and audit trail. That is the useful lens for agent identities lakehouse in June 2026. The market is not short on announcements. What matters is whether the new pattern changes ownership, performance, governance, and agent readiness in a way your team can operate.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/securing-agent-identities-lakehouse-token-exchange-diagram-1.png&quot; alt=&quot;agent identities lakehouse architecture diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The market signal behind agent identities lakehouse&lt;/h2&gt;
&lt;p&gt;The worst agent security pattern is one shared super-user token. It works during a demo and fails every serious governance review. Agent identities should map to the work they are allowed to perform.&lt;/p&gt;
&lt;p&gt;I care about this topic because it sits at the boundary between open data architecture and AI execution. Most companies are not choosing one engine for every workload anymore. They have warehouses, lakehouse engines, streaming systems, catalogs, metadata platforms, and now agents that ask for data through tools. The shared contract between those systems matters more than any single feature checkbox.&lt;/p&gt;
&lt;p&gt;The vendor-neutral reading is straightforward. If the underlying table and catalog standards get stronger, buyers get more freedom to choose the right engine for each job. Snowflake, Microsoft, ClickHouse, Atlan, Dremio, and the open-source Iceberg ecosystem all point to the same market reality: data platforms are becoming multi-engine and agent-facing.&lt;/p&gt;
&lt;h2&gt;How the architecture works&lt;/h2&gt;
&lt;p&gt;A token exchange can convert a user or workload identity into a short-lived agent credential.&lt;/p&gt;
&lt;p&gt;Catalog roles define which tables, namespaces, and operations the agent can access.&lt;/p&gt;
&lt;p&gt;Column masks, row filters, and tool allow lists reduce what the model can see or do even after authentication succeeds.&lt;/p&gt;
&lt;p&gt;The important architectural habit is to separate responsibilities. The table format manages files, snapshots, schema evolution, and table metadata. The catalog manages identity, namespaces, commits, and access patterns. The query engine plans and executes work. The semantic layer maps raw data into business meaning. The agent interface decides which safe tools a model can call.&lt;/p&gt;
&lt;p&gt;That separation keeps the system honest. If a vendor says a workload is open, ask which layer is open. If a feature supports Iceberg, ask which Iceberg version, which operations, and which engines. If an agent can query data, ask whether it is querying raw tables or certified semantic views.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/securing-agent-identities-lakehouse-token-exchange-diagram-2.png&quot; alt=&quot;Operating model diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;A concrete operating example&lt;/h2&gt;
&lt;p&gt;A revenue explanation agent may read certified finance views but not raw invoices, tax identifiers, or customer emails. A data-quality remediation agent may write quarantine records but not update revenue tables.&lt;/p&gt;
&lt;p&gt;That example is intentionally operational. Architecture diagrams are useful, but the design only proves itself when a real workload runs through it. I want to know who owns the table, which catalog authorizes the operation, which engine writes, which engine reads, which semantic view users see, and how the team detects a bad result.&lt;/p&gt;
&lt;p&gt;For agentic analytics, the same example gets stricter. A human analyst can notice ambiguity and ask a teammate. An agent will often keep going unless the tool interface stops it. That means your architecture needs approved definitions, scoped access, query limits, logging, and a clean rollback path before it needs a flashy chat experience.&lt;/p&gt;
&lt;p&gt;This is why I do not treat open table formats as the whole story. Apache Iceberg gives the platform a strong storage contract. It does not, by itself, define customer lifetime value, revenue recognition rules, data owner approval, or what an AI agent may do after it finds an anomaly. Those rules belong in catalogs, semantic layers, governance systems, and agent tools.&lt;/p&gt;
&lt;h2&gt;What this means for the lakehouse&lt;/h2&gt;
&lt;p&gt;A semantic layer, catalog permissions, and agent interfaces can preserve governance while making data accessible to AI. The point is not to give agents more power. It is to give them the right power with evidence.&lt;/p&gt;
&lt;p&gt;A lakehouse platform needs five capabilities to serve agents reliably: query federation to reduce data movement; autonomous performance using Reflections, caching, and table optimization so interactive loops stay fast; an AI Semantic Layer that gives agents approved business context; agentic interfaces through the UI, Python, or MCP-connected tools; and AI SQL functions that bring model-assisted work into SQL without exporting data.&lt;/p&gt;
&lt;h2&gt;Implementation checklist&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;What to document&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Table contract&lt;/td&gt;
&lt;td&gt;Format version, schema rules, snapshot policy, and rollback plan&lt;/td&gt;
&lt;td&gt;Engines need the same understanding of the table.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalog authority&lt;/td&gt;
&lt;td&gt;Production catalog, namespaces, commit rules, and role model&lt;/td&gt;
&lt;td&gt;Multi-engine systems need one source of table truth.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine matrix&lt;/td&gt;
&lt;td&gt;Read, write, merge, delete, schema, and view support by engine&lt;/td&gt;
&lt;td&gt;A feature is not production-ready until the exact operation is tested.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic layer&lt;/td&gt;
&lt;td&gt;Certified views, metric definitions, owners, and labels&lt;/td&gt;
&lt;td&gt;Agents need business meaning, not raw schemas alone.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Credential model, token lifetime, row filters, column masks, and audit logs&lt;/td&gt;
&lt;td&gt;Open access still needs strict governance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations&lt;/td&gt;
&lt;td&gt;Compaction, vacuum, retries, alerting, and incident ownership&lt;/td&gt;
&lt;td&gt;The design must survive failed jobs and bad deploys.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;My practical checklist for this topic is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ban shared super-user tokens for production agents.&lt;/li&gt;
&lt;li&gt;Map every agent to a role, owner, and purpose.&lt;/li&gt;
&lt;li&gt;Use short-lived credentials with explicit scopes.&lt;/li&gt;
&lt;li&gt;Audit denied attempts as carefully as allowed attempts.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those items are not written down, the project is still in the demo stage. That does not mean the idea is weak. It means the operating model is not finished.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/securing-agent-identities-lakehouse-token-exchange-diagram-3.png&quot; alt=&quot;Implementation checklist diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Failure modes worth respecting&lt;/h2&gt;
&lt;p&gt;Granular identity adds setup work. Teams need naming conventions, role reviews, rotation, and incident response procedures.&lt;/p&gt;
&lt;p&gt;The other failure mode is semantic drift. A table can be technically valid while the business definition on top of it changes quietly. That is where many AI analytics projects fail. The model generates SQL against a table that exists, the query returns rows, and the answer looks plausible. The problem is that the answer used the wrong grain, the wrong filter, or the wrong metric definition.&lt;/p&gt;
&lt;p&gt;The fix is not a longer prompt. The fix is stronger data contracts. Certified semantic views should be easier for agents to use than raw tables. Sensitive columns should be masked or hidden before the model can ask for them. Write-capable tools should require intent, validation, and idempotency. Expensive queries should have limits. Every tool call should leave evidence.&lt;/p&gt;
&lt;p&gt;This is also where vendor-neutral thinking helps. Do not trust a platform because it has the best demo. Trust the platform when it gives you clear contracts between storage, catalog, semantic layer, engine, and agent. Trust it more when you can test those contracts with another engine or another client.&lt;/p&gt;
&lt;h2&gt;What I would do first&lt;/h2&gt;
&lt;p&gt;Start with one production-shaped workflow. Do not start with the easiest toy table, and do not start with the most politically sensitive workload. Pick a table or semantic view that matters, has an owner, has known correctness checks, and can tolerate a controlled pilot.&lt;/p&gt;
&lt;p&gt;For agent identities lakehouse, I would write down five things before touching production: the owner, the accepted engines, the policy boundary, the rollback path, and the agent-facing interface. Then I would run the same workflow three ways: manually, through the intended query engine, and through the agent or automation layer. Differences between those paths are where the real work begins.&lt;/p&gt;
&lt;p&gt;Measure boring things. Count files. Count snapshots. Track query planning time. Track storage calls. Track failed commits. Track token issuance. Track denied access. Track whether a human can explain the result without reading tool logs for an hour. These metrics are not glamorous, but they tell you whether the architecture is ready.&lt;/p&gt;
&lt;h2&gt;Final recommendation&lt;/h2&gt;
&lt;p&gt;The right conclusion is not that every team should adopt every June 2026 feature immediately. The right conclusion is that the lakehouse is becoming an execution surface for humans and agents, and that changes the quality bar. Open storage is necessary. Governed catalogs are necessary. Semantic context is necessary. Fast SQL is necessary. Scoped agent tools are necessary.&lt;/p&gt;
&lt;p&gt;That combination is exactly why the Agentic Lakehouse is becoming the right framing. It describes the platform you need when AI agents stop answering isolated questions and start participating in analytical workflows.&lt;/p&gt;
&lt;p&gt;For more background on the lakehouse and AI side of this work, explore my books on data lakehouses and AI at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;. If you want to try this style of governed, open, agent-ready architecture in practice, start a free trial of Dremio&apos;s Agentic Lakehouse at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Field notes for teams evaluating this now&lt;/h2&gt;
&lt;p&gt;First, make compatibility visible. A table-format version, catalog endpoint, and engine release should appear in your runbook. If a production issue happens, nobody should have to guess which engine wrote the latest snapshot or which client introduced a metadata change.&lt;/p&gt;
&lt;p&gt;Second, keep the semantic layer close to the workflow. If the article topic affects analytics agents, customer-facing metrics, financial reporting, or regulated data, raw-table access should be the exception. Certified views should be the normal path.&lt;/p&gt;
&lt;p&gt;Third, separate experimentation from certification. Engineers need sandboxes where they can test new Iceberg features, catalog options, and agent tools. Business users and agents need certified surfaces where definitions, owners, and policies have already been reviewed.&lt;/p&gt;
&lt;p&gt;Fourth, keep the architecture open. Not every byte must move into one platform. An architecture that can query data in place, add semantic context, accelerate common workloads, and expose governed agent interfaces over open data creates more flexibility.&lt;/p&gt;
&lt;p&gt;Fifth, publish the limits. If a feature is read-only in one engine, say so. If write interoperability is approved only for append workloads, say so. If remote signing is required for regulated tables, say so. Clear limits create trust. Hidden limits create incidents.&lt;/p&gt;
&lt;h2&gt;Identity and access review&lt;/h2&gt;
&lt;p&gt;For agent identities lakehouse, I would run one full dry run with production-like identities. Use an analyst identity, a service account, and the intended agent identity. Confirm that each identity sees only the expected semantic objects, receives predictable errors, and leaves useful audit records. That test catches policy gaps before they become production incidents.&lt;/p&gt;
&lt;p&gt;The agent identity matters most because it is easy to over-permission during a pilot. If the agent only needs a certified revenue view, do not give it namespace-wide table discovery. If the agent needs row-level access for one geography, test that a second geography returns a denial instead of silent leakage.&lt;/p&gt;
&lt;h2&gt;Documentation that actually helps&lt;/h2&gt;
&lt;p&gt;The documentation should fit on one page. Name the owner, the supported engines, the catalog authority, the accepted table operations, the security model, and the rollback path. If a new engineer cannot understand the contract for agent identities lakehouse from that page, the architecture is still too implicit.&lt;/p&gt;
&lt;p&gt;Good documentation is not a wiki dump. It is an operating contract. It should say who can approve a schema change, which engine owns compaction, how long snapshots are retained, and what happens when an agent produces a suspicious result. That level of detail is what turns a promising pattern into a maintainable system.&lt;/p&gt;
&lt;h2&gt;How to keep agents in bounds&lt;/h2&gt;
&lt;p&gt;Agents should not receive broad table access just because a human can ask broad questions. For agent identities lakehouse, expose narrow tools over certified views first. Add write-capable tools only after you have validation rules, idempotency keys, approval gates, and audit records that a reviewer can follow.&lt;/p&gt;
&lt;p&gt;The tool description should also be honest. If a tool returns estimated data, say estimated. If a tool excludes delayed transactions, say that. If a tool is read-only, make that clear in the name and policy. Agents work better when the interface gives them fewer chances to infer the wrong contract.&lt;/p&gt;
&lt;h2&gt;What to measure after launch&lt;/h2&gt;
&lt;p&gt;The first production month should be measurement-heavy. Track planning time, query latency, failed commits, denied access attempts, credential issuance, snapshot growth, and semantic-view usage. If agent identities lakehouse is helping, the evidence should show up in fewer manual workarounds and clearer operational ownership.&lt;/p&gt;
&lt;p&gt;I would also track human trust signals. Are analysts using the certified view more often? Are engineers filing fewer tickets about unclear table ownership? Are agents producing answers that reviewers can trace back to approved definitions? Those signals tell you whether the architecture is improving daily work, not just passing a benchmark.&lt;/p&gt;
&lt;h2&gt;A buyer question worth asking&lt;/h2&gt;
&lt;p&gt;The buyer question is simple: does this pattern increase choice without weakening governance? For agent identities lakehouse, the best answer is specific. It should name the table format, catalog contract, semantic surface, security controls, and engine support matrix. Anything less is a demo, not an operating model.&lt;/p&gt;
&lt;p&gt;This is where the architecture should stay disciplined. The point is not that open architecture is automatically better. The point is that open architecture gives you room to test engines, keep data in place, add semantic context, and still maintain control. That is a stronger argument than a generic platform claim.&lt;/p&gt;
&lt;h2&gt;A realistic rollout sequence&lt;/h2&gt;
&lt;p&gt;The rollout should start with read visibility, then move to operational automation, then consider action loops. For agent identities lakehouse, the first milestone is a certified read path with approved semantics. The second milestone is repeatable validation through CI or scheduled checks. The third milestone is agent access with narrow tools and strict audit.&lt;/p&gt;
&lt;p&gt;Write paths should come later unless the topic itself is about write interoperability or table maintenance. Even then, begin with append-only or isolated writes. Updates, deletes, merges, and external actions need stronger controls because they change the state other people depend on.&lt;/p&gt;
&lt;h2&gt;How this should sound to executives&lt;/h2&gt;
&lt;p&gt;The executive version should avoid implementation trivia, but it should not become vague. Say that agent identities lakehouse helps the company keep analytical data open, governed, and ready for AI-assisted work. Then say what the team will measure: cost, speed, correctness, access control, and operational effort.&lt;/p&gt;
&lt;p&gt;That framing is useful because executives do not need every catalog detail. They do need to know whether the architecture reduces lock-in, improves reliability, and gives agents a trustworthy data foundation. Those are business outcomes tied to technical choices.&lt;/p&gt;
&lt;h2&gt;How this should sound to engineers&lt;/h2&gt;
&lt;p&gt;The engineering version should be blunt. Which APIs are used? Which engine versions are approved? Which table operations are allowed? Which failures are retried? Which failures stop the workflow? Which logs prove that the right identity performed the right operation?&lt;/p&gt;
&lt;p&gt;For agent identities lakehouse, those questions are more valuable than broad claims. They force the team to define the boundary between the open standard, the vendor implementation, the query engine, the semantic model, and the agent tool.&lt;/p&gt;
&lt;h2&gt;What not to automate yet&lt;/h2&gt;
&lt;p&gt;Do not automate the parts of agent identities lakehouse that the team cannot explain manually. If nobody can explain the metric, the agent should not calculate it. If nobody can explain rollback, the agent should not write. If nobody can explain the security boundary, the tool should stay internal.&lt;/p&gt;
&lt;p&gt;This is not anti-automation. It is how automation earns trust. Automate the parts with clear contracts first, then widen the scope as evidence accumulates.&lt;/p&gt;
&lt;h2&gt;Source-of-truth ownership&lt;/h2&gt;
&lt;p&gt;Every production rollout needs one named source of truth for each layer. The table has an owner. The catalog has an owner. The semantic view has an owner. The agent tool has an owner. For agent identities lakehouse, those owners may sit on different teams, but the contract between them has to be explicit.&lt;/p&gt;
&lt;p&gt;Clear ownership across all layers keeps the architecture credible, whether the governed execution and semantic layer lives in one platform or across several independent services.&lt;/p&gt;
&lt;p&gt;Clear ownership prevents avoidable production confusion.&lt;/p&gt;
&lt;h2&gt;Review cadence&lt;/h2&gt;
&lt;p&gt;Set a review cadence before the first production launch. For agent identities lakehouse, I would review the contract after the first week, after the first month, and after the first engine or catalog upgrade. Most problems appear when a workflow that worked in a pilot meets a new version, a new identity, or a new business definition.&lt;/p&gt;
&lt;p&gt;That review should include both platform engineers and business owners. Engineers can verify the mechanics. Business owners can verify that the answers still mean what the company thinks they mean.&lt;/p&gt;
&lt;h2&gt;Launch criteria&lt;/h2&gt;
&lt;p&gt;The launch criteria should be binary. Either agent identities lakehouse has a named owner, passing validation checks, approved security boundaries, working rollback, and documented engine support, or it is not ready. Gray areas are acceptable in a research project. They are expensive in production.&lt;/p&gt;
&lt;p&gt;This keeps the article&apos;s recommendation practical: prove the contract first, then widen adoption.&lt;/p&gt;
&lt;h2&gt;Compliance evidence&lt;/h2&gt;
&lt;p&gt;Save the evidence. For agent identities lakehouse, keep validation output, approval records, denied-access tests, and rollback proof with the release notes. Future audits are easier when the team can show what it tested before launch.&lt;/p&gt;
&lt;h2&gt;Test cases that matter&lt;/h2&gt;
&lt;p&gt;Use test cases that reflect real business questions. For agent identities lakehouse, include at least one happy path, one denied-access path, one stale-data path, and one rollback path. Those tests reveal more than a generic demo query.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Bidirectional Iceberg Writes with Horizon Catalog</title><link>https://iceberglakehouse.com/posts/snowflake-horizon-catalog-bidirectional-iceberg-writes/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/snowflake-horizon-catalog-bidirectional-iceberg-writes/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-snowflake-horizon-catalog-bidire...</description><pubDate>Mon, 08 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-snowflake-horizon-catalog-bidirectional-iceberg-writes/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Bidirectional Iceberg interoperability changes managed Iceberg from a read surface into a shared write contract. That is the useful lens for bidirectional Iceberg interoperability in June 2026. The market is not short on announcements. What matters is whether the new pattern changes ownership, performance, governance, and agent readiness in a way your team can operate.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/snowflake-horizon-catalog-bidirectional-iceberg-writes-diagram-1.png&quot; alt=&quot;bidirectional Iceberg interoperability architecture diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The market signal behind bidirectional Iceberg interoperability&lt;/h2&gt;
&lt;p&gt;Snowflake&apos;s Horizon Catalog announcement matters because external engines historically had limited ability to operate on Snowflake-managed data. Polaris-backed catalog access changes that conversation by giving engines a standards-based path to read and write governed Iceberg tables.&lt;/p&gt;
&lt;p&gt;I care about this topic because it sits at the boundary between open data architecture and AI execution. Most companies are not choosing one engine for every workload anymore. They have warehouses, lakehouse engines, streaming systems, catalogs, metadata platforms, and now agents that ask for data through tools. The shared contract between those systems matters more than any single feature checkbox.&lt;/p&gt;
&lt;p&gt;The vendor-neutral reading is straightforward. If the underlying table and catalog standards get stronger, buyers get more freedom to choose the right engine for each job. Snowflake, Microsoft, ClickHouse, Atlan, Dremio, and the open-source Iceberg ecosystem all point to the same market reality: data platforms are becoming multi-engine and agent-facing.&lt;/p&gt;
&lt;h2&gt;How the architecture works&lt;/h2&gt;
&lt;p&gt;The catalog becomes the shared authority. Spark, Flink, Snowflake, and other engines need one place to resolve metadata, commit snapshots, and enforce table-level rules.&lt;/p&gt;
&lt;p&gt;Apache Polaris matters because it gives the market an open catalog implementation tied to the Iceberg REST catalog pattern instead of a private API.&lt;/p&gt;
&lt;p&gt;Bidirectional access is more than file visibility. It requires commit coordination, table version compatibility, conflict handling, and consistent security semantics.&lt;/p&gt;
&lt;p&gt;The important architectural habit is to separate responsibilities. The table format manages files, snapshots, schema evolution, and table metadata. The catalog manages identity, namespaces, commits, and access patterns. The query engine plans and executes work. The semantic layer maps raw data into business meaning. The agent interface decides which safe tools a model can call.&lt;/p&gt;
&lt;p&gt;That separation keeps the system honest. If a vendor says a workload is open, ask which layer is open. If a feature supports Iceberg, ask which Iceberg version, which operations, and which engines. If an agent can query data, ask whether it is querying raw tables or certified semantic views.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/snowflake-horizon-catalog-bidirectional-iceberg-writes-diagram-2.png&quot; alt=&quot;Operating model diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;A concrete operating example&lt;/h2&gt;
&lt;p&gt;A streaming team may want Flink to write curated events while finance analysts query the same managed Iceberg table from Snowflake. The value is not that both systems can see Parquet. The value is that both systems can operate through a catalog contract.&lt;/p&gt;
&lt;p&gt;That example is intentionally operational. Architecture diagrams are useful, but the design only proves itself when a real workload runs through it. I want to know who owns the table, which catalog authorizes the operation, which engine writes, which engine reads, which semantic view users see, and how the team detects a bad result.&lt;/p&gt;
&lt;p&gt;For agentic analytics, the same example gets stricter. A human analyst can notice ambiguity and ask a teammate. An agent will often keep going unless the tool interface stops it. That means your architecture needs approved definitions, scoped access, query limits, logging, and a clean rollback path before it needs a flashy chat experience.&lt;/p&gt;
&lt;p&gt;This is why I do not treat open table formats as the whole story. Apache Iceberg gives the platform a strong storage contract. It does not, by itself, define customer lifetime value, revenue recognition rules, data owner approval, or what an AI agent may do after it finds an anomaly. Those rules belong in catalogs, semantic layers, governance systems, and agent tools.&lt;/p&gt;
&lt;h2&gt;What this means for the lakehouse&lt;/h2&gt;
&lt;p&gt;The broader market is validating the premise that open tables and open catalogs matter. An open ecosystem with federation, semantic views, Reflections, and AI interfaces across data that does not have to be copied into one warehouse is a stronger architecture than a single-vendor lock-in.&lt;/p&gt;
&lt;p&gt;A lakehouse platform needs five capabilities to serve agents reliably: query federation to reduce data movement; autonomous performance using Reflections, caching, and table optimization so interactive loops stay fast; an AI Semantic Layer that gives agents approved business context; agentic interfaces through the UI, Python, or MCP-connected tools; and AI SQL functions that bring model-assisted work into SQL without exporting data.&lt;/p&gt;
&lt;h2&gt;Implementation checklist&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;What to document&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Table contract&lt;/td&gt;
&lt;td&gt;Format version, schema rules, snapshot policy, and rollback plan&lt;/td&gt;
&lt;td&gt;Engines need the same understanding of the table.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalog authority&lt;/td&gt;
&lt;td&gt;Production catalog, namespaces, commit rules, and role model&lt;/td&gt;
&lt;td&gt;Multi-engine systems need one source of table truth.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine matrix&lt;/td&gt;
&lt;td&gt;Read, write, merge, delete, schema, and view support by engine&lt;/td&gt;
&lt;td&gt;A feature is not production-ready until the exact operation is tested.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic layer&lt;/td&gt;
&lt;td&gt;Certified views, metric definitions, owners, and labels&lt;/td&gt;
&lt;td&gt;Agents need business meaning, not raw schemas alone.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Credential model, token lifetime, row filters, column masks, and audit logs&lt;/td&gt;
&lt;td&gt;Open access still needs strict governance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations&lt;/td&gt;
&lt;td&gt;Compaction, vacuum, retries, alerting, and incident ownership&lt;/td&gt;
&lt;td&gt;The design must survive failed jobs and bad deploys.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;My practical checklist for this topic is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create an engine support matrix for read, append, merge, delete, schema evolution, and rollback.&lt;/li&gt;
&lt;li&gt;Use a single catalog authority for production writes.&lt;/li&gt;
&lt;li&gt;Run concurrent commit tests before allowing multiple writers.&lt;/li&gt;
&lt;li&gt;Document which engine owns compaction and table optimization.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those items are not written down, the project is still in the demo stage. That does not mean the idea is weak. It means the operating model is not finished.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/snowflake-horizon-catalog-bidirectional-iceberg-writes-diagram-3.png&quot; alt=&quot;Implementation checklist diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Failure modes worth respecting&lt;/h2&gt;
&lt;p&gt;Shared writes raise the cost of sloppy governance. One poorly configured engine can write incompatible metadata, fail commits under load, or bypass expectations that another engine assumes. Compatibility testing becomes a release-management practice.&lt;/p&gt;
&lt;p&gt;The other failure mode is semantic drift. A table can be technically valid while the business definition on top of it changes quietly. That is where many AI analytics projects fail. The model generates SQL against a table that exists, the query returns rows, and the answer looks plausible. The problem is that the answer used the wrong grain, the wrong filter, or the wrong metric definition.&lt;/p&gt;
&lt;p&gt;The fix is not a longer prompt. The fix is stronger data contracts. Certified semantic views should be easier for agents to use than raw tables. Sensitive columns should be masked or hidden before the model can ask for them. Write-capable tools should require intent, validation, and idempotency. Expensive queries should have limits. Every tool call should leave evidence.&lt;/p&gt;
&lt;p&gt;This is also where vendor-neutral thinking helps. Do not trust a platform because it has the best demo. Trust the platform when it gives you clear contracts between storage, catalog, semantic layer, engine, and agent. Trust it more when you can test those contracts with another engine or another client.&lt;/p&gt;
&lt;h2&gt;What I would do first&lt;/h2&gt;
&lt;p&gt;Start with one production-shaped workflow. Do not start with the easiest toy table, and do not start with the most politically sensitive workload. Pick a table or semantic view that matters, has an owner, has known correctness checks, and can tolerate a controlled pilot.&lt;/p&gt;
&lt;p&gt;For bidirectional Iceberg interoperability, I would write down five things before touching production: the owner, the accepted engines, the policy boundary, the rollback path, and the agent-facing interface. Then I would run the same workflow three ways: manually, through the intended query engine, and through the agent or automation layer. Differences between those paths are where the real work begins.&lt;/p&gt;
&lt;p&gt;Measure boring things. Count files. Count snapshots. Track query planning time. Track storage calls. Track failed commits. Track token issuance. Track denied access. Track whether a human can explain the result without reading tool logs for an hour. These metrics are not glamorous, but they tell you whether the architecture is ready.&lt;/p&gt;
&lt;h2&gt;Final recommendation&lt;/h2&gt;
&lt;p&gt;The right conclusion is not that every team should adopt every June 2026 feature immediately. The right conclusion is that the lakehouse is becoming an execution surface for humans and agents, and that changes the quality bar. Open storage is necessary. Governed catalogs are necessary. Semantic context is necessary. Fast SQL is necessary. Scoped agent tools are necessary.&lt;/p&gt;
&lt;p&gt;That combination is exactly why the Agentic Lakehouse is becoming the right framing. It describes the platform you need when AI agents stop answering isolated questions and start participating in analytical workflows.&lt;/p&gt;
&lt;p&gt;For more background on the lakehouse and AI side of this work, explore my books on data lakehouses and AI at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;. If you want to try this style of governed, open, agent-ready architecture in practice, start a free trial of Dremio&apos;s Agentic Lakehouse at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Field notes for teams evaluating this now&lt;/h2&gt;
&lt;p&gt;First, make compatibility visible. A table-format version, catalog endpoint, and engine release should appear in your runbook. If a production issue happens, nobody should have to guess which engine wrote the latest snapshot or which client introduced a metadata change.&lt;/p&gt;
&lt;p&gt;Second, keep the semantic layer close to the workflow. If the article topic affects analytics agents, customer-facing metrics, financial reporting, or regulated data, raw-table access should be the exception. Certified views should be the normal path.&lt;/p&gt;
&lt;p&gt;Third, separate experimentation from certification. Engineers need sandboxes where they can test new Iceberg features, catalog options, and agent tools. Business users and agents need certified surfaces where definitions, owners, and policies have already been reviewed.&lt;/p&gt;
&lt;p&gt;Fourth, keep the architecture open. Not every byte must move into one platform. An architecture that can query data in place, add semantic context, accelerate common workloads, and expose governed agent interfaces over open data creates more flexibility.&lt;/p&gt;
&lt;p&gt;Fifth, publish the limits. If a feature is read-only in one engine, say so. If write interoperability is approved only for append workloads, say so. If remote signing is required for regulated tables, say so. Clear limits create trust. Hidden limits create incidents.&lt;/p&gt;
&lt;h2&gt;Identity and access review&lt;/h2&gt;
&lt;p&gt;For bidirectional Iceberg interoperability, I would run one full dry run with production-like identities. Use an analyst identity, a service account, and the intended agent identity. Confirm that each identity sees only the expected semantic objects, receives predictable errors, and leaves useful audit records. That test catches policy gaps before they become production incidents.&lt;/p&gt;
&lt;p&gt;The agent identity matters most because it is easy to over-permission during a pilot. If the agent only needs a certified revenue view, do not give it namespace-wide table discovery. If the agent needs row-level access for one geography, test that a second geography returns a denial instead of silent leakage.&lt;/p&gt;
&lt;h2&gt;Documentation that actually helps&lt;/h2&gt;
&lt;p&gt;The documentation should fit on one page. Name the owner, the supported engines, the catalog authority, the accepted table operations, the security model, and the rollback path. If a new engineer cannot understand the contract for bidirectional Iceberg interoperability from that page, the architecture is still too implicit.&lt;/p&gt;
&lt;p&gt;Good documentation is not a wiki dump. It is an operating contract. It should say who can approve a schema change, which engine owns compaction, how long snapshots are retained, and what happens when an agent produces a suspicious result. That level of detail is what turns a promising pattern into a maintainable system.&lt;/p&gt;
&lt;h2&gt;How to keep agents in bounds&lt;/h2&gt;
&lt;p&gt;Agents should not receive broad table access just because a human can ask broad questions. For bidirectional Iceberg interoperability, expose narrow tools over certified views first. Add write-capable tools only after you have validation rules, idempotency keys, approval gates, and audit records that a reviewer can follow.&lt;/p&gt;
&lt;p&gt;The tool description should also be honest. If a tool returns estimated data, say estimated. If a tool excludes delayed transactions, say that. If a tool is read-only, make that clear in the name and policy. Agents work better when the interface gives them fewer chances to infer the wrong contract.&lt;/p&gt;
&lt;h2&gt;What to measure after launch&lt;/h2&gt;
&lt;p&gt;The first production month should be measurement-heavy. Track planning time, query latency, failed commits, denied access attempts, credential issuance, snapshot growth, and semantic-view usage. If bidirectional Iceberg interoperability is helping, the evidence should show up in fewer manual workarounds and clearer operational ownership.&lt;/p&gt;
&lt;p&gt;I would also track human trust signals. Are analysts using the certified view more often? Are engineers filing fewer tickets about unclear table ownership? Are agents producing answers that reviewers can trace back to approved definitions? Those signals tell you whether the architecture is improving daily work, not just passing a benchmark.&lt;/p&gt;
&lt;h2&gt;A buyer question worth asking&lt;/h2&gt;
&lt;p&gt;The buyer question is simple: does this pattern increase choice without weakening governance? For bidirectional Iceberg interoperability, the best answer is specific. It should name the table format, catalog contract, semantic surface, security controls, and engine support matrix. Anything less is a demo, not an operating model.&lt;/p&gt;
&lt;p&gt;This is where the architecture should stay disciplined. The point is not that open architecture is automatically better. The point is that open architecture gives you room to test engines, keep data in place, add semantic context, and still maintain control. That is a stronger argument than a generic platform claim.&lt;/p&gt;
&lt;h2&gt;A realistic rollout sequence&lt;/h2&gt;
&lt;p&gt;The rollout should start with read visibility, then move to operational automation, then consider action loops. For bidirectional Iceberg interoperability, the first milestone is a certified read path with approved semantics. The second milestone is repeatable validation through CI or scheduled checks. The third milestone is agent access with narrow tools and strict audit.&lt;/p&gt;
&lt;p&gt;Write paths should come later unless the topic itself is about write interoperability or table maintenance. Even then, begin with append-only or isolated writes. Updates, deletes, merges, and external actions need stronger controls because they change the state other people depend on.&lt;/p&gt;
&lt;h2&gt;How this should sound to executives&lt;/h2&gt;
&lt;p&gt;The executive version should avoid implementation trivia, but it should not become vague. Say that bidirectional Iceberg interoperability helps the company keep analytical data open, governed, and ready for AI-assisted work. Then say what the team will measure: cost, speed, correctness, access control, and operational effort.&lt;/p&gt;
&lt;p&gt;That framing is useful because executives do not need every catalog detail. They do need to know whether the architecture reduces lock-in, improves reliability, and gives agents a trustworthy data foundation. Those are business outcomes tied to technical choices.&lt;/p&gt;
&lt;h2&gt;How this should sound to engineers&lt;/h2&gt;
&lt;p&gt;The engineering version should be blunt. Which APIs are used? Which engine versions are approved? Which table operations are allowed? Which failures are retried? Which failures stop the workflow? Which logs prove that the right identity performed the right operation?&lt;/p&gt;
&lt;p&gt;For bidirectional Iceberg interoperability, those questions are more valuable than broad claims. They force the team to define the boundary between the open standard, the vendor implementation, the query engine, the semantic model, and the agent tool.&lt;/p&gt;
&lt;h2&gt;What not to automate yet&lt;/h2&gt;
&lt;p&gt;Do not automate the parts of bidirectional Iceberg interoperability that the team cannot explain manually. If nobody can explain the metric, the agent should not calculate it. If nobody can explain rollback, the agent should not write. If nobody can explain the security boundary, the tool should stay internal.&lt;/p&gt;
&lt;p&gt;This is not anti-automation. It is how automation earns trust. Automate the parts with clear contracts first, then widen the scope as evidence accumulates.&lt;/p&gt;
&lt;h2&gt;Source-of-truth ownership&lt;/h2&gt;
&lt;p&gt;Every production rollout needs one named source of truth for each layer. The table has an owner. The catalog has an owner. The semantic view has an owner. The agent tool has an owner. For bidirectional Iceberg interoperability, those owners may sit on different teams, but the contract between them has to be explicit.&lt;/p&gt;
&lt;p&gt;Clear ownership across all layers keeps the architecture credible, whether the governed execution and semantic layer lives in one platform or across several independent services.&lt;/p&gt;
&lt;p&gt;Clear ownership prevents avoidable production confusion.&lt;/p&gt;
&lt;h2&gt;Review cadence&lt;/h2&gt;
&lt;p&gt;Set a review cadence before the first production launch. For bidirectional Iceberg interoperability, I would review the contract after the first week, after the first month, and after the first engine or catalog upgrade. Most problems appear when a workflow that worked in a pilot meets a new version, a new identity, or a new business definition.&lt;/p&gt;
&lt;p&gt;That review should include both platform engineers and business owners. Engineers can verify the mechanics. Business owners can verify that the answers still mean what the company thinks they mean.&lt;/p&gt;
&lt;h2&gt;Launch criteria&lt;/h2&gt;
&lt;p&gt;The launch criteria should be binary. Either bidirectional Iceberg interoperability has a named owner, passing validation checks, approved security boundaries, working rollback, and documented engine support, or it is not ready. Gray areas are acceptable in a research project. They are expensive in production.&lt;/p&gt;
&lt;p&gt;This keeps the article&apos;s recommendation practical: prove the contract first, then widen adoption.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Semantic View Autopilot in Snowflake Semantic Studio</title><link>https://iceberglakehouse.com/posts/snowflake-semantic-view-autopilot-business-logic/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/snowflake-semantic-view-autopilot-business-logic/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-snowflake-semantic-view-autopilo...</description><pubDate>Mon, 08 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-snowflake-semantic-view-autopilot-business-logic/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Autopilot can draft semantic views quickly, but production semantics still need human review, tests, and governance. That is the useful lens for Semantic View Autopilot in June 2026. The market is not short on announcements. What matters is whether the new pattern changes ownership, performance, governance, and agent readiness in a way your team can operate.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/snowflake-semantic-view-autopilot-business-logic-diagram-1.png&quot; alt=&quot;Semantic View Autopilot architecture diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The market signal behind Semantic View Autopilot&lt;/h2&gt;
&lt;p&gt;Snowflake&apos;s Semantic Studio direction recognizes that AI agents need more than table names. They need relationships, metrics, dimensions, synonyms, and security boundaries. Autopilot is useful because the blank page is often the hardest part of semantic modeling.&lt;/p&gt;
&lt;p&gt;I care about this topic because it sits at the boundary between open data architecture and AI execution. Most companies are not choosing one engine for every workload anymore. They have warehouses, lakehouse engines, streaming systems, catalogs, metadata platforms, and now agents that ask for data through tools. The shared contract between those systems matters more than any single feature checkbox.&lt;/p&gt;
&lt;p&gt;The vendor-neutral reading is straightforward. If the underlying table and catalog standards get stronger, buyers get more freedom to choose the right engine for each job. Snowflake, Microsoft, ClickHouse, Atlan, Dremio, and the open-source Iceberg ecosystem all point to the same market reality: data platforms are becoming multi-engine and agent-facing.&lt;/p&gt;
&lt;h2&gt;How the architecture works&lt;/h2&gt;
&lt;p&gt;Autopilot-style tooling can inspect schemas, infer relationships, and propose semantic objects.&lt;/p&gt;
&lt;p&gt;The generated view should be reviewed like generated code, with owners and tests.&lt;/p&gt;
&lt;p&gt;Security boundaries need explicit masks, filters, and scopes. Inferred semantics should not become production policy automatically.&lt;/p&gt;
&lt;p&gt;The important architectural habit is to separate responsibilities. The table format manages files, snapshots, schema evolution, and table metadata. The catalog manages identity, namespaces, commits, and access patterns. The query engine plans and executes work. The semantic layer maps raw data into business meaning. The agent interface decides which safe tools a model can call.&lt;/p&gt;
&lt;p&gt;That separation keeps the system honest. If a vendor says a workload is open, ask which layer is open. If a feature supports Iceberg, ask which Iceberg version, which operations, and which engines. If an agent can query data, ask whether it is querying raw tables or certified semantic views.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/snowflake-semantic-view-autopilot-business-logic-diagram-2.png&quot; alt=&quot;Operating model diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;A concrete operating example&lt;/h2&gt;
&lt;p&gt;Autopilot may infer that &lt;code&gt;orders.customer_id&lt;/code&gt; joins to &lt;code&gt;customers.id&lt;/code&gt;. That is helpful. It still cannot know whether enterprise customers should include subsidiaries, whether test accounts are excluded, or whether a support user may see lifetime value.&lt;/p&gt;
&lt;p&gt;That example is intentionally operational. Architecture diagrams are useful, but the design only proves itself when a real workload runs through it. I want to know who owns the table, which catalog authorizes the operation, which engine writes, which engine reads, which semantic view users see, and how the team detects a bad result.&lt;/p&gt;
&lt;p&gt;For agentic analytics, the same example gets stricter. A human analyst can notice ambiguity and ask a teammate. An agent will often keep going unless the tool interface stops it. That means your architecture needs approved definitions, scoped access, query limits, logging, and a clean rollback path before it needs a flashy chat experience.&lt;/p&gt;
&lt;p&gt;This is why I do not treat open table formats as the whole story. Apache Iceberg gives the platform a strong storage contract. It does not, by itself, define customer lifetime value, revenue recognition rules, data owner approval, or what an AI agent may do after it finds an anomaly. Those rules belong in catalogs, semantic layers, governance systems, and agent tools.&lt;/p&gt;
&lt;h2&gt;What this means for the lakehouse&lt;/h2&gt;
&lt;p&gt;A lakehouse platform needs five capabilities to serve agents reliably: query federation to reduce data movement; autonomous performance using Reflections, caching, and table optimization so interactive loops stay fast; an AI Semantic Layer that gives agents approved business context; agentic interfaces through the UI, Python, or MCP-connected tools; and AI SQL functions that bring model-assisted work into SQL without exporting data.&lt;/p&gt;
&lt;h2&gt;Implementation checklist&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;What to document&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Table contract&lt;/td&gt;
&lt;td&gt;Format version, schema rules, snapshot policy, and rollback plan&lt;/td&gt;
&lt;td&gt;Engines need the same understanding of the table.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalog authority&lt;/td&gt;
&lt;td&gt;Production catalog, namespaces, commit rules, and role model&lt;/td&gt;
&lt;td&gt;Multi-engine systems need one source of table truth.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine matrix&lt;/td&gt;
&lt;td&gt;Read, write, merge, delete, schema, and view support by engine&lt;/td&gt;
&lt;td&gt;A feature is not production-ready until the exact operation is tested.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic layer&lt;/td&gt;
&lt;td&gt;Certified views, metric definitions, owners, and labels&lt;/td&gt;
&lt;td&gt;Agents need business meaning, not raw schemas alone.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Credential model, token lifetime, row filters, column masks, and audit logs&lt;/td&gt;
&lt;td&gt;Open access still needs strict governance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations&lt;/td&gt;
&lt;td&gt;Compaction, vacuum, retries, alerting, and incident ownership&lt;/td&gt;
&lt;td&gt;The design must survive failed jobs and bad deploys.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;My practical checklist for this topic is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Accept generated semantic views only through pull-request style review.&lt;/li&gt;
&lt;li&gt;Add data tests for grain, joins, metric totals, and restricted columns.&lt;/li&gt;
&lt;li&gt;Separate draft semantic objects from certified objects.&lt;/li&gt;
&lt;li&gt;Review every agent-facing semantic view with the data owner.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those items are not written down, the project is still in the demo stage. That does not mean the idea is weak. It means the operating model is not finished.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/snowflake-semantic-view-autopilot-business-logic-diagram-3.png&quot; alt=&quot;Implementation checklist diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Failure modes worth respecting&lt;/h2&gt;
&lt;p&gt;The danger is semantic overconfidence. A generated relationship can look correct while encoding the wrong business grain.&lt;/p&gt;
&lt;p&gt;The other failure mode is semantic drift. A table can be technically valid while the business definition on top of it changes quietly. That is where many AI analytics projects fail. The model generates SQL against a table that exists, the query returns rows, and the answer looks plausible. The problem is that the answer used the wrong grain, the wrong filter, or the wrong metric definition.&lt;/p&gt;
&lt;p&gt;The fix is not a longer prompt. The fix is stronger data contracts. Certified semantic views should be easier for agents to use than raw tables. Sensitive columns should be masked or hidden before the model can ask for them. Write-capable tools should require intent, validation, and idempotency. Expensive queries should have limits. Every tool call should leave evidence.&lt;/p&gt;
&lt;p&gt;This is also where vendor-neutral thinking helps. Do not trust a platform because it has the best demo. Trust the platform when it gives you clear contracts between storage, catalog, semantic layer, engine, and agent. Trust it more when you can test those contracts with another engine or another client.&lt;/p&gt;
&lt;h2&gt;What I would do first&lt;/h2&gt;
&lt;p&gt;Start with one production-shaped workflow. Do not start with the easiest toy table, and do not start with the most politically sensitive workload. Pick a table or semantic view that matters, has an owner, has known correctness checks, and can tolerate a controlled pilot.&lt;/p&gt;
&lt;p&gt;For Semantic View Autopilot, I would write down five things before touching production: the owner, the accepted engines, the policy boundary, the rollback path, and the agent-facing interface. Then I would run the same workflow three ways: manually, through the intended query engine, and through the agent or automation layer. Differences between those paths are where the real work begins.&lt;/p&gt;
&lt;p&gt;Measure boring things. Count files. Count snapshots. Track query planning time. Track storage calls. Track failed commits. Track token issuance. Track denied access. Track whether a human can explain the result without reading tool logs for an hour. These metrics are not glamorous, but they tell you whether the architecture is ready.&lt;/p&gt;
&lt;h2&gt;Final recommendation&lt;/h2&gt;
&lt;p&gt;The right conclusion is not that every team should adopt every June 2026 feature immediately. The right conclusion is that the lakehouse is becoming an execution surface for humans and agents, and that changes the quality bar. Open storage is necessary. Governed catalogs are necessary. Semantic context is necessary. Fast SQL is necessary. Scoped agent tools are necessary.&lt;/p&gt;
&lt;p&gt;That combination is exactly why the Agentic Lakehouse is becoming the right framing. It describes the platform you need when AI agents stop answering isolated questions and start participating in analytical workflows.&lt;/p&gt;
&lt;p&gt;For more background on the lakehouse and AI side of this work, explore my books on data lakehouses and AI at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;. If you want to try this style of governed, open, agent-ready architecture in practice, start a free trial of Dremio&apos;s Agentic Lakehouse at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Field notes for teams evaluating this now&lt;/h2&gt;
&lt;p&gt;First, make compatibility visible. A table-format version, catalog endpoint, and engine release should appear in your runbook. If a production issue happens, nobody should have to guess which engine wrote the latest snapshot or which client introduced a metadata change.&lt;/p&gt;
&lt;p&gt;Second, keep the semantic layer close to the workflow. If the article topic affects analytics agents, customer-facing metrics, financial reporting, or regulated data, raw-table access should be the exception. Certified views should be the normal path.&lt;/p&gt;
&lt;p&gt;Third, separate experimentation from certification. Engineers need sandboxes where they can test new Iceberg features, catalog options, and agent tools. Business users and agents need certified surfaces where definitions, owners, and policies have already been reviewed.&lt;/p&gt;
&lt;p&gt;Fourth, keep the architecture open. Not every byte must move into one platform. An architecture that can query data in place, add semantic context, accelerate common workloads, and expose governed agent interfaces over open data creates more flexibility.&lt;/p&gt;
&lt;p&gt;Fifth, publish the limits. If a feature is read-only in one engine, say so. If write interoperability is approved only for append workloads, say so. If remote signing is required for regulated tables, say so. Clear limits create trust. Hidden limits create incidents.&lt;/p&gt;
&lt;h2&gt;Identity and access review&lt;/h2&gt;
&lt;p&gt;For Semantic View Autopilot, I would run one full dry run with production-like identities. Use an analyst identity, a service account, and the intended agent identity. Confirm that each identity sees only the expected semantic objects, receives predictable errors, and leaves useful audit records. That test catches policy gaps before they become production incidents.&lt;/p&gt;
&lt;p&gt;The agent identity matters most because it is easy to over-permission during a pilot. If the agent only needs a certified revenue view, do not give it namespace-wide table discovery. If the agent needs row-level access for one geography, test that a second geography returns a denial instead of silent leakage.&lt;/p&gt;
&lt;h2&gt;Documentation that actually helps&lt;/h2&gt;
&lt;p&gt;The documentation should fit on one page. Name the owner, the supported engines, the catalog authority, the accepted table operations, the security model, and the rollback path. If a new engineer cannot understand the contract for Semantic View Autopilot from that page, the architecture is still too implicit.&lt;/p&gt;
&lt;p&gt;Good documentation is not a wiki dump. It is an operating contract. It should say who can approve a schema change, which engine owns compaction, how long snapshots are retained, and what happens when an agent produces a suspicious result. That level of detail is what turns a promising pattern into a maintainable system.&lt;/p&gt;
&lt;h2&gt;How to keep agents in bounds&lt;/h2&gt;
&lt;p&gt;Agents should not receive broad table access just because a human can ask broad questions. For Semantic View Autopilot, expose narrow tools over certified views first. Add write-capable tools only after you have validation rules, idempotency keys, approval gates, and audit records that a reviewer can follow.&lt;/p&gt;
&lt;p&gt;The tool description should also be honest. If a tool returns estimated data, say estimated. If a tool excludes delayed transactions, say that. If a tool is read-only, make that clear in the name and policy. Agents work better when the interface gives them fewer chances to infer the wrong contract.&lt;/p&gt;
&lt;h2&gt;What to measure after launch&lt;/h2&gt;
&lt;p&gt;The first production month should be measurement-heavy. Track planning time, query latency, failed commits, denied access attempts, credential issuance, snapshot growth, and semantic-view usage. If Semantic View Autopilot is helping, the evidence should show up in fewer manual workarounds and clearer operational ownership.&lt;/p&gt;
&lt;p&gt;I would also track human trust signals. Are analysts using the certified view more often? Are engineers filing fewer tickets about unclear table ownership? Are agents producing answers that reviewers can trace back to approved definitions? Those signals tell you whether the architecture is improving daily work, not just passing a benchmark.&lt;/p&gt;
&lt;h2&gt;A buyer question worth asking&lt;/h2&gt;
&lt;p&gt;The buyer question is simple: does this pattern increase choice without weakening governance? For Semantic View Autopilot, the best answer is specific. It should name the table format, catalog contract, semantic surface, security controls, and engine support matrix. Anything less is a demo, not an operating model.&lt;/p&gt;
&lt;p&gt;This is where the architecture should stay disciplined. The point is not that open architecture is automatically better. The point is that open architecture gives you room to test engines, keep data in place, add semantic context, and still maintain control. That is a stronger argument than a generic platform claim.&lt;/p&gt;
&lt;h2&gt;A realistic rollout sequence&lt;/h2&gt;
&lt;p&gt;The rollout should start with read visibility, then move to operational automation, then consider action loops. For Semantic View Autopilot, the first milestone is a certified read path with approved semantics. The second milestone is repeatable validation through CI or scheduled checks. The third milestone is agent access with narrow tools and strict audit.&lt;/p&gt;
&lt;p&gt;Write paths should come later unless the topic itself is about write interoperability or table maintenance. Even then, begin with append-only or isolated writes. Updates, deletes, merges, and external actions need stronger controls because they change the state other people depend on.&lt;/p&gt;
&lt;h2&gt;How this should sound to executives&lt;/h2&gt;
&lt;p&gt;The executive version should avoid implementation trivia, but it should not become vague. Say that Semantic View Autopilot helps the company keep analytical data open, governed, and ready for AI-assisted work. Then say what the team will measure: cost, speed, correctness, access control, and operational effort.&lt;/p&gt;
&lt;p&gt;That framing is useful because executives do not need every catalog detail. They do need to know whether the architecture reduces lock-in, improves reliability, and gives agents a trustworthy data foundation. Those are business outcomes tied to technical choices.&lt;/p&gt;
&lt;h2&gt;How this should sound to engineers&lt;/h2&gt;
&lt;p&gt;The engineering version should be blunt. Which APIs are used? Which engine versions are approved? Which table operations are allowed? Which failures are retried? Which failures stop the workflow? Which logs prove that the right identity performed the right operation?&lt;/p&gt;
&lt;p&gt;For Semantic View Autopilot, those questions are more valuable than broad claims. They force the team to define the boundary between the open standard, the vendor implementation, the query engine, the semantic model, and the agent tool.&lt;/p&gt;
&lt;h2&gt;What not to automate yet&lt;/h2&gt;
&lt;p&gt;Do not automate the parts of Semantic View Autopilot that the team cannot explain manually. If nobody can explain the metric, the agent should not calculate it. If nobody can explain rollback, the agent should not write. If nobody can explain the security boundary, the tool should stay internal.&lt;/p&gt;
&lt;p&gt;This is not anti-automation. It is how automation earns trust. Automate the parts with clear contracts first, then widen the scope as evidence accumulates.&lt;/p&gt;
&lt;h2&gt;Source-of-truth ownership&lt;/h2&gt;
&lt;p&gt;Every production rollout needs one named source of truth for each layer. The table has an owner. The catalog has an owner. The semantic view has an owner. The agent tool has an owner. For Semantic View Autopilot, those owners may sit on different teams, but the contract between them has to be explicit.&lt;/p&gt;
&lt;p&gt;Clear ownership across all layers keeps the architecture credible, whether the governed execution and semantic layer lives in one platform or across several independent services.&lt;/p&gt;
&lt;p&gt;Clear ownership prevents avoidable production confusion.&lt;/p&gt;
&lt;h2&gt;Review cadence&lt;/h2&gt;
&lt;p&gt;Set a review cadence before the first production launch. For Semantic View Autopilot, I would review the contract after the first week, after the first month, and after the first engine or catalog upgrade. Most problems appear when a workflow that worked in a pilot meets a new version, a new identity, or a new business definition.&lt;/p&gt;
&lt;p&gt;That review should include both platform engineers and business owners. Engineers can verify the mechanics. Business owners can verify that the answers still mean what the company thinks they mean.&lt;/p&gt;
&lt;h2&gt;Launch criteria&lt;/h2&gt;
&lt;p&gt;The launch criteria should be binary. Either Semantic View Autopilot has a named owner, passing validation checks, approved security boundaries, working rollback, and documented engine support, or it is not ready. Gray areas are acceptable in a research project. They are expensive in production.&lt;/p&gt;
&lt;p&gt;This keeps the article&apos;s recommendation practical: prove the contract first, then widen adoption.&lt;/p&gt;
&lt;h2&gt;Compliance evidence&lt;/h2&gt;
&lt;p&gt;Save the evidence. For Semantic View Autopilot, keep validation output, approval records, denied-access tests, and rollback proof with the release notes. Future audits are easier when the team can show what it tested before launch.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Zero-Copy Mirroring for Modern Lakehouse Migration</title><link>https://iceberglakehouse.com/posts/zero-copy-mirroring-modern-lakehouse-migration/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/zero-copy-mirroring-modern-lakehouse-migration/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-zero-copy-mirroring-modern-lakeh...</description><pubDate>Mon, 08 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-zero-copy-mirroring-modern-lakehouse-migration/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Zero-copy mirroring gives teams a safer migration path because they can expose a lakehouse surface before they duplicate every byte or rewrite every workload. That is the useful lens for zero-copy lakehouse mirroring in June 2026. The market is not short on announcements. What matters is whether the new pattern changes ownership, performance, governance, and agent readiness in a way your team can operate.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/zero-copy-mirroring-modern-lakehouse-migration-diagram-1.png&quot; alt=&quot;zero-copy lakehouse mirroring architecture diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The market signal behind zero-copy lakehouse mirroring&lt;/h2&gt;
&lt;p&gt;Warehouse migrations fail when teams treat them like a single heroic cutover. A better pattern is to mirror access, metadata, and validation first, then move workloads only after the lakehouse proves it can preserve results and governance.&lt;/p&gt;
&lt;p&gt;I care about this topic because it sits at the boundary between open data architecture and AI execution. Most companies are not choosing one engine for every workload anymore. They have warehouses, lakehouse engines, streaming systems, catalogs, metadata platforms, and now agents that ask for data through tools. The shared contract between those systems matters more than any single feature checkbox.&lt;/p&gt;
&lt;p&gt;The vendor-neutral reading is straightforward. If the underlying table and catalog standards get stronger, buyers get more freedom to choose the right engine for each job. Snowflake, Microsoft, ClickHouse, Atlan, Dremio, and the open-source Iceberg ecosystem all point to the same market reality: data platforms are becoming multi-engine and agent-facing.&lt;/p&gt;
&lt;h2&gt;How the architecture works&lt;/h2&gt;
&lt;p&gt;The source system remains the system of record during the first phase.&lt;/p&gt;
&lt;p&gt;The lakehouse exposes equivalent tables or views through metadata translation, external table registration, or Iceberg-backed mirroring.&lt;/p&gt;
&lt;p&gt;Validation jobs compare row counts, aggregates, permissions, freshness, and query results before production traffic moves.&lt;/p&gt;
&lt;p&gt;The important architectural habit is to separate responsibilities. The table format manages files, snapshots, schema evolution, and table metadata. The catalog manages identity, namespaces, commits, and access patterns. The query engine plans and executes work. The semantic layer maps raw data into business meaning. The agent interface decides which safe tools a model can call.&lt;/p&gt;
&lt;p&gt;That separation keeps the system honest. If a vendor says a workload is open, ask which layer is open. If a feature supports Iceberg, ask which Iceberg version, which operations, and which engines. If an agent can query data, ask whether it is querying raw tables or certified semantic views.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/zero-copy-mirroring-modern-lakehouse-migration-diagram-2.png&quot; alt=&quot;Operating model diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;A concrete operating example&lt;/h2&gt;
&lt;p&gt;A finance team can mirror the month-end revenue table into an Iceberg surface and query it through Dremio while the original warehouse remains live. The migration team compares reports for several cycles before moving more dashboards.&lt;/p&gt;
&lt;p&gt;That example is intentionally operational. Architecture diagrams are useful, but the design only proves itself when a real workload runs through it. I want to know who owns the table, which catalog authorizes the operation, which engine writes, which engine reads, which semantic view users see, and how the team detects a bad result.&lt;/p&gt;
&lt;p&gt;For agentic analytics, the same example gets stricter. A human analyst can notice ambiguity and ask a teammate. An agent will often keep going unless the tool interface stops it. That means your architecture needs approved definitions, scoped access, query limits, logging, and a clean rollback path before it needs a flashy chat experience.&lt;/p&gt;
&lt;p&gt;This is why I do not treat open table formats as the whole story. Apache Iceberg gives the platform a strong storage contract. It does not, by itself, define customer lifetime value, revenue recognition rules, data owner approval, or what an AI agent may do after it finds an anomaly. Those rules belong in catalogs, semantic layers, governance systems, and agent tools.&lt;/p&gt;
&lt;h2&gt;What this means for the lakehouse&lt;/h2&gt;
&lt;p&gt;Migrations work best when they are incremental. Query federation lets teams query data in place, then move only the workloads that have evidence behind them. The approach is not one giant replacement project. It is start where you are and evolve toward an open lakehouse architecture.&lt;/p&gt;
&lt;p&gt;A lakehouse platform needs five capabilities to serve agents reliably: query federation to reduce data movement; autonomous performance using Reflections, caching, and table optimization so interactive loops stay fast; an AI Semantic Layer that gives agents approved business context; agentic interfaces through the UI, Python, or MCP-connected tools; and AI SQL functions that bring model-assisted work into SQL without exporting data.&lt;/p&gt;
&lt;h2&gt;Implementation checklist&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;What to document&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Table contract&lt;/td&gt;
&lt;td&gt;Format version, schema rules, snapshot policy, and rollback plan&lt;/td&gt;
&lt;td&gt;Engines need the same understanding of the table.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalog authority&lt;/td&gt;
&lt;td&gt;Production catalog, namespaces, commit rules, and role model&lt;/td&gt;
&lt;td&gt;Multi-engine systems need one source of table truth.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine matrix&lt;/td&gt;
&lt;td&gt;Read, write, merge, delete, schema, and view support by engine&lt;/td&gt;
&lt;td&gt;A feature is not production-ready until the exact operation is tested.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic layer&lt;/td&gt;
&lt;td&gt;Certified views, metric definitions, owners, and labels&lt;/td&gt;
&lt;td&gt;Agents need business meaning, not raw schemas alone.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Credential model, token lifetime, row filters, column masks, and audit logs&lt;/td&gt;
&lt;td&gt;Open access still needs strict governance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations&lt;/td&gt;
&lt;td&gt;Compaction, vacuum, retries, alerting, and incident ownership&lt;/td&gt;
&lt;td&gt;The design must survive failed jobs and bad deploys.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;My practical checklist for this topic is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Pick one high-value workload with known owners.&lt;/li&gt;
&lt;li&gt;Create a mirrored Iceberg or federated surface without changing downstream consumers first.&lt;/li&gt;
&lt;li&gt;Compare query results across several business cycles.&lt;/li&gt;
&lt;li&gt;Move consumers only after the old and new paths produce explainable differences.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those items are not written down, the project is still in the demo stage. That does not mean the idea is weak. It means the operating model is not finished.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/june8batch/zero-copy-mirroring-modern-lakehouse-migration-diagram-3.png&quot; alt=&quot;Implementation checklist diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Failure modes worth respecting&lt;/h2&gt;
&lt;p&gt;Mirroring can create false confidence if validation is shallow. Matching row counts is not enough. You need permission parity, metric parity, null-handling checks, late-arrival behavior, and rollback tests.&lt;/p&gt;
&lt;p&gt;The other failure mode is semantic drift. A table can be technically valid while the business definition on top of it changes quietly. That is where many AI analytics projects fail. The model generates SQL against a table that exists, the query returns rows, and the answer looks plausible. The problem is that the answer used the wrong grain, the wrong filter, or the wrong metric definition.&lt;/p&gt;
&lt;p&gt;The fix is not a longer prompt. The fix is stronger data contracts. Certified semantic views should be easier for agents to use than raw tables. Sensitive columns should be masked or hidden before the model can ask for them. Write-capable tools should require intent, validation, and idempotency. Expensive queries should have limits. Every tool call should leave evidence.&lt;/p&gt;
&lt;p&gt;This is also where vendor-neutral thinking helps. Do not trust a platform because it has the best demo. Trust the platform when it gives you clear contracts between storage, catalog, semantic layer, engine, and agent. Trust it more when you can test those contracts with another engine or another client.&lt;/p&gt;
&lt;h2&gt;What I would do first&lt;/h2&gt;
&lt;p&gt;Start with one production-shaped workflow. Do not start with the easiest toy table, and do not start with the most politically sensitive workload. Pick a table or semantic view that matters, has an owner, has known correctness checks, and can tolerate a controlled pilot.&lt;/p&gt;
&lt;p&gt;For zero-copy lakehouse mirroring, I would write down five things before touching production: the owner, the accepted engines, the policy boundary, the rollback path, and the agent-facing interface. Then I would run the same workflow three ways: manually, through the intended query engine, and through the agent or automation layer. Differences between those paths are where the real work begins.&lt;/p&gt;
&lt;p&gt;Measure boring things. Count files. Count snapshots. Track query planning time. Track storage calls. Track failed commits. Track token issuance. Track denied access. Track whether a human can explain the result without reading tool logs for an hour. These metrics are not glamorous, but they tell you whether the architecture is ready.&lt;/p&gt;
&lt;h2&gt;Final recommendation&lt;/h2&gt;
&lt;p&gt;The right conclusion is not that every team should adopt every June 2026 feature immediately. The right conclusion is that the lakehouse is becoming an execution surface for humans and agents, and that changes the quality bar. Open storage is necessary. Governed catalogs are necessary. Semantic context is necessary. Fast SQL is necessary. Scoped agent tools are necessary.&lt;/p&gt;
&lt;p&gt;That combination is exactly why the Agentic Lakehouse is becoming the right framing. It describes the platform you need when AI agents stop answering isolated questions and start participating in analytical workflows.&lt;/p&gt;
&lt;p&gt;For more background on the lakehouse and AI side of this work, explore my books on data lakehouses and AI at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;. If you want to try this style of governed, open, agent-ready architecture in practice, start a free trial of Dremio&apos;s Agentic Lakehouse at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Field notes for teams evaluating this now&lt;/h2&gt;
&lt;p&gt;First, make compatibility visible. A table-format version, catalog endpoint, and engine release should appear in your runbook. If a production issue happens, nobody should have to guess which engine wrote the latest snapshot or which client introduced a metadata change.&lt;/p&gt;
&lt;p&gt;Second, keep the semantic layer close to the workflow. If the article topic affects analytics agents, customer-facing metrics, financial reporting, or regulated data, raw-table access should be the exception. Certified views should be the normal path.&lt;/p&gt;
&lt;p&gt;Third, separate experimentation from certification. Engineers need sandboxes where they can test new Iceberg features, catalog options, and agent tools. Business users and agents need certified surfaces where definitions, owners, and policies have already been reviewed.&lt;/p&gt;
&lt;p&gt;Fourth, keep the architecture open. Not every byte must move into one platform. An architecture that can query data in place, add semantic context, accelerate common workloads, and expose governed agent interfaces over open data creates more flexibility.&lt;/p&gt;
&lt;p&gt;Fifth, publish the limits. If a feature is read-only in one engine, say so. If write interoperability is approved only for append workloads, say so. If remote signing is required for regulated tables, say so. Clear limits create trust. Hidden limits create incidents.&lt;/p&gt;
&lt;h2&gt;Identity and access review&lt;/h2&gt;
&lt;p&gt;For zero-copy lakehouse mirroring, I would run one full dry run with production-like identities. Use an analyst identity, a service account, and the intended agent identity. Confirm that each identity sees only the expected semantic objects, receives predictable errors, and leaves useful audit records. That test catches policy gaps before they become production incidents.&lt;/p&gt;
&lt;p&gt;The agent identity matters most because it is easy to over-permission during a pilot. If the agent only needs a certified revenue view, do not give it namespace-wide table discovery. If the agent needs row-level access for one geography, test that a second geography returns a denial instead of silent leakage.&lt;/p&gt;
&lt;h2&gt;Documentation that actually helps&lt;/h2&gt;
&lt;p&gt;The documentation should fit on one page. Name the owner, the supported engines, the catalog authority, the accepted table operations, the security model, and the rollback path. If a new engineer cannot understand the contract for zero-copy lakehouse mirroring from that page, the architecture is still too implicit.&lt;/p&gt;
&lt;p&gt;Good documentation is not a wiki dump. It is an operating contract. It should say who can approve a schema change, which engine owns compaction, how long snapshots are retained, and what happens when an agent produces a suspicious result. That level of detail is what turns a promising pattern into a maintainable system.&lt;/p&gt;
&lt;h2&gt;How to keep agents in bounds&lt;/h2&gt;
&lt;p&gt;Agents should not receive broad table access just because a human can ask broad questions. For zero-copy lakehouse mirroring, expose narrow tools over certified views first. Add write-capable tools only after you have validation rules, idempotency keys, approval gates, and audit records that a reviewer can follow.&lt;/p&gt;
&lt;p&gt;The tool description should also be honest. If a tool returns estimated data, say estimated. If a tool excludes delayed transactions, say that. If a tool is read-only, make that clear in the name and policy. Agents work better when the interface gives them fewer chances to infer the wrong contract.&lt;/p&gt;
&lt;h2&gt;What to measure after launch&lt;/h2&gt;
&lt;p&gt;The first production month should be measurement-heavy. Track planning time, query latency, failed commits, denied access attempts, credential issuance, snapshot growth, and semantic-view usage. If zero-copy lakehouse mirroring is helping, the evidence should show up in fewer manual workarounds and clearer operational ownership.&lt;/p&gt;
&lt;p&gt;I would also track human trust signals. Are analysts using the certified view more often? Are engineers filing fewer tickets about unclear table ownership? Are agents producing answers that reviewers can trace back to approved definitions? Those signals tell you whether the architecture is improving daily work, not just passing a benchmark.&lt;/p&gt;
&lt;h2&gt;A buyer question worth asking&lt;/h2&gt;
&lt;p&gt;The buyer question is simple: does this pattern increase choice without weakening governance? For zero-copy lakehouse mirroring, the best answer is specific. It should name the table format, catalog contract, semantic surface, security controls, and engine support matrix. Anything less is a demo, not an operating model.&lt;/p&gt;
&lt;p&gt;This is where the architecture should stay disciplined. The point is not that open architecture is automatically better. The point is that open architecture gives you room to test engines, keep data in place, add semantic context, and still maintain control. That is a stronger argument than a generic platform claim.&lt;/p&gt;
&lt;h2&gt;A realistic rollout sequence&lt;/h2&gt;
&lt;p&gt;The rollout should start with read visibility, then move to operational automation, then consider action loops. For zero-copy lakehouse mirroring, the first milestone is a certified read path with approved semantics. The second milestone is repeatable validation through CI or scheduled checks. The third milestone is agent access with narrow tools and strict audit.&lt;/p&gt;
&lt;p&gt;Write paths should come later unless the topic itself is about write interoperability or table maintenance. Even then, begin with append-only or isolated writes. Updates, deletes, merges, and external actions need stronger controls because they change the state other people depend on.&lt;/p&gt;
&lt;h2&gt;How this should sound to executives&lt;/h2&gt;
&lt;p&gt;The executive version should avoid implementation trivia, but it should not become vague. Say that zero-copy lakehouse mirroring helps the company keep analytical data open, governed, and ready for AI-assisted work. Then say what the team will measure: cost, speed, correctness, access control, and operational effort.&lt;/p&gt;
&lt;p&gt;That framing is useful because executives do not need every catalog detail. They do need to know whether the architecture reduces lock-in, improves reliability, and gives agents a trustworthy data foundation. Those are business outcomes tied to technical choices.&lt;/p&gt;
&lt;h2&gt;How this should sound to engineers&lt;/h2&gt;
&lt;p&gt;The engineering version should be blunt. Which APIs are used? Which engine versions are approved? Which table operations are allowed? Which failures are retried? Which failures stop the workflow? Which logs prove that the right identity performed the right operation?&lt;/p&gt;
&lt;p&gt;For zero-copy lakehouse mirroring, those questions are more valuable than broad claims. They force the team to define the boundary between the open standard, the vendor implementation, the query engine, the semantic model, and the agent tool.&lt;/p&gt;
&lt;h2&gt;What not to automate yet&lt;/h2&gt;
&lt;p&gt;Do not automate the parts of zero-copy lakehouse mirroring that the team cannot explain manually. If nobody can explain the metric, the agent should not calculate it. If nobody can explain rollback, the agent should not write. If nobody can explain the security boundary, the tool should stay internal.&lt;/p&gt;
&lt;p&gt;This is not anti-automation. It is how automation earns trust. Automate the parts with clear contracts first, then widen the scope as evidence accumulates.&lt;/p&gt;
&lt;h2&gt;Source-of-truth ownership&lt;/h2&gt;
&lt;p&gt;Every production rollout needs one named source of truth for each layer. The table has an owner. The catalog has an owner. The semantic view has an owner. The agent tool has an owner. For zero-copy lakehouse mirroring, those owners may sit on different teams, but the contract between them has to be explicit.&lt;/p&gt;
&lt;p&gt;Clear ownership across all layers keeps the architecture credible, whether the governed execution and semantic layer lives in one platform or across several independent services.&lt;/p&gt;
&lt;p&gt;Clear ownership prevents avoidable production confusion.&lt;/p&gt;
&lt;h2&gt;Review cadence&lt;/h2&gt;
&lt;p&gt;Set a review cadence before the first production launch. For zero-copy lakehouse mirroring, I would review the contract after the first week, after the first month, and after the first engine or catalog upgrade. Most problems appear when a workflow that worked in a pilot meets a new version, a new identity, or a new business definition.&lt;/p&gt;
&lt;p&gt;That review should include both platform engineers and business owners. Engineers can verify the mechanics. Business owners can verify that the answers still mean what the company thinks they mean.&lt;/p&gt;
&lt;h2&gt;Launch criteria&lt;/h2&gt;
&lt;p&gt;The launch criteria should be binary. Either zero-copy lakehouse mirroring has a named owner, passing validation checks, approved security boundaries, working rollback, and documented engine support, or it is not ready. Gray areas are acceptable in a research project. They are expensive in production.&lt;/p&gt;
&lt;p&gt;This keeps the article&apos;s recommendation practical: prove the contract first, then widen adoption.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>What AI Is and Isnt: A Laypersons Guide to How LLMs Actually Work</title><link>https://iceberglakehouse.com/posts/ai-for-all-levels-june-1-1-what-ai-is-and-isnt/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/ai-for-all-levels-june-1-1-what-ai-is-and-isnt/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-ai-for-all-levels-1-what-ai-is-a...</description><pubDate>Mon, 01 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-ai-for-all-levels-1-what-ai-is-and-isnt/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Welcome to &amp;quot;Catching Up with Using AI for All Levels,&amp;quot; a five-part series designed to take you from confused observer to confident AI user. This first post tackles the biggest problem with AI today: almost nobody understands what it actually is.&lt;/p&gt;
&lt;p&gt;You have probably seen the headlines. AI will replace your job. AI is a stupid autocomplete machine. AI is sentient. AI is just statistics. None of these capture the full picture, and the gap between what AI can do and what people think it can do keeps growing.&lt;/p&gt;
&lt;p&gt;This post gives you a working mental model of AI large language models in particular so you know what is really happening when you type a prompt into ChatGPT, Gemini, or Claude. You will learn about vectors and embeddings, how LLMs predict text, and the common misconceptions that lead to both overblown hype and unnecessary fear.&lt;/p&gt;
&lt;h2&gt;A Note on the Series&lt;/h2&gt;
&lt;p&gt;Before we dive in, here is a quick map of the full series so you can jump to the parts most relevant to you.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Part 1: What AI Is and Isnt (this post).&lt;/strong&gt; A plain English explanation of how LLMs work, what vectors and embeddings are, and the biggest misconceptions about AI capabilities.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Part 2: Getting Started for Free.&lt;/strong&gt; Every AI tool Google gives you for free right now from Gemini in Gmail to NotebookLM and AI Studio along with practical daily uses that cost nothing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Part 3: ChatGPT and Claude Deep Dive.&lt;/strong&gt; What you get at each paid tier, how to use desktop apps, Clips, Dispatch, and other features that change how you work.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Part 4: A Tour of Specialized AI Tools.&lt;/strong&gt; The best tools for generating music, images, and video, and how they fit into real productivity workflows.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Part 5: Going Advanced.&lt;/strong&gt; Hermes Agent, OpenCode, open weight models like DeepSeek, and running local models with Ollama for privacy and offline use.&lt;/p&gt;
&lt;h2&gt;The Core Idea: AI Predicts, It Does Not Think&lt;/h2&gt;
&lt;p&gt;Start with this single fact and everything else makes more sense. Large language models are prediction engines, not thinking machines. They take a sequence of words and predict the next most likely word. That is it. The entire multi trillion dollar AI industry rests on this one operation repeated billions of times.&lt;/p&gt;
&lt;p&gt;When you ask ChatGPT &amp;quot;What is the capital of France?&amp;quot; it does not look up the answer in a database. It does not reason about geography. It processes the sequence of words you typed and calculates that the most probable next tokens are &amp;quot;The capital of France is Paris.&amp;quot; It arrived at that answer because the training data the model saw hundreds of billions of pages of text contains that pattern many times.&lt;/p&gt;
&lt;p&gt;This is the crucial distinction that separates modern AI from human intelligence. When you reason about a problem, you start from first principles. You apply logic. You check your assumptions. The model does none of these things. It simply finds the most statistically probable output given the input it received. If the most common pattern in its training data for questions about French geography involves Paris, that is what it will output even if the question is about a different French city.&lt;/p&gt;
&lt;p&gt;This is why AI models sometimes give wrong answers with complete confidence. They are not lying. They have no concept of truth or falsehood. They are predicting the most plausible continuation based on their training data, and sometimes the most plausible continuation happens to be false. The model has no internal truth meter. It has no way to check its own work. It just predicts.&lt;/p&gt;
&lt;h3&gt;How This Plays Out in Practice&lt;/h3&gt;
&lt;p&gt;Understanding this prediction mechanism explains almost every quirk you have noticed when using AI tools.&lt;/p&gt;
&lt;p&gt;When you ask the same question twice and get different answers, it is not because the model changed its mind. The model has no mind to change. The randomness comes from a setting called temperature, which controls how aggressively the model picks the most probable word versus sampling from less probable alternatives. High temperature produces more creative, varied outputs. Low temperature produces more predictable, consistent ones. Both settings still run on pure next word prediction.&lt;/p&gt;
&lt;p&gt;When the model contradicts itself within the same conversation, it is because the context window the amount of text the model can see at once has filled up or because the model simply did not find a strong enough pattern connecting the earlier statement to the later one. The model has no memory. It only has the text currently in front of it, known as the context window, and its frozen training weights.&lt;/p&gt;
&lt;p&gt;When the model makes up facts or cites nonexistent sources, this is called hallucination. It happens because the pattern of a confident, authoritative sounding answer is statistically strong even when the specific facts in that answer are wrong. The model trained on millions of text examples where confident statements accompanied citations, so it reproduces that pattern even when the citation is made up. It does not know the difference.&lt;/p&gt;
&lt;h2&gt;The Language of AI: Vectors and Embeddings&lt;/h2&gt;
&lt;p&gt;To understand how a model predicts the next word, you need to understand vectors. This sounds technical, but the basic idea is simple.&lt;/p&gt;
&lt;p&gt;A vector is just a list of numbers. In AI, these numbers represent the meaning of a word, a sentence, or even an entire document. Think of a vector as a coordinate on a giant map of meaning. Words with similar meanings end up at nearby coordinates on this map.&lt;/p&gt;
&lt;p&gt;Here is the key insight. The model does not read words the way you do. It converts every word into a vector a set of numbers that encode the words meaning and context. This conversion is called an embedding.&lt;/p&gt;
&lt;p&gt;An embedding model takes the word &amp;quot;dog&amp;quot; and maps it to a vector in a high dimensional space. Maybe that vector is [0.23, 0.87, 0.12, 0.45, ...] with hundreds or thousands of numbers. The embedding for &amp;quot;puppy&amp;quot; lands very close to &amp;quot;dog&amp;quot; on this map. The embedding for &amp;quot;cat&amp;quot; is nearby too, but farther away. The embedding for &amp;quot;car&amp;quot; is somewhere completely different.&lt;/p&gt;
&lt;h3&gt;The Map of Meaning in More Detail&lt;/h3&gt;
&lt;p&gt;Imagine a two dimensional map where the X axis represents &amp;quot;animal versus object&amp;quot; and the Y axis represents &amp;quot;size.&amp;quot; Dog would be in the animal zone at a medium Y position. Puppy would be in the same animal zone but lower on the Y axis because it is smaller. Whale would be in the animal zone but very high on the Y axis. Car would be on the object side at varying Y positions.&lt;/p&gt;
&lt;p&gt;Real embeddings use not two dimensions but hundreds or thousands. Each dimension captures some aspect of meaning that the model discovered during training, though these dimensions do not always correspond to human interpretable concepts. The model figures out its own internal categories based on what helps predict the next word most accurately.&lt;/p&gt;
&lt;p&gt;This vector representation is what makes modern AI possible. It lets the model perform mathematical operations on meaning. You have probably seen the famous example: vector(&amp;quot;king&amp;quot;) minus vector(&amp;quot;man&amp;quot;) plus vector(&amp;quot;woman&amp;quot;) equals something close to vector(&amp;quot;queen&amp;quot;). This is not a trick. It is a direct consequence of how embeddings capture relationships in a continuous space. The vector for &amp;quot;king&amp;quot; contains the concept of royalty and the concept of masculinity. Subtracting &amp;quot;man&amp;quot; removes the masculinity component. Adding &amp;quot;woman&amp;quot; adds the feminine equivalent. The result lands near &amp;quot;queen.&amp;quot;&lt;/p&gt;
&lt;p&gt;When you type a sentence into an LLM, the model converts each word into its vector, processes those vectors through many layers of computation, and produces a vector that represents the most likely next word. Then it converts that vector back into text. This conversion loop is the fundamental operation.&lt;/p&gt;
&lt;h3&gt;Tokens: The Atomic Units of Language&lt;/h3&gt;
&lt;p&gt;You may hear the term &amp;quot;token&amp;quot; in AI discussions. Tokens are how the model actually sees text. Instead of processing word by word, most modern models break text into subword tokens. The word &amp;quot;unbelievable&amp;quot; might become [&amp;quot;un&amp;quot;, &amp;quot;believe&amp;quot;, &amp;quot;able&amp;quot;]. Common words like &amp;quot;the&amp;quot; get their own token. Rare words break into multiple tokens.&lt;/p&gt;
&lt;p&gt;On average, one English word equals roughly 1.3 tokens. This matters for two reasons. First, models have a maximum context window measured in tokens, not words. Second, API pricing is usually per token. A model with a 128,000 token context window can handle roughly 96,000 words, or about 190 pages of text. When you see a model advertised with a 1 million token context window, that is roughly 750,000 words. But bigger is not always better. Larger context windows use more memory and computation, and models sometimes struggle to find relevant information buried in very long contexts.&lt;/p&gt;
&lt;h2&gt;The Architecture: How Transformers Changed Everything&lt;/h2&gt;
&lt;p&gt;Before 2017, most language models struggled with context. They could look at a few words before the current one but lost track of anything farther back. Recurrent neural networks and LSTMs dominated the field, but they had a fundamental limitation: they processed text sequentially, one word at a time, and information had to flow through a narrow bottleneck at each step. Long sentences meant the beginning was essentially forgotten by the end.&lt;/p&gt;
&lt;p&gt;Then a team at Google published a paper called &amp;quot;Attention Is All You Need&amp;quot; that introduced the transformer architecture, and everything changed. The paper was modest in length, only about 12 pages, but its impact reshaped the entire field of AI.&lt;/p&gt;
&lt;p&gt;The transformer is the architectural backbone of every major LLM today. GPT, Claude, Gemini, Llama, DeepSeek they all use transformers. The key innovation is a mechanism called self attention.&lt;/p&gt;
&lt;h3&gt;Self Attention: The Secret Sauce&lt;/h3&gt;
&lt;p&gt;Self attention lets the model weigh the importance of every word in the input relative to every other word. When the model processes the sentence &amp;quot;The cat sat on the mat because it was tired,&amp;quot; self attention helps the model figure out that &amp;quot;it&amp;quot; refers to &amp;quot;the cat&amp;quot; not &amp;quot;the mat.&amp;quot; It does this by calculating attention scores: how strongly does each word relate to each other word?&lt;/p&gt;
&lt;p&gt;Think of it this way. When you read a sentence, you subconsciously connect pronouns to their referents, adjectives to the nouns they modify, and verbs to their subjects. Self attention does the same thing mathematically. The model computes a score for every pair of words in the input, determining how much attention each word should pay to every other word.&lt;/p&gt;
&lt;p&gt;The attention mechanism works by comparing each word against every other word in three roles: a query, a key, and a value. Think of it like a library search. The query is what you are looking for. The keys are the labels on the bookshelves. The values are the actual books. The model finds which keys match the query best, then retrieves the corresponding values.&lt;/p&gt;
&lt;p&gt;This happens for every word simultaneously. The model computes attention for all word pairs in a single parallel operation, which is why transformers are so efficient on GPUs. GPUs excel at parallel matrix math, and attention is fundamentally matrix multiplication.&lt;/p&gt;
&lt;h3&gt;The Stacked Layers&lt;/h3&gt;
&lt;p&gt;The transformer stacks many of these attention layers on top of each other. Each layer captures different levels of relationship. Early layers might capture simple grammar patterns like subject verb agreement or adjective noun pairs. Middle layers capture more complex semantics like synonyms, analogies, and coreference. Deep layers capture long range dependencies and abstract concepts like sentiment, narrative structure, or logical flow.&lt;/p&gt;
&lt;p&gt;A typical model might have 32, 64, or even 96 layers stacked together. Between each attention layer, the model also runs a feed forward network, a simpler set of computations that processes each positions representation independently. This alternating pattern of attention plus feed forward processing is the standard transformer block.&lt;/p&gt;
&lt;p&gt;After each block, the model uses residual connections and layer normalization. Residual connections simply add the input of a layer to its output, which helps information flow through the deep stack without degrading. Layer normalization adjusts the numbers to keep them in a stable range, preventing the values from growing out of control as they pass through many layers.&lt;/p&gt;
&lt;h3&gt;Position Encoding: Knowing Where Words Are&lt;/h3&gt;
&lt;p&gt;Attention mechanisms process all words simultaneously, which means the model needs some way to know the order of words. A bag of words loses all sentence structure. &amp;quot;Dog bites man&amp;quot; and &amp;quot;Man bites dog&amp;quot; have the same words in different positions with very different meanings.&lt;/p&gt;
&lt;p&gt;The original transformer solved this with position encodings, a set of sine and cosine waves at different frequencies added to each words embedding. The wave pattern tells the model where each word sits in the sequence. Newer models use learned position embeddings or rotary position embeddings (RoPE), which encode relative distances between words rather than absolute positions. RoPE is now the standard in most modern models because it handles variable length inputs more naturally.&lt;/p&gt;
&lt;p&gt;The beauty of position encoding is that it lets the model distinguish between identically worded sentences with different meanings and understand that words near each other are more likely to be related than words far apart.&lt;/p&gt;
&lt;h2&gt;What Training Actually Means&lt;/h2&gt;
&lt;p&gt;Training an LLM sounds mysterious, but the process is straightforward in concept. You take a massive amount of text trillions of words scraped from the internet, books, academic papers, and code repositories. You show pieces of that text to the model and ask it to predict the next word. You compare the models prediction to the actual next word. You adjust the models internal parameters to make the prediction slightly better next time. Repeat this a few trillion times.&lt;/p&gt;
&lt;h3&gt;Pre Training: The Main Event&lt;/h3&gt;
&lt;p&gt;The first and most expensive phase is pre training. This is where the model learns language structure, factual knowledge, and reasoning patterns from raw text. The model starts with random weights and gradually converges toward useful ones. Pre training a state of the art model costs tens or hundreds of millions of dollars in compute alone and can take months even on thousands of GPUs running in parallel.&lt;/p&gt;
&lt;p&gt;The &amp;quot;parameters&amp;quot; in a model are the numbers that define how it processes inputs. Each parameter is a weight that determines how much influence one part of the model has on another. Training adjusts these weights to minimize prediction error. After training, the model has encoded statistical patterns from the training data into its weights. It has no memory of specific training examples. It only has the compressed statistical essence of all that text.&lt;/p&gt;
&lt;p&gt;Think of it as extreme compression. The entire public internet in 2025 was roughly 100 petabytes of text. A 700 billion parameter model stored at 16 bit precision takes about 1.4 terabytes. The model compresses all that knowledge into 1.4 terabytes of weights. The compression is lossy, which is why the model forgets details, mixes things up, and makes things up. But the compression is also remarkably effective at preserving high level patterns.&lt;/p&gt;
&lt;h3&gt;Fine Tuning: Teaching Manners and Format&lt;/h3&gt;
&lt;p&gt;Pre training produces a raw model that can predict text but has no instruction following ability. If you ask a raw pre trained model &amp;quot;What is the capital of France?&amp;quot; it might continue the text with &amp;quot;of France is a beautiful country with many famous landmarks...&amp;quot; because that is a statistically likely continuation of the phrase &amp;quot;the capital of France.&amp;quot;&lt;/p&gt;
&lt;p&gt;Fine tuning converts the raw predictor into an assistant. This usually involves two stages.&lt;/p&gt;
&lt;p&gt;First, supervised fine tuning (SFT). You create a dataset of instruction response pairs. Humans write examples like &amp;quot;What is the capital of France?&amp;quot; with the correct answer &amp;quot;Paris.&amp;quot; The model trains on these pairs to learn the instruction following format. This stage is relatively cheap compared to pre training.&lt;/p&gt;
&lt;p&gt;Second, reinforcement learning from human feedback (RLHF). This is the secret sauce that makes models helpful and safe. Human raters compare multiple model outputs for the same prompt and rank them by quality. The model learns to prefer outputs that humans rated highly. This process aligns the model with human preferences: be helpful, be honest, avoid harmful content.&lt;/p&gt;
&lt;h3&gt;Why Models Hallucinate&lt;/h3&gt;
&lt;p&gt;This training process explains why models hallucinate. The model learned from pre training that confident, detailed answers are statistically common in its training data. It learned from fine tuning that it should always try to answer rather than saying &amp;quot;I don&apos;t know.&amp;quot; The combination creates a system that generates plausible sounding answers regardless of their factual accuracy.&lt;/p&gt;
&lt;p&gt;The model has no mechanism to distinguish between &amp;quot;I know this fact from my training data&amp;quot; and &amp;quot;I am making up something that sounds like the kind of fact I would know.&amp;quot; Both cases produce the same kind of output: a confident, well structured answer.&lt;/p&gt;
&lt;h2&gt;Common Misconceptions About AI&lt;/h2&gt;
&lt;p&gt;Lets address the most common misunderstandings directly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Misconception: AI understands what it is saying.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The model does not understand anything in the human sense. It has no consciousness, no beliefs, no preferences. When it says &amp;quot;I think&amp;quot; or &amp;quot;In my opinion,&amp;quot; those are linguistic patterns it has learned from human text. The model has no thoughts or opinions to express. It is generating text that matches the pattern of a human giving an opinion.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Misconception: AI is just autocomplete.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This one is closer to true than the sentience claim, but it undersells what the technology can do. Yes, the core mechanism is next word prediction. But the scale and architecture create emergent capabilities that simple autocomplete cannot match. A model with hundreds of billions of parameters trained on the entire public internet can write code, solve math problems, translate languages, and generate creative fiction. Calling it &amp;quot;just autocomplete&amp;quot; is like calling a modern smartphone &amp;quot;just a walkie talkie.&amp;quot; Technically true at the most basic level, but missing the point entirely.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Misconception: AI will replace all jobs immediately.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This fear has some basis but the timeline is consistently overstated. AI today is a powerful tool that augments human capability rather than replacing it entirely. It excels at specific tasks: drafting text, summarizing documents, generating code snippets, brainstorming ideas. It struggles with tasks that require physical presence, complex negotiation, long term strategic thinking with incomplete information, and tasks where mistakes have serious consequences without human oversight.&lt;/p&gt;
&lt;p&gt;The real pattern is not replacement. It is shift. People who use AI effectively will become more productive than those who do not, and some roles will shrink. But the economy adapts, and new roles emerge. The best defense is to learn how to use these tools now.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Misconception: AI is biased because the developers are biased.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Models reflect the data they were trained on. The internet contains plenty of biased, racist, sexist, and otherwise problematic content. When a model produces biased output, it is reproducing patterns from its training data. It has no intent. The bias is a data quality problem, not a moral failing of the model. This is why alignment training, instruction tuning, and safety filters exist. Companies spend significant effort to reduce harmful outputs, but the underlying training data still shapes the models behavior.&lt;/p&gt;
&lt;p&gt;The real challenge is that bias is subtle. A model trained mostly on English language content from Western sources will have a Western centric worldview. It will be better at answering questions about US history than about Southeast Asian history. It will default to cultural norms from the training data. This is not malice. It is a reflection of what data was available and what was prioritized during training.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Misconception: You need to be a programmer to use AI.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This might have been true in 2022, but it is completely wrong in 2026. The best AI tools have chat interfaces, mobile apps, voice input, and integrations with everyday software like email and calendars. You do not need to write a single line of code to get real value from these tools. Parts 2 and 3 of this series focus entirely on non technical use cases.&lt;/p&gt;
&lt;p&gt;The most popular AI applications today Chrome, Gmail, Google Docs, Microsoft Office all have AI features built directly into the interface. You click a button that says &amp;quot;Help me write&amp;quot; or &amp;quot;Summarize this email.&amp;quot; No prompt engineering required. No API keys. No code.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Misconception: AI is getting smarter every day.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This one partially true but misleading. The AI you use today is the same AI you used last month. Models do not learn from your conversations. They are frozen snapshots of a training process that happened months ago. When you hear about AI &amp;quot;improving,&amp;quot; it usually means a new model version was released, not that the model you are using got smarter on its own.&lt;/p&gt;
&lt;p&gt;The rapid pace of new model releases creates the illusion of continuous improvement. OpenAI releases GPT 5.1, then 5.2, then 5.3. Anthropic releases Claude Opus 4.5, then 4.6. Each version is a new frozen model with better training, not the same model learning over time.&lt;/p&gt;
&lt;h2&gt;The Limits of Current AI&lt;/h2&gt;
&lt;p&gt;Understanding what AI cannot do is just as important as understanding what it can.&lt;/p&gt;
&lt;p&gt;AI cannot reason reliably. It can pattern match its way to correct answers on many reasoning tasks, but it falls apart on problems that require genuine logical deduction. Change a few details in a math word problem and the model might fail entirely because it was matching the pattern of similar problems rather than reasoning through the new one.&lt;/p&gt;
&lt;p&gt;AI cannot plan over long horizons. If you ask it to write a novel outline, it will produce something that looks reasonable. But if you ask it to write the novel one chapter at a time, it will often contradict itself or lose the thread by chapter five. Each prediction step is local, and there is no mechanism for maintaining global coherence over very long outputs.&lt;/p&gt;
&lt;p&gt;AI cannot learn from experience in the moment. When you correct the model mid conversation, it appears to learn. It says &amp;quot;You are right, I apologize for the error.&amp;quot; But the next time you start a fresh conversation, it will make the same mistake again. The model has no persistent memory of your conversation unless the application explicitly saves context. Each conversation starts from the same frozen set of trained weights.&lt;/p&gt;
&lt;p&gt;AI cannot verify its own outputs. The model cannot check whether its answer is correct because it has no internal mechanism for truth. It can only generate text that resembles correct answers it has seen. This is why human verification remains essential for any high stakes use case.&lt;/p&gt;
&lt;h2&gt;The Practical Takeaway&lt;/h2&gt;
&lt;p&gt;Here is how to think about AI productively. Treat it as the worlds fastest pattern matcher with the broadest training data ever assembled. It is incredibly good at tasks that involve transforming one form of text into another: summarizing a long document, translating between languages, converting a bullet list into a paragraph, turning a description into code. It is good at generating plausible first drafts that a human can refine. It is good at brainstorming and exploring ideas.&lt;/p&gt;
&lt;p&gt;It is bad at tasks that require precision, factual accuracy, logical consistency over long chains of reasoning, and any task where a confident wrong answer causes real harm.&lt;/p&gt;
&lt;p&gt;Use the tool for what it is good at. Verify everything important. And as you go through the rest of this series, you will see how to apply this mental model across free tools, paid services, specialized generators, and even local open source models.&lt;/p&gt;
&lt;h2&gt;Looking Ahead&lt;/h2&gt;
&lt;p&gt;Now that you understand the fundamentals, you are ready for Part 2: Getting Started for Free. That post walks through every AI tool Google gives you at no cost, including Gemini in Gmail, NotebookLM for research, Google AI Studio for experimentation, and a dozen other services you probably already have access to. No credit card required, no upgrade needed.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-2-getting-started-for-free/&quot;&gt;Continue to Part 2: Getting Started for Free&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-3-chatgpt-and-claude-deep-dive/&quot;&gt;Skip to Part 3: ChatGPT and Claude Deep Dive&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-4-specialized-ai-tools/&quot;&gt;Skip to Part 4: Specialized AI Tools for Creation&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-5-going-advanced/&quot;&gt;Skip to Part 5: Going Advanced: Open Source, Local Models, and Agent Tools&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Getting Started with AI for Free: Every Tool Google Gives You at No Cost</title><link>https://iceberglakehouse.com/posts/ai-for-all-levels-june-1-2-getting-started-for-free/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/ai-for-all-levels-june-1-2-getting-started-for-free/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-ai-for-all-levels-2-getting-star...</description><pubDate>Mon, 01 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-ai-for-all-levels-2-getting-started-for-free/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Most people think you need to pay $20 or $200 a month to get value from AI. That is not true. If you have a Google account which is free and most people do you already have access to a surprisingly powerful set of AI tools. No credit card required. No upgrade needed.&lt;/p&gt;
&lt;p&gt;This is Part 2 of &amp;quot;Catching Up with Using AI for All Levels.&amp;quot; In Part 1, we covered what AI actually is under the hood: prediction engines built on vectors and transformers, not thinking machines. Now we get practical. This post walks through every AI tool Google provides at no cost, what each one does well, and concrete ways to use them in your daily life. Part 3 will cover ChatGPT and Claude for when you are ready to pay. Part 4 covers specialized creative tools. Part 5 goes deep into open source and local models.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;If you missed Part 1: What AI Is and Isnt, start there for the foundation.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-3-chatgpt-and-claude-deep-dive/&quot;&gt;Continue to Part 3: ChatGPT and Claude Deep Dive&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-4-specialized-ai-tools/&quot;&gt;Skip to Part 4: Specialized AI Tools for Creation&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-5-going-advanced/&quot;&gt;Skip to Part 5: Going Advanced: Open Source, Local Models, and Agent Tools&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Google AI Ecosystem: A Quick Overview&lt;/h2&gt;
&lt;p&gt;Google has been an AI company longer than most people realize. They acquired DeepMind in 2014. They pioneered the transformer architecture that powers every major LLM today. They have the largest AI research organization in the world. And they have been quietly integrating AI into their free products for years.&lt;/p&gt;
&lt;p&gt;The 2026 free tier is more generous than any other major AI company offers. Here is what you get at no cost:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Gemini (the chatbot).&lt;/strong&gt; Access to Google&apos;s Gemini model through gemini.google.com, the Gemini mobile app, and integrations across Google services. The free tier includes text chat, voice input, image analysis, and limited file uploads.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Gemini in Gmail and Google Docs.&lt;/strong&gt; The &amp;quot;Help me write&amp;quot; feature that drafts emails, summarizes threads, and polishes text. This is built into your existing free Google account.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;NotebookLM.&lt;/strong&gt; Google&apos;s AI research assistant. Upload PDFs, YouTube videos, websites, and audio files. Get summaries, ask questions about your sources, and generate audio overviews. The free tier gives you 100 notebooks with 50 sources per notebook.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Google AI Studio.&lt;/strong&gt; A web based playground for experimenting with Gemini models directly. You get API access with free rate limits for prototyping and testing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Gemini CLI.&lt;/strong&gt; Run Gemini models from your terminal or scripts. Up to 1,000 requests per day with a Google account.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Google Antigravity.&lt;/strong&gt; Google&apos;s agent first IDE is free during public preview. Unlimited tab completions and command requests with weekly rate limits for fairness.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Gemini Code Assist for Individuals.&lt;/strong&gt; Free code completion and generation in VS Code, JetBrains, and other IDEs. Up to 180,000 code completions per month with no credit card.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Classic AI APIs.&lt;/strong&gt; Translation (500,000 characters/month), Speech to Text (60 minutes/month), Text to Speech (4 million standard plus 1 million WaveNet characters/month), Cloud Vision (1,000 units/month), Natural Language API (5,000 units/month), and Video Intelligence (1,000 minutes/month). All free with no expiration.&lt;/p&gt;
&lt;p&gt;Let us go through each one in detail with real productivity examples.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Gemini: Your Free AI Assistant&lt;/h2&gt;
&lt;p&gt;Gemini is Google&apos;s direct competitor to ChatGPT and Claude. The free tier is more limited than the paid versions, but it still covers a lot of ground.&lt;/p&gt;
&lt;h3&gt;What You Get for Free&lt;/h3&gt;
&lt;p&gt;The free tier gives you access to Gemini Pro models with text chat, voice input, and image understanding. You can upload images and ask questions about them. You can have it read and summarize PDFs and documents up to certain size limits. You get the Gemini mobile app on iOS and Android with voice conversations.&lt;/p&gt;
&lt;p&gt;What you do not get: access to Gemini Ultra or the latest flagship models, Gemini Spark for real time web searching, larger file uploads, priority access during high traffic, or integration with Google Calendar and Tasks. Those require a Google AI Premium ($20/month) or AI Ultra ($100/month) subscription.&lt;/p&gt;
&lt;h3&gt;Daily Productivity Uses for Free Gemini&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Email drafting.&lt;/strong&gt; Open Gmail, click the &amp;quot;Help me write&amp;quot; button, and describe the email you want to send. &amp;quot;Draft a short email to my team about rescheduling tomorrow&apos;s standup to 10:30 AM.&amp;quot; It writes the email in your voice. You review and send. This alone saves several minutes per email, and if you send ten emails a day, that adds up to nearly an hour a week.&lt;/p&gt;
&lt;p&gt;One trick that regular users discover is that you can train Gemini to match your writing style. Before asking it to draft an email, paste in three emails you wrote recently and say &amp;quot;Match this style.&amp;quot; The model picks up on your sentence length, formality level, and common phrases. Over time, the drafts need less editing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Document polishing.&lt;/strong&gt; In Google Docs, use &amp;quot;Help me write&amp;quot; to rewrite a paragraph, make it more concise, or change the tone. This works on any document in your Google Drive. The most practical use is the &amp;quot;Make shorter&amp;quot; and &amp;quot;Make more formal&amp;quot; buttons. Your first draft goes into the document. You mark the rough paragraph and ask for a polished version. Review the suggestions, accept the ones that work, and move on.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Quick research.&lt;/strong&gt; Ask Gemini to explain a concept, summarize a news article, or give you an overview of a topic. It has access to current information through Google Search integration on the free tier. The research quality depends on how specific you are. &amp;quot;Explain how mortgages work&amp;quot; gives you a generic overview. &amp;quot;Compare 30 year fixed rate mortgages to 5 year adjustable rate mortgages for someone planning to stay in their home for 3 years&amp;quot; gives you a focused answer that helps with a real decision.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Image analysis.&lt;/strong&gt; Take a photo of a whiteboard after a meeting and ask Gemini to transcribe and organize the notes. Take a photo of a nutrition label and ask for a summary of the ingredients. Take a photo of a plant with brown leaves and ask what might be wrong. The vision capabilities work across photos, screenshots, and scanned documents.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brainstorming.&lt;/strong&gt; &amp;quot;Give me ten ideas for team building activities that cost less than $50.&amp;quot; &amp;quot;Suggest three different ways to structure this presentation.&amp;quot; &amp;quot;Help me come up with a name for my new side project.&amp;quot; The key to good brainstorming with AI is to ask for quantity first, then refine. Ask for 20 ideas, not 5. Many will be unusable, but the few good ones make the exercise worthwhile.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Translation.&lt;/strong&gt; &amp;quot;Translate this email from Spanish to English.&amp;quot; &amp;quot;What does this French menu item mean?&amp;quot; Gemini handles over 100 languages. The translation quality is competitive with dedicated translation tools for common language pairs like Spanish English, French English, and Chinese English.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Content repurposing.&lt;/strong&gt; Paste a long memo into Gemini and ask for three versions: a one paragraph summary, a bullet point list of key takeaways, and a social media post announcing the main finding. This turns one piece of content into three formats in under a minute.&lt;/p&gt;
&lt;h3&gt;Where Free Gemini Falls Short&lt;/h3&gt;
&lt;p&gt;The free tier has lower rate limits than paid. During peak usage, you might get slower responses. You cannot use it for automated workflows or API calls at scale. The context window is smaller than the paid version, meaning it handles shorter documents. You also cannot connect it to your Google Calendar or Tasks to manage your schedule.&lt;/p&gt;
&lt;p&gt;If you find yourself hitting these limits regularly, the $20/month Google AI Premium plan removes most of them and adds integration with Google apps. But for casual daily use, the free tier is genuinely useful.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;NotebookLM: Your Personal Research Assistant&lt;/h2&gt;
&lt;p&gt;NotebookLM is arguably Google&apos;s most underrated free AI tool. It is a research assistant that works with your own documents. You upload sources, and the AI answers questions based only on those sources. It does not go off script. It does not make up facts from its general training data. It works from your materials.&lt;/p&gt;
&lt;h3&gt;What Makes NotebookLM Different&lt;/h3&gt;
&lt;p&gt;Unlike Gemini or ChatGPT, NotebookLM has a source grounded architecture. When you ask a question, it searches only the sources you uploaded and generates answers from that content. Every answer includes citations showing exactly which source and passage it used. This makes it much more reliable for fact based work.&lt;/p&gt;
&lt;p&gt;The free tier allows 100 notebooks with up to 50 sources each. Each source can contain up to 500,000 words. That is roughly 1,000 pages per source, or 50,000 pages per notebook. You are not going to hit these limits with normal use.&lt;/p&gt;
&lt;p&gt;NotebookLM supports PDFs, Google Docs, websites (paste a URL), YouTube videos (paste a link), and audio files. The audio feature is particularly notable. You can upload a recording of a meeting or lecture, and NotebookLM will transcribe it and let you ask questions about the content.&lt;/p&gt;
&lt;h3&gt;Audio Overviews: The Hidden Gem&lt;/h3&gt;
&lt;p&gt;NotebookLM can generate an Audio Overview a podcast style conversation between two AI hosts that discusses your sources. This is surprisingly good. The hosts summarize the material, make connections between sources, and ask each other questions. It sounds like a real podcast discussion about your research.&lt;/p&gt;
&lt;p&gt;This feature is useful for processing long documents while doing other things. Upload a 50 page report, generate an Audio Overview, and listen to the summary during your commute. You still need to read the original for details, but the overview gives you the big picture.&lt;/p&gt;
&lt;h3&gt;Daily Productivity Uses for NotebookLM&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Meeting prep.&lt;/strong&gt; Upload the agenda, previous meeting notes, and relevant documents. Ask NotebookLM to summarize the key decisions from the last meeting and list open action items. &amp;quot;Based on these notes, what topics are likely to come up in today&apos;s meeting?&amp;quot; The source grounding means every claim in the answer can be traced back to a specific document.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Research synthesis.&lt;/strong&gt; Upload five articles on the same topic. Ask NotebookLM to identify areas of agreement and disagreement between the sources. Ask it to extract all statistics mentioned across the sources into a table. Ask it to create a timeline of events described across the documents. This turns an hour of reading into five minutes of review.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning a new topic.&lt;/strong&gt; Upload a textbook chapter, a video transcript from a course, and a few relevant articles. Use NotebookLM as a tutor. &amp;quot;Explain this concept as if I were a beginner.&amp;quot; &amp;quot;Give me practice questions about chapter 3.&amp;quot; &amp;quot;Create a study guide with the ten most important concepts from these sources.&amp;quot; The fact that answers are source grounded means you can dig into the original material when something is unclear.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Job applications.&lt;/strong&gt; Upload the job description and your resume. Ask NotebookLM to identify gaps between your experience and the requirements, then suggest how to frame your experience for the role. &amp;quot;What keywords from the job description should I include in my cover letter?&amp;quot; Upload the company&apos;s About page and recent press releases, then ask for insights to use in an interview.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Legal and compliance reading.&lt;/strong&gt; Upload contracts or policy documents. Ask NotebookLM to extract key obligations, deadlines, and restrictions. &amp;quot;List all dates mentioned in this contract and what happens on each date.&amp;quot; &amp;quot;What are the termination conditions?&amp;quot; The source grounding means you can trust the answers more than a general chatbot, but you should still read the original document for anything important.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Client research.&lt;/strong&gt; Before a client meeting, upload their website, recent press releases, annual report, and any previous correspondence. Ask NotebookLM to create a briefing document: company overview, recent developments, likely priorities, and potential talking points. The briefing takes minutes to generate instead of hours of manual research.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Content creation workflow.&lt;/strong&gt; Upload your notes and research into NotebookLM. Ask it to organize the information into an outline. Export the outline to Google Docs and use Gemini to expand each section. Use the Audio Overview feature to generate a podcast style summary of the finished piece. One flow from research to outline to draft to audio, all within Google&apos;s free tools.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Google AI Studio: Where to Experiment&lt;/h2&gt;
&lt;p&gt;Google AI Studio is a web based environment for testing Gemini models directly. It is aimed at developers and curious users who want to understand how models behave.&lt;/p&gt;
&lt;p&gt;The free tier gives you API access with rate limits that are generous enough for learning and prototyping. You can try different models, adjust parameters like temperature and top_p, and see how the model responds to different prompts. The interface shows you the token usage, latency, and safety attributes for each request.&lt;/p&gt;
&lt;h3&gt;What You Can Do with AI Studio&lt;/h3&gt;
&lt;p&gt;You do not need to be a developer to use AI Studio. The interface is point and click. You type a prompt, pick a model, and see the output. But the real power comes from being able to tune parameters and see the effects.&lt;/p&gt;
&lt;p&gt;Try setting temperature to 0 and asking the same question three times. You get the same answer each time. Then set temperature to 1 and ask the same question. The answers vary. This is a practical way to understand the prediction mechanism we covered in Part 1.&lt;/p&gt;
&lt;p&gt;AI Studio also includes system instructions, a feature that lets you set the behavior and persona of the model before the conversation starts. &amp;quot;You are a Spanish tutor. Keep responses under 100 words. Include one grammar tip each response.&amp;quot; This changes how the model behaves without you having to repeat instructions in every message.&lt;/p&gt;
&lt;h3&gt;Why You Should Try AI Studio&lt;/h3&gt;
&lt;p&gt;Even if you never write code, spending 30 minutes in AI Studio will improve how you use every other AI tool. You will see how small changes in prompt wording change the output. You will understand why the same model gives different answers to the same question. You will learn about temperature, top_p, and system instructions concepts that apply across ChatGPT, Claude, and every other AI service.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Gemini Code Assist: AI for Your Coding (Even Beginners)&lt;/h2&gt;
&lt;p&gt;Gemini Code Assist for Individuals is free, requires no credit card, and works with VS Code, JetBrains IDEs, and other popular editors. It provides real time code completion, code generation from comments, and debugging assistance.&lt;/p&gt;
&lt;h3&gt;Wait, I Do Not Code&lt;/h3&gt;
&lt;p&gt;Fair point. If you are not a developer, this section might not apply directly. But keep reading for two reasons.&lt;/p&gt;
&lt;p&gt;First, you might encounter situations where a small script would save you hours of manual work. With Code Assist, you can describe what you want in plain English: &amp;quot;Write a Python script that renames all files in a folder to match a pattern.&amp;quot; It generates the code. You run it. Problem solved without learning to code.&lt;/p&gt;
&lt;p&gt;Second, you can use Gemini CLI even without being a programmer. It lets you run Gemini from the command line. The command &lt;code&gt;gemini ask &amp;quot;What is the weather in Chicago today?&amp;quot;&lt;/code&gt; works on any system with the CLI installed. No coding required.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Classic Google AI APIs: Free and Overlooked&lt;/h2&gt;
&lt;p&gt;Google Cloud offers free tiers for a dozen AI services that reset monthly and never expire. Most people do not know these exist. Here is what stands out.&lt;/p&gt;
&lt;h3&gt;Translation API&lt;/h3&gt;
&lt;p&gt;500,000 characters free per month. That is roughly 80,000 words of translation every month at no cost. You can use it through Google Translate on the web, which is free anyway, but the API lets developers integrate translation into applications. For personal use, the Google Translate app and website already give you the same capability.&lt;/p&gt;
&lt;h3&gt;Speech to Text&lt;/h3&gt;
&lt;p&gt;60 free minutes of transcription per month. Upload an audio file and get a timestamped transcript. This is useful for transcribing meeting recordings, lectures, interviews, or voice memos. The accuracy is good for clear audio and degrades with background noise or heavy accents.&lt;/p&gt;
&lt;h3&gt;Text to Speech&lt;/h3&gt;
&lt;p&gt;4 million characters of standard voices plus 1 million characters of WaveNet (high quality) voices per month. Convert any text into natural sounding speech. Useful for creating audio versions of documents, accessibility, or language learning.&lt;/p&gt;
&lt;h3&gt;Cloud Vision&lt;/h3&gt;
&lt;p&gt;1,000 free units per month. Upload an image and get back detected objects, faces, text, and landmarks. This is the technology behind Google Lens. On the free tier, you can use it through the Google Cloud Console web interface without writing any code.&lt;/p&gt;
&lt;h3&gt;Natural Language API&lt;/h3&gt;
&lt;p&gt;5,000 units per month for entity analysis, sentiment analysis, and syntax analysis. Paste in a block of text and get back the detected people, places, organizations, and the overall sentiment. Useful for analyzing customer feedback or social media mentions.&lt;/p&gt;
&lt;h3&gt;Video Intelligence&lt;/h3&gt;
&lt;p&gt;1,000 free minutes per month. Upload a video and get shot detection, label detection, and explicit content detection. This is more specialized, but if you work with video content, the free tier gives you significant processing capacity.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Google Antigravity: The New Player&lt;/h2&gt;
&lt;p&gt;Google Antigravity is Google&apos;s agent first IDE announced at Google I/O 2026. It is free during public preview. Think of it as AI development environment that goes beyond code completion. It integrates Gemini directly into the development workflow with tab completions, command requests, and agentic assistance.&lt;/p&gt;
&lt;p&gt;For non developers, Antigravity is less relevant today. But if the preview hints at the direction Google is heading, we can expect more agent driven tools that handle complex multi step tasks on your behalf. Keep an eye on this one.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Practical Workflows: Putting Free AI to Work&lt;/h2&gt;
&lt;p&gt;Here are complete workflows that use only free Google AI tools.&lt;/p&gt;
&lt;h3&gt;Morning Information Intake&lt;/h3&gt;
&lt;p&gt;Wake up. Open Gemini on your phone. Say &amp;quot;What happened overnight that I should know about?&amp;quot; Gemini summarizes news based on your interests. If you have a Google Nest Hub, Gemini is built into the smart display and can give you a verbal briefing while you make coffee.&lt;/p&gt;
&lt;h3&gt;Meeting Follow Up&lt;/h3&gt;
&lt;p&gt;After a meeting, upload the recording to NotebookLM. It transcribes the audio automatically. Ask NotebookLM: &amp;quot;Summarize the key decisions from this meeting. List the action items with owners. Identify any dates mentioned.&amp;quot; Export the summary as a document and share it with attendees.&lt;/p&gt;
&lt;h3&gt;Research for a Purchase Decision&lt;/h3&gt;
&lt;p&gt;You are buying a new laptop. Collect links to five review sites, two comparison articles, and the official specs sheet. Paste the URLs into NotebookLM sources. Ask: &amp;quot;Compare the MacBook Air M4 and the Dell XPS 16 on build quality, performance, battery life, and price. Create a table.&amp;quot; The answers are sourced from your documents, not from the model&apos;s general knowledge.&lt;/p&gt;
&lt;h3&gt;Writing Assistance&lt;/h3&gt;
&lt;p&gt;Open a Google Doc. Write a rough draft of your email or document. Use the &amp;quot;Help me write&amp;quot; sidebar to polish the language, shorten it, or adjust the tone. Then use Gemini to check: &amp;quot;Read this and tell me if there are any unclear sentences.&amp;quot;&lt;/p&gt;
&lt;h3&gt;Language Learning&lt;/h3&gt;
&lt;p&gt;Use Gemini for conversation practice in your target language. &amp;quot;Let us practice Spanish. Ask me questions about my weekend and correct my grammar mistakes.&amp;quot; Use the Text to Speech API through the Google Cloud Console to hear native pronunciation. Upload articles in your target language to NotebookLM and ask for translations of specific passages.&lt;/p&gt;
&lt;h3&gt;Workflow Automation (Low Code)&lt;/h3&gt;
&lt;p&gt;Install the Gemini CLI. Create a simple script that checks your Gmail for specific types of messages and summarizes them. The free tier handles 1,000 requests per day, which is more than enough for personal use. You do not need to be a programmer to copy a script from the documentation and run it.&lt;/p&gt;
&lt;p&gt;Here is a concrete example. Set up a recurring task that uses Gemini CLI to read the headlines from a few news RSS feeds and summarize them into a morning briefing. Save the output to a Google Doc. The whole pipeline runs automatically without you touching anything.&lt;/p&gt;
&lt;h3&gt;Teaching and Tutoring&lt;/h3&gt;
&lt;p&gt;Use Gemini as a practice partner for learning. Studying for a certification exam? Ask Gemini to quiz you on each topic. Learning a new language? Have Gemini conduct a conversation and correct your mistakes. Preparing for a presentation? Practice your talking points and ask Gemini to critique your logic and suggest better examples.&lt;/p&gt;
&lt;p&gt;The key is to treat AI as a patient, always available practice partner that never gets bored or impatient. It will quiz you on the same topic fifty times if that is what you need.&lt;/p&gt;
&lt;h3&gt;Travel Planning&lt;/h3&gt;
&lt;p&gt;Use Gemini to research destinations, compare flight options, and create itinerary drafts. &amp;quot;Plan a 5 day trip to Lisbon focused on food and history. Include free activities for two of the days.&amp;quot; Then use NotebookLM to store all your research: hotel confirmations, restaurant recommendations, museum hours, and transportation options. One notebook per trip keeps everything organized and searchable.&lt;/p&gt;
&lt;h3&gt;Google Photos and Google Lens: AI You Are Already Using&lt;/h3&gt;
&lt;p&gt;Two more free AI tools worth mentioning because you probably use them without thinking about it.&lt;/p&gt;
&lt;p&gt;Google Photos uses AI for automatic categorization. You can search &amp;quot;dogs&amp;quot; and it finds every photo containing a dog. You can search &amp;quot;beach&amp;quot; and it finds vacation photos. You can search &amp;quot;birthday cake&amp;quot; and it surfaces party photos. The AI identifies faces, objects, locations, and even specific events in your photo library. All of this runs for free on your Google account with 15GB of storage.&lt;/p&gt;
&lt;p&gt;Google Lens is image recognition built into your phone&apos;s camera. Point it at a landmark and it identifies the building. Point it at a plant and it tells you the species. Point it at a document and it extracts the text. Point it at a product and it finds prices online. Lens uses the same underlying vision AI as Cloud Vision but through a free consumer interface. It is one of the most practical AI tools you already have in your pocket.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;When Free Is Not Enough&lt;/h2&gt;
&lt;p&gt;The free tier is generous, but it has limits. Here are the signs that you might benefit from a paid subscription.&lt;/p&gt;
&lt;p&gt;You hit rate limits regularly. Gemini gets slow during peak hours. You need to process very long documents frequently. You want to connect AI to your calendar, tasks, and other personal data. You need priority access for time sensitive work.&lt;/p&gt;
&lt;p&gt;Google&apos;s paid tiers start at $20/month for Google AI Premium (formerly Google One AI Premium), which gives you Gemini Ultra access, larger context windows, Google Workspace integration, and priority access. The $100/month AI Ultra plan adds higher usage limits and access to the latest models.&lt;/p&gt;
&lt;p&gt;But before you pay, ask yourself whether the free tools actually solve your problems. Many people pay for AI subscriptions out of FOMO and end up using the same features they had for free. Start with the free tier. Use it for a month. If you hit real limitations, then consider upgrading.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;What Comes Next&lt;/h2&gt;
&lt;p&gt;This post covered more than a dozen free AI tools available through your Google account. The most useful for daily productivity are Gemini (chat and writing), NotebookLM (research), AI Studio (learning and experimentation), and the classic APIs for translation, transcription, and vision.&lt;/p&gt;
&lt;p&gt;Part 3 of this series covers ChatGPT and Claude, the two main competitors to Google&apos;s offerings. You will learn what you get at each paid tier, how to use desktop apps and advanced features like Clips and Dispatch, and whether the paid plans are worth it for your use case.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-1-what-ai-is-and-isnt/&quot;&gt;Return to Part 1: What AI Is and Isnt&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-3-chatgpt-and-claude-deep-dive/&quot;&gt;Continue to Part 3: ChatGPT and Claude Deep Dive&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-4-specialized-ai-tools/&quot;&gt;Skip to Part 4: Specialized AI Tools for Creation&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-5-going-advanced/&quot;&gt;Skip to Part 5: Going Advanced: Open Source, Local Models, and Agent Tools&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>ChatGPT and Claude: Which AI Service Should You Pay For</title><link>https://iceberglakehouse.com/posts/ai-for-all-levels-june-1-3-chatgpt-and-claude-deep-dive/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/ai-for-all-levels-june-1-3-chatgpt-and-claude-deep-dive/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-ai-for-all-levels-3-chatgpt-and-...</description><pubDate>Mon, 01 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-ai-for-all-levels-3-chatgpt-and-claude-deep-dive/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Part 2 of this series covered the extensive free AI tools Google offers through your Gmail account. Now we step up to the paid tier. ChatGPT from OpenAI and Claude from Anthropic are the two most popular paid AI assistants in 2026. Between them, they handle the vast majority of AI interactions worldwide.&lt;/p&gt;
&lt;p&gt;This is Part 3 of &amp;quot;Catching Up with Using AI for All Levels.&amp;quot; If you are just joining, start with Part 1 for the fundamentals of how AI works and Part 2 for the free tools. This post covers the paid side: what each service charges, what you get at each tier, the desktop apps and advanced features, and practical examples of how to use everything for daily productivity.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-1-what-ai-is-and-isnt/&quot;&gt;Return to Part 1: What AI Is and Isnt&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-2-getting-started-for-free/&quot;&gt;Return to Part 2: Getting Started for Free&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-4-specialized-ai-tools/&quot;&gt;Skip to Part 4: Specialized AI Tools for Creation&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-5-going-advanced/&quot;&gt;Skip to Part 5: Going Advanced: Open Source, Local Models, and Agent Tools&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Two Giants: ChatGPT and Claude&lt;/h2&gt;
&lt;p&gt;OpenAI launched ChatGPT in November 2022 and sparked the current AI boom. Anthropic launched Claude shortly after, positioning it as a safety focused alternative. In 2026, both companies have mature products with overlapping features and distinct strengths.&lt;/p&gt;
&lt;p&gt;The choice between them is not about which one is &amp;quot;better.&amp;quot; It is about which one fits your specific needs. They are more similar than different at the basic level, but their advanced features diverge significantly.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;ChatGPT: Features and Pricing&lt;/h2&gt;
&lt;h3&gt;Free Tier&lt;/h3&gt;
&lt;p&gt;ChatGPT&apos;s free tier is more restrictive than Google&apos;s free offerings. You get access to GPT 5.0 Mini with text chat, limited file uploads, and basic image generation through DALL E integration. The free tier uses an older, smaller model and has rate limits that slow down during peak usage.&lt;/p&gt;
&lt;p&gt;You can use it for simple tasks: answering questions, drafting short text, basic brainstorming. But the free tier is designed to give you a taste, not a full experience. Most users hit the limits within a few sessions and consider upgrading.&lt;/p&gt;
&lt;h3&gt;ChatGPT Plus: $20 per Month&lt;/h3&gt;
&lt;p&gt;The $20 tier is where ChatGPT becomes genuinely useful. Here is what you get.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Access to GPT 5 series models.&lt;/strong&gt; GPT 5.1 and GPT 5.2 are the standard models available to Plus users. They are significantly more capable than the free tier model: better reasoning, longer context windows, better instruction following. GPT 5.4, the latest flagship model released in early 2026, is also available to Plus users with usage caps. You get a certain number of messages per day on the flagship model before it reverts to 5.2 for the rest of the day.&lt;/p&gt;
&lt;p&gt;The practical difference between the models is noticeable. GPT 5.4 handles complex multi step instructions more reliably, writes more coherent long form content, and makes fewer logical errors. For quick questions and simple tasks, 5.2 performs almost as well. The tiered model access means you save your limited flagship messages for the hardest tasks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Voice conversations.&lt;/strong&gt; The advanced voice mode on the mobile app supports real time conversation with emotional range. The model detects your tone and adjusts its response accordingly. You can interrupt it mid sentence. It laughs at appropriate moments. The voice quality is good enough for extended conversations, and the latency is low enough that the conversation feels natural.&lt;/p&gt;
&lt;p&gt;The voice mode has become a primary interface for many users. They dictate emails, brainstorm ideas verbally, practice presentations, and have the model read documents aloud. It is surprisingly effective for tasks where typing is inconvenient.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Image generation.&lt;/strong&gt; Built on DALL E 3 and the newer DALL E 4 model. You can generate images from text descriptions, edit existing images, and create variations. The quality is competitive with dedicated image tools for many use cases, though specialized tools like Midjourney still lead for artistic work.&lt;/p&gt;
&lt;p&gt;The image generation is deeply integrated into the chat interface. You can generate an image, discuss it with the model, request changes, and iterate without switching applications. This tight feedback loop is more efficient than using separate tools for conversation and image generation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;File uploads and data analysis.&lt;/strong&gt; Upload PDFs, Word documents, Excel spreadsheets, and images. ChatGPT reads the content and can analyze it. You can ask it to find patterns in a spreadsheet, summarize a long PDF, or extract data from images. This feature works well enough for everyday analysis but hits limits with very large files or complex multi sheet workbooks.&lt;/p&gt;
&lt;p&gt;One underappreciated use: upload a messy CSV file and ask ChatGPT to clean it. &amp;quot;Remove duplicate rows. Standardize the date column to ISO format. Flag any rows with missing values.&amp;quot; ChatGPT processes the data and gives you a cleaned version. For non technical users, this replaces a dozen spreadsheet formulas.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Web browsing.&lt;/strong&gt; ChatGPT can search the internet for current information when you enable the browsing feature. It cites its sources. This is useful for research that requires up to date information, though the browsing is slower than using a search engine directly because the model needs to search, read, and synthesize in sequence.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Custom GPTs.&lt;/strong&gt; Create your own specialized versions of ChatGPT with custom instructions, specific knowledge files, and configured capabilities. The GPT Store offers thousands of community created GPTs for specific tasks: writing assistant, code reviewer, travel planner, fitness coach. Most of these are shallow wrappers, but a well made custom GPT can save significant setup time for recurring tasks.&lt;/p&gt;
&lt;p&gt;The best use of Custom GPTs is creating one for your own recurring workflows. A &amp;quot;Meeting Notes GPT&amp;quot; that always formats output the same way. A &amp;quot;Content Repurposer&amp;quot; that turns any document into social posts, a newsletter summary, and a blog outline. The time saved from not repeating instructions adds up.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Memory.&lt;/strong&gt; ChatGPT Plus includes persistent memory across sessions. You can tell the model facts about yourself once, and it remembers them in future conversations. &amp;quot;I prefer concise answers.&amp;quot; &amp;quot;I work in healthcare compliance.&amp;quot; &amp;quot;My team has seven people.&amp;quot; These details persist and shape future responses without you repeating them.&lt;/p&gt;
&lt;h3&gt;ChatGPT Pro: $200 per Month&lt;/h3&gt;
&lt;p&gt;The $200 tier removes most usage caps. You get unlimited access to GPT 5.4, priority during peak times, higher file upload limits, and priority access to new features. This tier is for power users who depend on ChatGPT for their daily work and hit the Plus tier&apos;s limits regularly.&lt;/p&gt;
&lt;p&gt;The main question to ask yourself: are you hitting the Plus limits more than once a week? If yes, Pro might be worth it. If no, save your money.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Claude: Features and Pricing&lt;/h2&gt;
&lt;h3&gt;Claude Free&lt;/h3&gt;
&lt;p&gt;Claude&apos;s free tier gives you access to Claude Sonnet, the mid tier model. It is more generous than ChatGPT&apos;s free tier in terms of model quality but still has rate limits. You can use it for text chat, file uploads (PDFs, images, documents), and basic analysis. The free tier works for light use: quick questions, document summaries, short writing tasks.&lt;/p&gt;
&lt;h3&gt;Claude Pro: $20 per Month&lt;/h3&gt;
&lt;p&gt;The Pro tier upgrades you to Claude Opus, Anthropic&apos;s best non Max model. Here is what you get.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Claude Opus access.&lt;/strong&gt; Opus is Anthropic&apos;s flagship model, comparable to GPT 5.4 in capability. It excels at reasoning, writing, and analysis. Many users report that Claude Opus produces more naturally flowing, better structured long form writing than ChatGPT. The difference is subjective and task dependent, but it is a real distinction.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;200K token context window.&lt;/strong&gt; Claude&apos;s context window has been a differentiator since its early days. 200,000 tokens is roughly 150,000 words, or about 300 pages of text. You can paste an entire novel into a single prompt and ask questions about it. This is useful for analyzing large documents, comparing multiple files, or working with long codebases.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Projects.&lt;/strong&gt; A Claude feature that lets you organize conversations, documents, and custom instructions into dedicated workspaces. Each project can have its own knowledge base (uploaded documents), custom instructions, and conversation history. This is useful for ongoing work on a specific topic: a project for your book research, another for your job search, another for your side project. The projects feature is one of Claude&apos;s strongest organizational tools and something ChatGPT does not directly replicate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Claude Mobile App.&lt;/strong&gt; Full featured mobile app with voice input, image upload, and conversation sync across devices. The mobile experience is similar to ChatGPT&apos;s.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Claude Code.&lt;/strong&gt; Claude&apos;s coding assistant that runs in your terminal. It can read and write files, execute commands, and manage entire codebases. Claude Code is more integrated into the developer workflow than ChatGPT&apos;s code features because it operates directly in your terminal environment. It can create files, run your project&apos;s test suite, and fix errors automatically.&lt;/p&gt;
&lt;h3&gt;Claude Max: $200 per Month&lt;/h3&gt;
&lt;p&gt;Claude Max is Anthropic&apos;s top tier, and this is where Claude pulls ahead of ChatGPT for power users. Here is what makes it different.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5x to 20x higher usage limits.&lt;/strong&gt; You can use Claude Opus essentially without worrying about rate limits. For heavy daily users, this is the main reason to upgrade. Instead of rationing your messages, you use Claude freely throughout the day for every task, question, and document analysis.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Claude Cowork.&lt;/strong&gt; This is the desktop agent feature. Claude Cowork runs on your computer and can see your screen, interact with applications, read and write files, and perform multi step tasks. It is like having an assistant that can actually use your computer.&lt;/p&gt;
&lt;p&gt;The practical difference from a regular chatbot is enormous. Normal chatbots can only process text you give them. Cowork can see your screen. You can ask it questions about anything visible on your monitor. &amp;quot;Which Slack message from the engineering channel is unresolved?&amp;quot; &amp;quot;What is the stock price shown in the browser tab on the left?&amp;quot; It reads the information directly from your screen.&lt;/p&gt;
&lt;p&gt;Cowork can also take actions. &amp;quot;Open the Chrome browser, navigate to my analytics dashboard, take a screenshot of the weekly active users chart, and save it to the Desktop.&amp;quot; Claude opens Chrome, types the URL, waits for the page to load, captures the screenshot, and saves it. No shortcuts, no automation scripts, no manual steps.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dispatch.&lt;/strong&gt; Claude Dispatch is a remote control feature that connects the Claude mobile app to the Claude Desktop app. You send a task from your phone, and Claude executes it on your desktop computer. The desktop does the actual work: opening files, running commands, interacting with applications. Your phone just provides the instructions and receives the results.&lt;/p&gt;
&lt;p&gt;Dispatch is not a remote desktop. You do not see your desktop screen on your phone. You send a text instruction, and Claude sends back the results. It is asynchronous task delegation. You can check on progress from your phone, ask follow up questions, or send additional instructions to the same session.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Computer Use.&lt;/strong&gt; Claude can control your mouse and keyboard to interact with software just like a human would. It can open your browser, navigate to a website, fill out a form, and submit it. It can open your email client, compose a message, and send it. It can open your spreadsheet, enter data, and save the file.&lt;/p&gt;
&lt;p&gt;Computer Use works through screenshots and coordinate based clicking. Claude sees what is on your screen, decides what to click, and simulates the mouse movement and click. It is slower than a human for simple tasks but much faster for repetitive data entry or multi step workflows that require switching between applications.&lt;/p&gt;
&lt;p&gt;The current limitations: Computer Use can struggle with applications that have unusual interfaces, custom UI components, or pages that render differently on different screen sizes. It works best with standard web applications and common desktop software.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Clips.&lt;/strong&gt; Clips are reusable automation templates built on top of Cowork and Dispatch. You create a Clip once by demonstrating the workflow. The Clip saves the sequence of steps. You can trigger it on demand or on a schedule.&lt;/p&gt;
&lt;p&gt;Common Clip examples: &amp;quot;Every morning, open my inbox, find emails from my direct reports, summarize any requests, and save the summary to my Desktop.&amp;quot; &amp;quot;Before each client meeting, open the CRM, find the account history, open the latest proposal, and compile a briefing document.&amp;quot; &amp;quot;Every Friday at 4 PM, check my task list for overdue items, open the related files, and report on status.&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Priority access.&lt;/strong&gt; New features arrive first for Max subscribers. During high traffic periods, Max users get priority compute. For time sensitive work, this matters.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Desktop Apps: Beyond the Web Chat Interface&lt;/h2&gt;
&lt;p&gt;Both ChatGPT and Claude have desktop applications that go beyond the web interface. This is where the real productivity gains live.&lt;/p&gt;
&lt;h3&gt;ChatGPT Desktop App&lt;/h3&gt;
&lt;p&gt;The ChatGPT desktop app (available for macOS and Windows) provides a persistent chat window that stays on top of your work. Key features include:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Voice mode.&lt;/strong&gt; Hands free conversation with the model. You can dictate prompts and hear responses spoken aloud. Useful when cooking, driving, or doing tasks that occupy your hands.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Screen capture.&lt;/strong&gt; Take a screenshot of anything on your screen and send it directly to the chat. ChatGPT analyzes the image and responds. Use this for: &amp;quot;Explain this error message,&amp;quot; &amp;quot;Turn this chart into a table,&amp;quot; &amp;quot;Read this article and summarize it.&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;App integration.&lt;/strong&gt; The desktop app integrates with other applications on your computer. You can select text in any app, press a shortcut, and have ChatGPT process it. This works with browsers, email clients, text editors, and most other software.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;File access.&lt;/strong&gt; Open local files from your computer directly in the chat. No need to upload through a web interface.&lt;/p&gt;
&lt;h3&gt;Claude Desktop App&lt;/h3&gt;
&lt;p&gt;The Claude desktop app goes further than ChatGPT&apos;s because of the Cowork and Dispatch features. Key features include:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cowork mode.&lt;/strong&gt; Claude sees your screen and can interact with your applications. You can say &amp;quot;Open the spreadsheet in my Downloads folder, find the row where revenue dropped more than 10%, and explain what the data shows about that month.&amp;quot; Claude opens the file, reads it, finds the relevant data, and gives you an analysis.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Clips.&lt;/strong&gt; Clips are reusable automation templates. You record a workflow once as a Clip, and Claude replays it later. Common Clips include: &amp;quot;Summarize my morning emails,&amp;quot; &amp;quot;Create a meeting brief from the last three Slack messages and the calendar event,&amp;quot; &amp;quot;Find all unpaid invoices from last month and compile them into a report.&amp;quot; Once created, you can trigger a Clip from the mobile app via Dispatch.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dispatch remote control.&lt;/strong&gt; This is Claude&apos;s signature feature in 2026. Here is how it works in practice. You are on the train commuting to work. You open the Claude mobile app and type: &amp;quot;For my 9 AM meeting, find the proposal draft in my Google Drive, check if the pricing section was updated, and put the latest version on my desktop so I can review it when I arrive.&amp;quot; Dispatch sends this task to your desktop at home or office. Claude Cowork opens Google Drive, finds the file, checks the version history, and places the document on your desktop. When you arrive, everything is ready.&lt;/p&gt;
&lt;p&gt;Dispatch works for longer running tasks too. Start a research project from your phone while waiting in line. Dispatch triggers Claude to start the work on your desktop. By the time you sit down, the research is done and the results are waiting.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The always on laptop requirement.&lt;/strong&gt; Dispatch requires your desktop to be awake, unlocked, and running the Claude Desktop app. If your computer goes to sleep, Dispatch fails. Users typically set their computer to never sleep when they plan to use Dispatch remotely. This is a practical consideration. Closing your laptop at the end of the day means Dispatch stops working. Using a desktop Mac mini or keeping a work laptop powered on solves this, but it adds a power and security consideration.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Practical Daily Productivity Examples&lt;/h2&gt;
&lt;p&gt;Here are specific workflows using the paid features of ChatGPT and Claude.&lt;/p&gt;
&lt;h3&gt;Email Management (Claude Max Using Dispatch)&lt;/h3&gt;
&lt;p&gt;Set up a recurring Cowork task that runs every morning at 7 AM. Claude opens your email, reads the last 24 hours of messages, and summarizes them into a briefing document on your desktop. While you commute, check the summary on your phone. Dispatch any follow ups: &amp;quot;Reply to Sarah confirming the meeting time&amp;quot; or &amp;quot;Save the attached contract to the Projects folder.&amp;quot;&lt;/p&gt;
&lt;p&gt;The time savings compound. A single email triage session that used to take 15 minutes now takes 2 minutes. Over a month, that is over four hours saved.&lt;/p&gt;
&lt;h3&gt;Meeting Preparation (ChatGPT Pro)&lt;/h3&gt;
&lt;p&gt;Before a meeting, gather the relevant materials: the calendar invite, the previous meeting notes, any documents shared in the thread. Drag them all into the ChatGPT desktop app. Ask: &amp;quot;Summarize the key decisions from our last meeting. List the action items that were due today. Based on this agenda, what are the three most important topics we need to discuss?&amp;quot;&lt;/p&gt;
&lt;p&gt;The chat persists, so you can follow up during the meeting: &amp;quot;Take notes on this discussion and compare them to the agenda.&amp;quot; After the meeting: &amp;quot;Draft the meeting summary email based on our discussion.&amp;quot;&lt;/p&gt;
&lt;h3&gt;Document Comparison (Claude Pro for the 200K Context)&lt;/h3&gt;
&lt;p&gt;You have three vendor proposals, each 40 pages. Upload all three to a single Claude Pro chat. Ask: &amp;quot;Compare these three proposals on pricing, delivery timeline, warranty terms, and cancellation policy. Create a comparison table ranked by total cost of ownership over three years.&amp;quot;&lt;/p&gt;
&lt;p&gt;The 200K token context handles all three documents in a single conversation. You can ask follow ups about specific sections: &amp;quot;In the second proposal, what is the penalty for late delivery?&amp;quot; &amp;quot;Which proposal has the most favorable payment terms?&amp;quot;&lt;/p&gt;
&lt;h3&gt;Research and Content Drafting (Either Service, Both Work)&lt;/h3&gt;
&lt;p&gt;Use Claude Pro&apos;s Projects or ChatGPT&apos;s Custom GPTs to set up a dedicated workspace for a writing project. Upload your research documents, outline, and style guide. Each time you sit down to write, the project context is already loaded. You do not need to re explain your requirements.&lt;/p&gt;
&lt;p&gt;For long form content, use Claude&apos;s Projects with custom instructions. Set the instructions to match your writing style: &amp;quot;Write in short paragraphs. Use active voice. Start with a concrete example. Avoid listing features without explaining the benefit.&amp;quot; The project remembers these instructions across sessions.&lt;/p&gt;
&lt;h3&gt;Personal Knowledge Management (ChatGPT Memory)&lt;/h3&gt;
&lt;p&gt;Use ChatGPT&apos;s Memory feature to build a persistent knowledge base about your work and life. Save facts once and reference them later. Tell the model your preferred communication style, the names of your team members, the projects you are working on, and your weekly schedule.&lt;/p&gt;
&lt;p&gt;Over time, ChatGPT builds a profile that makes its responses more relevant to your specific situation. When you ask for meeting agenda suggestions, it knows who attends your meetings and what topics are active. When you ask for help drafting an email, it knows the recipient&apos;s context from previous conversations.&lt;/p&gt;
&lt;p&gt;The tradeoff is privacy. The memory persists across sessions, which means you are trusting OpenAI with personal and professional information. If that makes you uncomfortable, disable the memory feature and use a manual approach instead.&lt;/p&gt;
&lt;h3&gt;Expense and Receipt Processing (Claude Max Using Computer Use)&lt;/h3&gt;
&lt;p&gt;Take a photo of a stack of receipts. Send it to Claude via Dispatch. Claude&apos;s desktop agent opens your expense tracking spreadsheet or app, reads each receipt, categorizes the expenses, and enters the data. You review the results on your phone and approve the entries.&lt;/p&gt;
&lt;p&gt;This workflow turns a tedious 20 minute weekly task into a 30 second photo and approval step. The accuracy depends on receipt quality and handwriting legibility. Blurry or folded receipts may need manual correction.&lt;/p&gt;
&lt;h3&gt;Data Extraction from Screenshots (ChatGPT Desktop)&lt;/h3&gt;
&lt;p&gt;Take a screenshot of a table in a PDF that does not allow copy paste. Drag the screenshot into the ChatGPT desktop app. Ask: &amp;quot;Convert this table into a CSV format I can paste into Excel.&amp;quot; ChatGPT reads the image, extracts the data, and formats it correctly. This works for printed tables, screenshots of dashboards, and images of documents.&lt;/p&gt;
&lt;h3&gt;Schedule Management (Claude Pro + Calendar Integration)&lt;/h3&gt;
&lt;p&gt;Use Claude&apos;s integration with your calendar to plan your day. &amp;quot;What does my calendar look like this week? Identify the three most important meetings and suggest preparation steps for each. Flag any scheduling conflicts.&amp;quot; Claude reads your calendar, analyzes the events, and gives you a structured overview.&lt;/p&gt;
&lt;p&gt;For recurring meetings, set up a Project that contains the meeting context, attendee list, and agenda template. Before each meeting, ask Claude to review the previous meeting&apos;s notes from the project and draft the agenda for the current one.&lt;/p&gt;
&lt;h3&gt;Code Assistance (Claude Code or ChatGPT)&lt;/h3&gt;
&lt;p&gt;Both services handle code well, but they approach it differently. ChatGPT&apos;s code analysis works through the chat interface. You paste code and ask for changes. Claude Code runs in your terminal and can make changes directly to your files.&lt;/p&gt;
&lt;p&gt;For non developers, Claude Code is less relevant. But ChatGPT&apos;s code features still help: ask it to write a formula for your spreadsheet, create a simple script to rename files, or explain what a line of code in a document does.&lt;/p&gt;
&lt;h3&gt;Voice to Document (Claude Max)&lt;/h3&gt;
&lt;p&gt;Use Claude&apos;s voice mode to dictate a rough draft of a document while walking or commuting. Dispatch sends the audio transcript to your desktop, where Claude Cowork formats it into a proper document with sections, headers, and formatting. When you get to your desk, the draft is ready to edit.&lt;/p&gt;
&lt;p&gt;This workflow is particularly good for first drafts. Dictating removes the friction of staring at a blank page. The content will be rough, but rough content is much easier to edit than to create from nothing.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Which One Should You Choose&lt;/h2&gt;
&lt;p&gt;The honest answer depends on what you need.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Choose ChatGPT Plus ($20) if you want:&lt;/strong&gt; The broadest set of features at the standard price point, including image generation, web browsing, and the GPT Store. ChatGPT&apos;s ecosystem is larger than Claude&apos;s, with more third party integrations and community created tools. If you want one service that does a bit of everything, ChatGPT is the safe pick.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Choose Claude Pro ($20) if you want:&lt;/strong&gt; Better long form writing, the 200K context window for working with large documents, and the Projects organizational system. Claude tends to produce more natural prose with less prompting, which matters if you do a lot of writing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Choose ChatGPT Pro ($200) if:&lt;/strong&gt; You depend on ChatGPT for daily work and consistently hit the Plus limits. You use the API heavily. You want priority access to everything.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Choose Claude Max ($200) if:&lt;/strong&gt; You want Dispatch, Cowork, and Computer Use. These features are unique to Claude and fundamentally change how you interact with AI. The ability to send tasks from your phone and have them executed on your desktop is a genuine productivity multiplier that no other service offers at this level. If you work across multiple devices and want AI that follows you between them, Claude Max is the clear winner.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Get both if:&lt;/strong&gt; You have a budget for it and the tasks justify the cost. Many power users maintain subscriptions to both. They use ChatGPT for quick queries, image generation, and web research. They use Claude for writing, document analysis, and remote tasks through Dispatch. The combined $40/month (or $400 if you go Max and Pro) covers almost every AI use case.&lt;/p&gt;
&lt;p&gt;But the honest truth is that most people do not need to pay $200 a month for either service. Start with one $20 subscription. Use it for a month. If you hit limits or want features you do not have, consider upgrading or adding the second service.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;What Comes Next&lt;/h2&gt;
&lt;p&gt;Part 4 of this series moves beyond text assistants into specialized AI tools. We will cover music generation with Suno and Udio, video creation with Veo and Runway, image generation with Midjourney and DALL E, and how these creative tools can fit into your daily productivity workflow.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-1-what-ai-is-and-isnt/&quot;&gt;Return to Part 1: What AI Is and Isnt&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-2-getting-started-for-free/&quot;&gt;Return to Part 2: Getting Started for Free&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-4-specialized-ai-tools/&quot;&gt;Continue to Part 4: Specialized AI Tools for Creation&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-5-going-advanced/&quot;&gt;Skip to Part 5: Going Advanced: Open Source, Local Models, and Agent Tools&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Tour of Specialized AI Tools: Music, Video, Images, and More</title><link>https://iceberglakehouse.com/posts/ai-for-all-levels-june-1-4-specialized-ai-tools/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/ai-for-all-levels-june-1-4-specialized-ai-tools/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-ai-for-all-levels-4-specialized-...</description><pubDate>Mon, 01 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-ai-for-all-levels-4-specialized-ai-tools/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The first three parts of this series covered general purpose AI assistants: the chatbots and writing tools that handle text based tasks. But AI in 2026 extends far beyond chat windows. A whole ecosystem of specialized tools creates original music, generates cinematic video, produces professional images, and designs presentations.&lt;/p&gt;
&lt;p&gt;This is Part 4 of &amp;quot;Catching Up with Using AI for All Levels.&amp;quot; If you are new here, start with Part 1 for the fundamentals and Part 2 for the free tools, then Part 3 for ChatGPT and Claude. This post covers the creative side: what the tools are, what they cost, how good the output actually is, and when they make sense for daily productivity rather than just artistic projects.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-1-what-ai-is-and-isnt/&quot;&gt;Return to Part 1: What AI Is and Isnt&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-2-getting-started-for-free/&quot;&gt;Return to Part 2: Getting Started for Free&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-3-chatgpt-and-claude-deep-dive/&quot;&gt;Return to Part 3: ChatGPT and Claude Deep Dive&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-5-going-advanced/&quot;&gt;Skip to Part 5: Going Advanced: Open Source, Local Models, and Agent Tools&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Image Generation: The Mature Category&lt;/h2&gt;
&lt;p&gt;Image generation is the most mature of the creative AI categories. It started with DALL E 2 in 2022 and has grown into a competitive market with multiple strong options at various price points. The quality has improved to the point where AI generated images are used professionally in marketing, publishing, and product design.&lt;/p&gt;
&lt;h3&gt;Midjourney&lt;/h3&gt;
&lt;p&gt;Midjourney remains the gold standard for artistic quality. It operates through Discord, which is both its strength and its biggest friction point. You join the Midjourney Discord server, type your prompt in a channel, and the bot generates images in the thread.&lt;/p&gt;
&lt;p&gt;The Discord interface has improved over time. The bot now supports private messaging, so you do not need to share your generations with strangers. The web gallery is well organized. But the overall experience still feels like a workaround for the lack of a native application.&lt;/p&gt;
&lt;p&gt;Midjourney&apos;s output quality is excellent. The model understands composition, lighting, color theory, and artistic style better than any competitor. Its strength is producing images that look like professional photography or illustration work. If you need a photorealistic product shot in a specific lighting setup, Midjourney delivers. Its weakness is that it struggles with precise text rendering and specific brand requirements. Do not ask Midjourney to generate an image with a specific word or logo displayed clearly. It will get close but not exact.&lt;/p&gt;
&lt;p&gt;Pricing starts at $10 per month for Basic (3 hours of GPU time, roughly 200 images) and goes up to $60 per month for Mega (60 hours of GPU time). The Standard plan at $30 per month is the sweet spot for regular users.&lt;/p&gt;
&lt;h3&gt;DALL E 4 (via ChatGPT)&lt;/h3&gt;
&lt;p&gt;DALL E 4 is OpenAI&apos;s latest image generation model, available through ChatGPT Plus ($20/month). It has improved significantly over DALL E 3, with better prompt adherence, more consistent anatomy, and improved text rendering. DALL E 4 can render short words and phrases legibly in images, a longstanding weakness of earlier AI image generators.&lt;/p&gt;
&lt;p&gt;DALL E 4&apos;s main advantage is integration. Because it lives inside ChatGPT, you can iterate naturally. Generate an image, discuss it with the model, request changes, and generate the next version, all in one conversation. This tight feedback loop makes it the most efficient image generator for most workflows, even if Midjourney produces better standalone results.&lt;/p&gt;
&lt;p&gt;For productivity, the ChatGPT integration is the killer feature. You can be drafting a presentation slide in ChatGPT, generate an accompanying image in the same conversation, and get suggestions for how to arrange both on the slide. No switching between tools. No copying prompts between applications.&lt;/p&gt;
&lt;h3&gt;Flux and Stability AI&lt;/h3&gt;
&lt;p&gt;Flux, created by Black Forest Labs (a team of former Stability AI researchers), has emerged as a strong open weights competitor. Flux Pro is available through Fireworks AI and other providers. It competes with Midjourney on quality while being accessible through developer friendly APIs. Flux is available in several variants: Flux Pro for highest quality, Flux Dev for faster generation, and Flux Schnell for rapid prototyping.&lt;/p&gt;
&lt;p&gt;The open weights nature of Flux means you can run it on your own hardware or through any provider that hosts it. This flexibility makes it popular with developers who want to integrate image generation into their own applications without per image API fees from a single vendor.&lt;/p&gt;
&lt;p&gt;Stability AI continues to develop Stable Diffusion, the open source image generation model. Stable Diffusion 4 is available in 2026 with strong quality and the advantage of running locally on consumer GPUs. For users who want privacy, offline access, and no subscription fees, Stable Diffusion remains the best option. The tradeoff is that running it locally requires a capable GPU and some technical setup.&lt;/p&gt;
&lt;h3&gt;Nano Banana&lt;/h3&gt;
&lt;p&gt;Nano Banana has gained attention in 2026 as a new contender. It produces high quality images with a simple interface and competitive pricing. The Pro version includes upscaling, inpainting, and style transfer. It is worth trying alongside Midjourney and DALL E to see which style fits your needs. Nano Banana&apos;s strength is its ease of use for non technical users who want good results without learning complex prompt engineering.&lt;/p&gt;
&lt;h3&gt;Who Should Use Image Generation&lt;/h3&gt;
&lt;p&gt;The most practical productivity use is creating visuals for presentations, social media, and internal documents. Instead of spending 30 minutes searching stock photo sites for the right image, you generate exactly what you need in 30 seconds.&lt;/p&gt;
&lt;p&gt;Business use cases include: product mockups for proposals, custom illustrations for blog posts, branded social media graphics, concept visualizations for client presentations, and placeholder images for website designs.&lt;/p&gt;
&lt;p&gt;The quality is good enough for professional use in most contexts, but you should still use real photography for anything that represents an actual product, person, or location. AI generated images have subtle tells that trained eyes notice.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Video Generation: The Fastest Evolving Category&lt;/h2&gt;
&lt;p&gt;Video generation has progressed faster than any other AI category in the last 18 months. We went from short, glitchy clips to multi minute videos with consistent characters, coherent motion, and usable quality.&lt;/p&gt;
&lt;h3&gt;Veo 3.1 (Google)&lt;/h3&gt;
&lt;p&gt;Veo 3.1 is widely considered the best overall AI video generator in 2026. It produces high resolution videos with strong prompt adherence, consistent character appearance across frames, and minimal artifacts. The improvement from Veo 2 to Veo 3.1 was dramatic, with much better motion coherence and fewer of the warping artifacts that plagued earlier video models.&lt;/p&gt;
&lt;p&gt;Veo is available through Google AI Studio with a free tier that lets you generate short clips for testing. Full access requires a Google AI Premium ($20/month) or AI Ultra ($100/month) subscription. The free tier is generous enough for experimentation and light use, making Veo the most accessible high quality video generator.&lt;/p&gt;
&lt;p&gt;Veo excels at text to video: describe a scene and get a video clip. It also supports image to video: upload an image and animate it. The image to video feature is particularly useful for creating short animations from still photographs or illustrations. You can take a product photo and generate a slow orbit around it, turning a static image into a dynamic product showcase.&lt;/p&gt;
&lt;p&gt;Veo&apos;s main limitation is creative control. You describe what you want and accept what you get. There is no way to fine tune the motion, adjust camera angles, or edit specific frames. For one shot generation where speed matters more than precision, Veo is the best choice.&lt;/p&gt;
&lt;p&gt;For productivity, Veo is useful for creating short explainer videos, social media content, and presentation clips. A 15 second product demo video that used to take a day to produce can now be generated in minutes. The quality is good enough for social media and internal use, though not yet at broadcast quality.&lt;/p&gt;
&lt;h3&gt;Runway Gen 4&lt;/h3&gt;
&lt;p&gt;Runway is the veteran of AI video generation, having launched Gen 1 in early 2023. Gen 4, released in late 2025, offers the most creative control of any video generator. It includes features like Motion Brush (paint movement onto specific areas of an image), Act One (transfer facial expressions from a reference video), and inpainting (edit specific areas of a generated video).&lt;/p&gt;
&lt;p&gt;The Motion Brush is Runway&apos;s standout feature. You upload an image, paint a brush stroke across the area you want to animate, and define the direction and speed of movement. Want smoke rising from a chimney in a still photo? Paint the chimney area and set the motion upward. Want water flowing in a river? Paint the river surface and set the flow direction. This level of granular control is unique to Runway.&lt;/p&gt;
&lt;p&gt;Runway&apos;s pricing starts at $15 per month for the Standard plan with limited credits (enough for roughly 50 short generations). The Pro plan at $35 per month gives more credits and higher resolution output. For heavy use, the Unlimited plan at $95 per month removes credit limits. The pricing is higher than Veo&apos;s bundled cost, but the additional control justifies the premium for professional use.&lt;/p&gt;
&lt;p&gt;Runway is the best choice when you need precise control over the output. If you need a specific camera movement, a particular character action, or an edit to an existing generation, Runway gives you the tools to iterate. Veo is better for one shot generation where you describe what you want and accept the result.&lt;/p&gt;
&lt;h3&gt;Kling AI&lt;/h3&gt;
&lt;p&gt;Kling, developed by Kuaishou (the company behind a major Chinese video platform), has emerged as a strong competitor. It offers high quality video generation at competitive prices, with particularly good results for character animation and cinematic shots.&lt;/p&gt;
&lt;p&gt;Kling uses a credit system with free trial credits and paid packs starting around $10. The quality is comparable to Veo and Runway for many use cases, though it lags slightly on text rendering and complex scene composition.&lt;/p&gt;
&lt;h3&gt;Who Should Use Video Generation&lt;/h3&gt;
&lt;p&gt;Video generation is still more of a content creation tool than a daily productivity tool for most people. The practical use cases are concentrated in marketing, content creation, education, and internal communication.&lt;/p&gt;
&lt;p&gt;A non obvious productivity use: creating quick tutorial videos for your team. Instead of writing a 3 page document explaining how to use a new process, generate a 60 second video walkthrough. The video is easier to consume and more likely to be watched than a document to be read.&lt;/p&gt;
&lt;p&gt;The current limitations are real. Videos longer than 30 seconds still struggle with consistency. Characters in the first frame may change appearance by the tenth frame. Complex action sequences produce artifacts. Text rendering in video is unreliable. You should budget time for multiple attempts and manual editing to get a usable result.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Music Generation: From Novelty to Useful&lt;/h2&gt;
&lt;p&gt;AI music generation has come into its own in 2026. The tools produce genuinely listenable songs with vocals, multiple instruments, and coherent structure.&lt;/p&gt;
&lt;h3&gt;Suno&lt;/h3&gt;
&lt;p&gt;Suno is the leading AI music generator. It generates complete songs with lyrics, vocals, and instrumentation from a text prompt. You describe the genre, mood, and subject, and Suno produces a full track with verses, choruses, and a bridge.&lt;/p&gt;
&lt;p&gt;Suno&apos;s free tier gives you a limited number of generations per day, enough for experimentation. The Pro plan at $10 per month gives 500 generations and commercial usage rights. The Premier plan at $30 per month gives 2,000 generations and priority processing.&lt;/p&gt;
&lt;p&gt;The output quality varies by genre. Pop, rock, electronic, and hip hop work well. Classical and jazz are less convincing. The vocals sound synthetic on close listening but pass for casual listening in the background. The instrumental quality is generally strong.&lt;/p&gt;
&lt;p&gt;The most practical productivity use for Suno is creating custom background music for videos, presentations, and internal content. Instead of searching royalty free music libraries for the right track, you generate a track that matches your specific needs: &amp;quot;Upbeat electronic background music, 120 BPM, no vocals, suitable for a tech product demo.&amp;quot;&lt;/p&gt;
&lt;h3&gt;Udio&lt;/h3&gt;
&lt;p&gt;Udio is Suno&apos;s primary competitor. It produces comparable quality with a slightly different emphasis. Udio excels at creative exploration: you can remix existing songs, extend specific sections, and edit the structure more granularly than Suno allows.&lt;/p&gt;
&lt;p&gt;Udio&apos;s pricing is similar to Suno&apos;s, with a free tier and paid plans starting at $10 per month. The choice between Suno and Udio comes down to personal preference for the output style and the editing workflow you prefer.&lt;/p&gt;
&lt;h3&gt;AIVA&lt;/h3&gt;
&lt;p&gt;AIVA specializes in orchestral and cinematic music. If you need a string quartet arrangement, a film score style piece, or ambient orchestral background music, AIVA produces the most convincing results in this niche.&lt;/p&gt;
&lt;p&gt;AIVA has a free tier for limited generations. Paid plans start at $15 per month for higher quality output and commercial rights. It is less versatile than Suno or Udio but better within its niche.&lt;/p&gt;
&lt;h3&gt;Who Should Use Music Generation&lt;/h3&gt;
&lt;p&gt;Music generation is the most situational of the creative AI tools. If you create any kind of video content, presentations, podcasts, or social media posts, generating custom background music saves time and avoids copyright issues.&lt;/p&gt;
&lt;p&gt;The hidden productivity use is inspiration and mood setting. Generate a few short musical pieces for a creative project and use them as background while you work. The music sets a tone that helps you get into the right mental state for the task.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Presentation and Design Tools&lt;/h2&gt;
&lt;p&gt;Beyond the obvious image, video, and music generators, a category of AI tools focuses specifically on business productivity tasks.&lt;/p&gt;
&lt;h3&gt;Gamma&lt;/h3&gt;
&lt;p&gt;Gamma creates presentations, documents, and web pages from a single prompt. You describe what you need, and Gamma generates a complete deck with text, images, and layout. The output is good enough for internal presentations and early stage client work.&lt;/p&gt;
&lt;p&gt;Gamma&apos;s free tier allows a limited number of generations. The Pro plan at $16 per month removes most limits and adds higher resolution exports.&lt;/p&gt;
&lt;p&gt;The productivity gain is significant for anyone who regularly creates presentations. A deck that takes two hours to build manually takes five minutes with Gamma. The tradeoff is that the output looks like an AI generated deck: competent but generic. Gamma is best for first drafts that you then customize with your own branding and specific content.&lt;/p&gt;
&lt;h3&gt;Beautiful AI&lt;/h3&gt;
&lt;p&gt;Beautiful AI predates the current generative AI wave. It uses AI for layout and design recommendations rather than content generation. You add text and images manually, and the AI arranges them into professional looking slides.&lt;/p&gt;
&lt;p&gt;Beautiful AI complements Gamma well. Use Gamma to generate the first draft, then import it into Beautiful AI for layout refinement. The combination covers both content generation and visual polish.&lt;/p&gt;
&lt;h3&gt;Canva AI&lt;/h3&gt;
&lt;p&gt;Canva has integrated AI features across its entire platform. Magic Design generates complete designs from text prompts. Magic Eraser removes unwanted objects from images. Magic Expand extends image boundaries. Magic Write generates and edits text.&lt;/p&gt;
&lt;p&gt;Canva&apos;s AI features are available on the free tier with usage limits. The Pro plan at $13 per month removes most limits and adds brand kits, background removal, and premium templates.&lt;/p&gt;
&lt;p&gt;Canva AI is the most practical choice for non designers who need to create visual content regularly. The learning curve is minimal, the output quality is good, and the integrations with social media platforms streamline publishing.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Audio: Transcription, Voice, and Sound&lt;/h2&gt;
&lt;h3&gt;Descript&lt;/h3&gt;
&lt;p&gt;Descript is an audio and video editor with AI transcription at its core. You upload an audio or video file, Descript transcribes it, and you edit the media by editing the text. Delete a sentence from the transcript and the corresponding audio is removed. Change a word and Descript regenerates the audio in the original speaker&apos;s voice.&lt;/p&gt;
&lt;p&gt;The workflow is simple. Import a recording of a meeting, podcast, or voiceover. Descript transcribes everything automatically, usually within a few minutes for a one hour recording. You see the full transcript with speaker labels and timestamps. Edit the transcript as you would edit a document: delete filler words, reorder sections, fix mispronounced terms. The audio and video update automatically to match your text edits.&lt;/p&gt;
&lt;p&gt;Descript also includes AI voice generation (Studio Sound), noise reduction, and filler word removal. The Studio Sound feature analyzes your recording and removes background noise, echo, and room tone. It is good enough to make a recording from a noisy coffee shop sound like it was recorded in a treated studio.&lt;/p&gt;
&lt;p&gt;The free tier covers basic transcription with limited exports. The Pro plan at $24 per month adds screen recording, unlimited transcription, and Studio Sound. The Business plan at $40 per month adds team features and brand voices.&lt;/p&gt;
&lt;p&gt;For productivity, Descript is invaluable for anyone who creates audio or video content. Editing a spoken recording by editing text is dramatically faster than working with audio waveforms. The filler word removal alone saves 20 minutes per hour of recording. For meeting recordings, Descript generates searchable transcripts that let you find any topic discussed in a one hour meeting within seconds.&lt;/p&gt;
&lt;h3&gt;ElevenLabs&lt;/h3&gt;
&lt;p&gt;ElevenLabs is the leading AI voice generation platform. It produces the most natural sounding synthetic voices available, with accurate emotion, pacing, and emphasis. The voice cloning feature lets you create a digital copy of your own voice from a short recording, as little as 30 seconds of audio.&lt;/p&gt;
&lt;p&gt;The quality has improved to the point where short AI generated voice clips are difficult to distinguish from human recordings. Longer passages still have subtle tells: slightly unnatural pacing, odd emphasis on certain words, and a lack of breath sounds at natural intervals. But for most practical purposes, the quality is sufficient.&lt;/p&gt;
&lt;p&gt;ElevenLabs pricing starts at $5 per month for the Starter plan with limited characters (roughly 30 minutes of generated speech). The Creator plan at $22 per month is suitable for regular use with longer character limits. The Pro plan at $99 per month is for high volume commercial use.&lt;/p&gt;
&lt;p&gt;Productivity use cases include: generating voiceovers for videos and presentations, creating audio versions of written content (your blog posts, newsletters, internal memos), adding narration to tutorials and training materials, and producing multilingual versions of existing audio content. ElevenLabs supports 29 languages with good quality across most of them.&lt;/p&gt;
&lt;h3&gt;A Note on Ethics&lt;/h3&gt;
&lt;p&gt;Voice cloning raises obvious ethical concerns. You should only clone a voice with the person&apos;s explicit consent. Using ElevenLabs to impersonate someone without permission is not just unethical. It could be illegal in some jurisdictions. The platform has safety measures in place, including voice authentication and content moderation, but the responsibility ultimately rests with the user.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Adobe Firefly and the Creative Suite Integration&lt;/h3&gt;
&lt;p&gt;Adobe has integrated AI into its Creative Cloud suite through Firefly, its generative AI engine. Photoshop includes Generative Fill and Generative Expand, which let you add or remove elements from an image with text prompts. Illustrator has Generative Recolor and text to vector graphics. Premiere Pro includes text based editing similar to Descript.&lt;/p&gt;
&lt;p&gt;Firefly is notable because it is trained on Adobe Stock images and openly licensed content, which means the output is cleared for commercial use. If you work in marketing, publishing, or any context where copyright ownership matters, Firefly&apos;s training data provenance gives it an advantage over models trained on scraped internet data.&lt;/p&gt;
&lt;p&gt;Firefly is included in existing Creative Cloud subscriptions. Photoshop users with a subscription get a certain number of generative credits per month. Additional credits are available for purchase.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Putting It All Together: A Creative AI Workflow&lt;/h2&gt;
&lt;p&gt;Here is how these tools work together for a real project.&lt;/p&gt;
&lt;p&gt;You need to create a product launch video for a new software feature. Start with Suno to generate background music: &amp;quot;Upbeat electronic track, 90 BPM, no vocals, 60 seconds with a clear crescendo at the 45 second mark.&amp;quot; Download the track.&lt;/p&gt;
&lt;p&gt;Use Midjourney to generate key visual frames: &amp;quot;A person using a laptop with a glowing screen, clean modern office, cinematic lighting, photorealistic.&amp;quot; Select the best images.&lt;/p&gt;
&lt;p&gt;Upload the images to Veo or Runway. Generate short animated clips from each image: &amp;quot;Camera slowly zooming in on the screen.&amp;quot; Combine the clips.&lt;/p&gt;
&lt;p&gt;Use ElevenLabs to generate a voiceover from your script. Import the voiceover, music, and video clips into Descript. Edit by editing the transcript. Fine tune the timing. Export the final video.&lt;/p&gt;
&lt;p&gt;The entire workflow takes two to three hours for a 60 second launch video. The same project with traditional tools would take a full day or more, depending on your skill level with each medium.&lt;/p&gt;
&lt;h3&gt;Internal Training Videos&lt;/h3&gt;
&lt;p&gt;Your company needs a short training video explaining a new expense reporting process. Start with Gamma to generate a presentation deck with the key steps. Use ElevenLabs to generate a voiceover from the deck text. Use Suno to generate background music. Use Runway&apos;s image to video feature to animate any static diagrams. Combine everything in Descript for final editing. Total time: two hours for a five minute training video that would have taken a day and a half with traditional tools.&lt;/p&gt;
&lt;h3&gt;Social Media Content Calendar&lt;/h3&gt;
&lt;p&gt;You manage social media for a small business. Each week you need 5 images, 5 captions, and maybe one short video. Use Midjourney or DALL E to generate consistent branded images. Set a style reference in your prompts to keep visual consistency across posts. Use ChatGPT to draft captions in your brand voice. Use Veo to generate one 10 second video clip per week showcasing a product or service. Use Canva AI to arrange everything into the correct dimensions for each platform. The weekly content that used to take 4 hours now takes 45 minutes.&lt;/p&gt;
&lt;h3&gt;Client Proposal with Visuals&lt;/h3&gt;
&lt;p&gt;You are preparing a client proposal. Write the content in ChatGPT or Claude. Generate relevant diagrams and concept images in DALL E or Midjourney. If the proposal involves a physical product, generate a short Veo animation showing the product from multiple angles. Combine everything in Canva or Gamma for the final presentation. The result is a professional, visually rich proposal that looks like it took days to produce, completed in a few hours.&lt;/p&gt;
&lt;h3&gt;Personal Photo Projects&lt;/h3&gt;
&lt;p&gt;For personal use, the free tiers of these tools cover most needs. Edit family photos with Photoshop&apos;s Generative Fill to remove photobombers or improve composition. Use Google Photos AI to organize and search your library. Use Suno&apos;s free tier to generate a custom song for a friend&apos;s birthday. Use Canva AI to design invitations, cards, and social media posts for personal events.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Real Productivity Question&lt;/h2&gt;
&lt;p&gt;The question to ask about any specialized AI tool is not &amp;quot;Can it generate what I need?&amp;quot; It is &amp;quot;Does it save me more time than it costs?&amp;quot;&lt;/p&gt;
&lt;p&gt;The cost is not just the subscription price. It is the time spent learning the tool, the time spent iterating on prompts to get the output you want, and the time spent fixing problems that the AI introduced.&lt;/p&gt;
&lt;p&gt;For a business user who creates presentations and social media graphics regularly, Canva AI and Gamma are clear wins. The time saved per task is dramatic, and the learning curve is shallow.&lt;/p&gt;
&lt;p&gt;For a content creator who publishes weekly videos, Descript and Runway are worth the investment. The production speed increase pays for the subscriptions many times over.&lt;/p&gt;
&lt;p&gt;For someone who generates music or images recreationally, the free tiers are generous enough that you can explore without commitment. Pay only when you hit the free limits and find yourself wishing for more.&lt;/p&gt;
&lt;p&gt;Part 5 of this series covers the most advanced territory: open source models that run on your own computer, agent frameworks like Hermes Agent, and coding tools like OpenCode. These tools require more setup but offer privacy, offline access, and capabilities that cloud services cannot match.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-1-what-ai-is-and-isnt/&quot;&gt;Return to Part 1: What AI Is and Isnt&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-2-getting-started-for-free/&quot;&gt;Return to Part 2: Getting Started for Free&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-3-chatgpt-and-claude-deep-dive/&quot;&gt;Return to Part 3: ChatGPT and Claude Deep Dive&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-5-going-advanced/&quot;&gt;Continue to Part 5: Going Advanced: Open Source, Local Models, and Agent Tools&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Going Advanced: Open Source Models, Hermes Agent, and Local AI</title><link>https://iceberglakehouse.com/posts/ai-for-all-levels-june-1-5-going-advanced/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/ai-for-all-levels-june-1-5-going-advanced/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-06-ai-for-all-levels-5-going-advanc...</description><pubDate>Mon, 01 Jun 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-06-ai-for-all-levels-5-going-advanced/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is the final installment of &amp;quot;Catching Up with Using AI for All Levels.&amp;quot; Parts 1 through 4 covered the fundamentals, free tools, paid services, and specialized creative tools. This post goes deeper. We will explore the open source ecosystem: models you can download and run on your own computer, agent frameworks that automate complex tasks, and coding tools that work entirely offline.&lt;/p&gt;
&lt;p&gt;This is the most technical post in the series, but do not let that scare you. The tools have matured significantly in 2026. Installing and running a local AI model is easier than it was six months ago, and the benefits are real: privacy, offline access, no subscription fees, and unlimited usage after the initial hardware investment.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-1-what-ai-is-and-isnt/&quot;&gt;Return to Part 1: What AI Is and Isnt&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-2-getting-started-for-free/&quot;&gt;Return to Part 2: Getting Started for Free&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-3-chatgpt-and-claude-deep-dive/&quot;&gt;Return to Part 3: ChatGPT and Claude Deep Dive&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-4-specialized-ai-tools/&quot;&gt;Return to Part 4: Specialized AI Tools for Creation&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Why Run AI on Your Own Hardware&lt;/h2&gt;
&lt;p&gt;Before we get into the tools, it is worth understanding why someone would choose local AI over the convenience of cloud services.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Privacy.&lt;/strong&gt; When you use ChatGPT, Claude, or Gemini, your conversations are processed on someone else&apos;s servers. The companies store and analyze your data. For personal use, this might be acceptable. For business use, especially with sensitive or proprietary information, it is a deal breaker. Local models process everything on your computer. Nothing leaves your machine.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Offline access.&lt;/strong&gt; Cloud services require an internet connection. Local models work anywhere: on a plane, in a remote area, in a secure facility with no external network access. If connectivity is unreliable where you live or work, local AI is the only option.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No subscription fees.&lt;/strong&gt; The cloud services cost $20 or $200 per month. Local models cost nothing to use after you buy the hardware. If you do the math over three years, a $2,000 computer running local models is cheaper than $720 of ChatGPT Plus or $7,200 of Claude Max.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Unlimited usage.&lt;/strong&gt; Cloud subscriptions have rate limits. Pro users hit them regularly. Local models have no rate limits. You can use them as much as you want, as fast as your hardware allows.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Customization.&lt;/strong&gt; Open source models can be fine tuned, modified, and specialized for your specific needs. You can train a model on your own documents. You can adjust its behavior in ways that cloud APIs do not permit.&lt;/p&gt;
&lt;p&gt;The tradeoffs are performance and capability. Local models are slower than cloud models. They are less capable, especially at complex reasoning tasks. A 7 billion parameter model running on a laptop cannot match GPT 5.4 or Claude Opus. But the gap has narrowed significantly, and for many everyday tasks, local models are good enough.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Hardware You Need&lt;/h2&gt;
&lt;p&gt;Running local AI requires a capable computer. The most important component is the GPU (graphics processing unit), because AI inference runs much faster on GPUs than on CPUs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Minimum setup:&lt;/strong&gt; 8GB of RAM and a modern CPU. You can run small models (1 to 3 billion parameters) on CPU alone. They will be slow, taking 10 to 30 seconds per response, but they work. This is enough for basic text generation and summarization. Models like Phi 4 (14B) in 4 bit quantization can run on CPU with acceptable speed for occasional use.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Good setup:&lt;/strong&gt; 16GB of RAM and a GPU with 8GB of VRAM, such as an RTX 3060 or RTX 4060. This handles 7 to 8 billion parameter models comfortably. Response times are 2 to 5 seconds. Most mid range gaming laptops and desktop GPUs from the last few years meet this requirement. This is the sweet spot for most users: you can run Llama 4 8B, Qwen 3.5 7B, or DeepSeek R1 7B with good performance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Great setup:&lt;/strong&gt; 32GB of RAM and a GPU with 16 to 24GB of VRAM, such as an RTX 4090, RTX 5090, or a used RTX 3090. This handles 13 to 30 billion parameter models. Response times are under 2 seconds. Models like DeepSeek R1 14B, Qwen 3.5 32B, and Mistral Small 4 24B run well here.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Overkill:&lt;/strong&gt; 64GB+ RAM and 48GB+ VRAM, such as dual RTX 5090s or professional GPUs like the NVIDIA A series. This handles 70+ billion parameter models, including the largest open weight models like Llama 4 405B or DeepSeek V4 671B (mixture of experts, so only part of the model activates at a time). Response times are comparable to cloud services for many tasks.&lt;/p&gt;
&lt;p&gt;The good news is that model quantization has improved dramatically. Quantization compresses models to use less memory with minimal quality loss. A 70 billion parameter model that used to require 140GB of VRAM with full precision can now run on 24GB with 4 bit quantization while retaining most of its capability.&lt;/p&gt;
&lt;h3&gt;What You Can Expect at Each Level&lt;/h3&gt;
&lt;p&gt;At the minimum level, you get a capable assistant for simple tasks. It can summarize short documents, draft emails, answer questions about general knowledge, and help with basic coding. The responses are slower but usable.&lt;/p&gt;
&lt;p&gt;At the good level, you get a solid daily driver. It handles most tasks well: document analysis, longer writing, moderate reasoning, and code generation. The speed is good enough for interactive use without frustration.&lt;/p&gt;
&lt;p&gt;At the great level, you approach cloud model quality for many tasks. Complex reasoning, multi step instructions, and long context windows work well. The difference between this and GPT 5.4 or Claude Opus is noticeable on hard tasks but acceptable for everyday use.&lt;/p&gt;
&lt;h3&gt;CPU Only: Is It Worth Trying?&lt;/h3&gt;
&lt;p&gt;If you do not have a GPU, you can still run local models on CPU. The experience depends on the model size and your patience. Small models (1 to 3B parameters) run at reasonable speed. Medium models (7 to 8B) run at 3 to 10 tokens per second on a modern CPU, which means 10 to 30 seconds per response. This is usable for batch processing or tasks where speed does not matter, but it is too slow for interactive conversation.&lt;/p&gt;
&lt;p&gt;The CPU path is worth trying to understand the ecosystem before investing in a GPU. Install Ollama, download a small model, and see what local AI feels like.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Ollama: The Easiest Way to Run Local Models&lt;/h2&gt;
&lt;p&gt;Ollama is the simplest tool for running local LLMs. It handles model downloading, management, and inference through a single command line interface. It also serves an OpenAI compatible API, which means any application that works with OpenAI&apos;s API can work with your local models.&lt;/p&gt;
&lt;h3&gt;Installation&lt;/h3&gt;
&lt;p&gt;Ollama runs on macOS, Linux, and Windows. Download the installer from ollama.com and run it. The installation takes about two minutes. After installation, Ollama runs as a background service.&lt;/p&gt;
&lt;h3&gt;Running a Model&lt;/h3&gt;
&lt;p&gt;Open a terminal and run:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama run llama4
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Ollama downloads the model (a few gigabytes) and starts an interactive chat session. Type your questions and get responses from the local model. Type /exit to leave.&lt;/p&gt;
&lt;p&gt;The first download takes time depending on your internet speed. Subsequent runs use the cached model and start instantly.&lt;/p&gt;
&lt;h3&gt;Available Models&lt;/h3&gt;
&lt;p&gt;Ollama hosts hundreds of models. Here are the ones worth knowing about in 2026.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Llama 4 (Meta).&lt;/strong&gt; Meta&apos;s latest open model family ranges from 8 billion to 405 billion parameters. The 8B model runs on modest hardware and handles general conversation, summarization, and simple tasks well. The 70B and 405B models require powerful hardware but approach cloud model quality.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DeepSeek V4 and R1.&lt;/strong&gt; DeepSeek&apos;s models have been a sensation in the open source community. V4 is their general purpose model, comparable to GPT 4 class models. R1 is their reasoning model that shows its chain of thought before answering. Both are available in sizes from 7B to 671B (mixture of experts). The smaller distilled versions (7B and 14B) run on consumer hardware and punch above their weight class.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Qwen 3.5 and 3.6 (Alibaba).&lt;/strong&gt; Qwen models have consistently improved. The 3.5 series offers strong performance across coding, reasoning, and general tasks. Qwen 3.6, released in mid 2026, adds improved instruction following and longer context handling. The 7B and 14B versions are popular choices for local deployment.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Gemma 4 (Google).&lt;/strong&gt; Google&apos;s open model family is designed for efficient inference. Gemma 4 26B performs well for its size and runs on a single 24GB GPU. The smaller Gemma 4 9B runs on 8GB hardware.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mistral Small 4 and Mixtral.&lt;/strong&gt; Mistral&apos;s models are known for efficiency. Mistral Small 4 (24B) punches above its weight. The Mixtral 8x22B mixture of experts model offers strong performance with efficient resource usage.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Phi 4 (Microsoft).&lt;/strong&gt; Microsoft&apos;s Phi series focuses on small, capable models. Phi 4 runs on modest hardware and handles coding and reasoning tasks surprisingly well for its size.&lt;/p&gt;
&lt;h3&gt;Practical Example&lt;/h3&gt;
&lt;p&gt;Install Ollama, download Llama 4 8B, and start using it as a local assistant. Ask it to summarize documents (paste the text directly), draft emails, explain concepts, or brainstorm ideas. The quality is not as good as GPT 5.4, but it is fast, private, and free.&lt;/p&gt;
&lt;p&gt;For better quality, try DeepSeek R1 14B. The chain of thought reasoning makes it more thorough for complex questions, and it runs well on 12GB of VRAM.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;LM Studio: A Graphical Interface for Local Models&lt;/h2&gt;
&lt;p&gt;If the command line is not your preference, LM Studio provides a graphical interface for running local models. It includes a model browser, download manager, and chat interface in a single desktop application.&lt;/p&gt;
&lt;p&gt;LM Studio also serves an OpenAI compatible API endpoint, just like Ollama. You can start the local server and point any application at &lt;code&gt;http://localhost:1234/v1&lt;/code&gt;. This is useful for connecting local models to other tools.&lt;/p&gt;
&lt;p&gt;The key advantage of LM Studio is LM Link, a feature that lets you access a model running on one computer from other devices on your network. You can run a large model on your powerful desktop and access it from your laptop. This uses Tailscale for secure tunneling and works across networks.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;DeepSeek: The Open Weight Powerhouse&lt;/h2&gt;
&lt;p&gt;DeepSeek deserves special attention because it changed the open source AI landscape. In early 2025, DeepSeek released R1, a reasoning model that matched OpenAI&apos;s best models at a fraction of the training cost. The company followed with V3 and V4, each improving on the last.&lt;/p&gt;
&lt;p&gt;DeepSeek models are open weight, meaning you can download the actual trained parameters and run them on your own hardware. This is different from open source models that release only the architecture and training code. Open weight models let you inspect, fine tune, and deploy the exact model that achieved specific benchmark results.&lt;/p&gt;
&lt;p&gt;DeepSeek V4 is available through various providers. You can access it through chat.deepseek.com for free with rate limits, through API providers for pay per use pricing, or download it for local deployment.&lt;/p&gt;
&lt;p&gt;The pricing advantage is substantial. DeepSeek&apos;s API costs roughly 10 to 20 times less than OpenAI&apos;s API for comparable quality. This makes it attractive for developers building applications that make many API calls, and for users who want to experiment with advanced models without committing to a $200 per month subscription.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;OpenCode: Terminal Based Coding Agent&lt;/h2&gt;
&lt;p&gt;OpenCode is an open source, provider agnostic coding agent that runs in your terminal. It is similar to Claude Code but works with any model provider, including local models.&lt;/p&gt;
&lt;h3&gt;Installation&lt;/h3&gt;
&lt;p&gt;OpenCode installs with a single command. On macOS: &lt;code&gt;brew install opencode&lt;/code&gt;. On Linux and Windows, the website provides installers. The setup takes under a minute.&lt;/p&gt;
&lt;h3&gt;Configuration&lt;/h3&gt;
&lt;p&gt;OpenCode works out of the box with cloud providers by default. To use it with a local model, create an &lt;code&gt;opencode.jsonc&lt;/code&gt; configuration file that points to your local endpoint.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;json {   &amp;quot;provider&amp;quot;: {     &amp;quot;local&amp;quot;: {       &amp;quot;name&amp;quot;: &amp;quot;Local Model&amp;quot;,       &amp;quot;npm&amp;quot;: &amp;quot;@ai-sdk/openai-compatible&amp;quot;,       &amp;quot;options&amp;quot;: {         &amp;quot;baseURL&amp;quot;: &amp;quot;http://localhost:11434/v1&amp;quot;       },       &amp;quot;models&amp;quot;: {         &amp;quot;deepseek-r1-14b&amp;quot;: {           &amp;quot;name&amp;quot;: &amp;quot;DeepSeek R1 14B&amp;quot;,           &amp;quot;modalities&amp;quot;: { &amp;quot;input&amp;quot;: [&amp;quot;text&amp;quot;], &amp;quot;output&amp;quot;: [&amp;quot;text&amp;quot;] }         }       }     }   } } &lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This configuration tells OpenCode to use a model running on Ollama&apos;s local endpoint.&lt;/p&gt;
&lt;h3&gt;Daily Use&lt;/h3&gt;
&lt;p&gt;OpenCode operates in two modes. Interactive mode (&lt;code&gt;opencode&lt;/code&gt;) opens a terminal UI where you describe your task and OpenCode works through it step by step. One shot mode (&lt;code&gt;opencode run &amp;quot;Add error handling to the API calls in src/client.py&amp;quot;&lt;/code&gt;) executes a single task and exits.&lt;/p&gt;
&lt;p&gt;For developers, OpenCode replaces or supplements GitHub Copilot and Claude Code. It understands your project structure, reads and writes files, runs terminal commands, and manages git operations. The key advantage is provider flexibility: you can use a local model for simple tasks and switch to a cloud model for complex ones, all within the same tool.&lt;/p&gt;
&lt;p&gt;For non developers, OpenCode is less directly useful. But it powers other applications that may benefit you indirectly, such as automated document processing pipelines and data transformation tools.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Hermes Agent: The AI Orchestrator&lt;/h2&gt;
&lt;p&gt;Hermes Agent, created by Nous Research, is an open source AI agent framework that runs in your terminal. Think of it as an AI assistant that can use tools, remember context across sessions, and coordinate with other AI tools.&lt;/p&gt;
&lt;h3&gt;What Makes Hermes Different&lt;/h3&gt;
&lt;p&gt;Hermes is provider agnostic. It works with any LLM provider: OpenAI, Anthropic, OpenRouter, or local models via Ollama. You configure the provider once, and Hermes handles the rest.&lt;/p&gt;
&lt;p&gt;The persistent memory system is a standout feature. Hermes remembers facts across sessions. Tell it your preferences once, and they apply to all future conversations. This is similar to ChatGPT&apos;s memory feature but runs entirely on your machine with no data sent to a cloud service.&lt;/p&gt;
&lt;p&gt;The skill system lets you save reusable procedures. If you have a workflow you repeat often, you can save it as a skill, and Hermes runs it automatically when the relevant context appears. Skills are just markdown files that describe the workflow, making them easy to create and modify.&lt;/p&gt;
&lt;h3&gt;Installation&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;bash curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup &lt;/code&gt;&lt;/p&gt;
&lt;p&gt;The setup walks you through configuration: choosing a provider, setting up tools, and configuring your preferred model.&lt;/p&gt;
&lt;h3&gt;Daily Use&lt;/h3&gt;
&lt;p&gt;Use Hermes in three ways.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Interactive chat.&lt;/strong&gt; Run &lt;code&gt;hermes&lt;/code&gt; to start an interactive session. You can ask questions, request tasks, and have multi turn conversations, just like ChatGPT or Claude, but with the ability to run commands, access files, and use tools.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;One shot queries.&lt;/strong&gt; &lt;code&gt;hermes chat -q &amp;quot;Summarize the changes in the last three git commits&amp;quot;&lt;/code&gt; runs a single query and returns the result. This is useful for scripting and automation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Scheduled tasks.&lt;/strong&gt; Hermes can run tasks on a schedule using cron. Set up a daily briefing that summarizes your calendar, email, and task list every morning. The output can be delivered to Slack, Telegram, email, or saved to a file.&lt;/p&gt;
&lt;h3&gt;Daily Productivity Examples for Non Developers&lt;/h3&gt;
&lt;p&gt;For readers who are not developers, Hermes might sound like a developer tool. It is, but its uses extend beyond coding. Here are concrete examples that anyone can use.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Research automation.&lt;/strong&gt; Save a research workflow as a Hermes skill. When you need to research a topic, run the skill, and Hermes searches the web, extracts key information, and compiles a summary. You do not need to manually search, copy, and paste. The skill remembers your preferred format: bullet points, paragraph summaries, or structured reports.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;File organization.&lt;/strong&gt; Ask Hermes to organize your Downloads folder by file type, date, or project. &amp;quot;Find all PDFs modified in the last week and move them to a Research folder. Delete any .tmp files older than 30 days. Create subfolders by category based on file names.&amp;quot; Hermes executes the task using its terminal access, moving and organizing files according to your instructions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Meeting notes.&lt;/strong&gt; Record meeting notes as a Hermes skill. When you finish a meeting, run the skill and Hermes prompts you for key decisions, action items, and follow ups. It formats the output consistently and saves it to your notes folder with a timestamp and meeting title. Over time, you build a searchable archive of formatted meeting notes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Content repurposing skill.&lt;/strong&gt; Save a skill that takes a piece of content and produces versions for different platforms. Input: a blog post. Output: LinkedIn summary, Twitter thread, newsletter excerpt, and internal memo. Run it once, get all four formats. The skill defines the tone and length for each platform so you do not need to repeat instructions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Weekly review.&lt;/strong&gt; Set up a scheduled Hermes task that runs every Friday at 4 PM. It reviews your week&apos;s activity, summarizes what you accomplished, and drafts a status report. You review and send. The cron based scheduling means the task runs automatically without you remembering to start it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The key insight is that Hermes remembers.&lt;/strong&gt; Unlike ChatGPT or Claude, which start fresh each conversation, Hermes saves skills and memories. Every hour you invest in setting up skills pays back in future sessions as tasks that used to take ten minutes now take one.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Privacy First: The Local AI Stack&lt;/h2&gt;
&lt;p&gt;The most interesting development in 2026 is the local AI stack: a fully private, offline setup that replaces cloud dependent tools. The standard combination is:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LM Studio&lt;/strong&gt; or &lt;strong&gt;Ollama&lt;/strong&gt; to run models locally.
&lt;strong&gt;OpenCode&lt;/strong&gt; for coding tasks, pointing at the local model.
&lt;strong&gt;Hermes Agent&lt;/strong&gt; for general purpose tasks and orchestration, also pointing at the local model.&lt;/p&gt;
&lt;p&gt;All three tools can use the same local model through Ollama&apos;s API. You install Ollama once, download a few models, and both OpenCode and Hermes connect to it automatically.&lt;/p&gt;
&lt;p&gt;The result is a fully private AI setup. Your data never leaves your machine. No subscriptions. No rate limits. No privacy concerns. The tradeoff is capability: local models are not as smart as the cloud frontier models. But for many everyday tasks, they are good enough.&lt;/p&gt;
&lt;p&gt;The author of a popular blog post about this setup summarized it well: &amp;quot;Hermes has an OpenCode skill, which means it can fire up OpenCode and interact with it.&amp;quot; The orchestrator delegates complex tasks to the specialized tool, and the whole system works together.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Open Source Image Generation&lt;/h2&gt;
&lt;p&gt;Local AI is not limited to text. Image generation models also run on your own hardware.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Stable Diffusion 4&lt;/strong&gt; runs locally via AUTOMATIC1111&apos;s WebUI, ComfyUI, or InvokeAI. You download the model once and generate unlimited images with no per image costs. The quality is competitive with cloud services for many use cases.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Flux&lt;/strong&gt; is available in open weights through Black Forest Labs. The smaller Flux Dev model runs on 12GB VRAM and produces high quality images. Flux Schnell is a faster variant for rapid prototyping.&lt;/p&gt;
&lt;p&gt;Running image generation locally requires a GPU with sufficient VRAM. 8GB handles Stable Diffusion and Flux Schnell. 16GB handles Flux Dev and higher resolution outputs.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Bottom Line&lt;/h2&gt;
&lt;p&gt;The open source AI ecosystem in 2026 is mature enough for daily use. You can run a capable language model, a coding assistant, an agent framework, and an image generator all on a single consumer grade computer, all for free after the hardware purchase.&lt;/p&gt;
&lt;p&gt;For anyone who values privacy or needs offline access. Anyone who wants unlimited usage without subscription fees. Anyone who enjoys tinkering with technology and wants full control over their AI tools.&lt;/p&gt;
&lt;p&gt;The privacy argument is the strongest for most users. Consider this: every prompt you type into ChatGPT becomes part of OpenAI&apos;s training data unless you opt out. Every document you upload to Claude is processed on Anthropic&apos;s servers. For personal use, you might be comfortable with this. For work related tasks, your employer&apos;s policies may prohibit sending data to third party AI services. Local models remove this concern entirely.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Who should stick with cloud services?&lt;/strong&gt; Anyone who wants the best possible quality with zero setup effort. Cloud models are still smarter, faster, and more reliable than local alternatives. For mission critical work where quality matters most, the cloud is the better choice. If you are a writer producing polished content, a researcher synthesizing complex information, or a developer solving hard problems, the $20 per month for ChatGPT Plus or Claude Pro is money well spent.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Who should try local?&lt;/strong&gt; Anyone who is curious, values privacy, works offline sometimes, or wants to avoid ongoing subscription costs. The setup takes an afternoon. The hardware costs money upfront but pays for itself over time. And the learning process itself is valuable: running a local model gives you a deeper understanding of how AI actually works, which is the whole point of this series.&lt;/p&gt;
&lt;p&gt;The best approach is hybrid. Use cloud services for the hard stuff: complex reasoning, long form writing, creative brainstorming. Use local models for everyday tasks: simple questions, document summaries, quick drafts, and anything involving sensitive data. Both have their place, and the tools now make it easy to switch between them.&lt;/p&gt;
&lt;h3&gt;Quick Decision Guide&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Goal&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Best quality, no setup&lt;/td&gt;
&lt;td&gt;ChatGPT Plus ($20/mo) or Claude Pro ($20/mo)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best quality, remote desktop control&lt;/td&gt;
&lt;td&gt;Claude Max ($200/mo) for Dispatch and Cowork&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privacy, offline, zero recurring cost&lt;/td&gt;
&lt;td&gt;Ollama + local models, free after hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best balance of privacy and capability&lt;/td&gt;
&lt;td&gt;Hybrid: cloud for hard tasks, local for sensitive work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Developer coding assistant&lt;/td&gt;
&lt;td&gt;OpenCode or Claude Code with your choice of model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full AI automation on your machine&lt;/td&gt;
&lt;td&gt;Hermes Agent with skills and scheduled tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;The Takeaway&lt;/h3&gt;
&lt;p&gt;You do not need to choose one approach exclusively. The best AI setup in 2026 uses multiple tools for different tasks. Free Google services for everyday queries. A $20 subscription for hard problems. Local models for sensitive work. Specialized tools for creative projects. The ecosystem is broad enough that there is a right tool for every task and a price point for every budget.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Where to Go From Here&lt;/h2&gt;
&lt;p&gt;This series covered the full spectrum of AI tools available in 2026. You started with the fundamentals in Part 1: how prediction engines, vectors, and transformers work. You moved to free tools in Part 2: Gemini, NotebookLM, and the Google AI ecosystem. Part 3 covered the paid powerhouses: ChatGPT and Claude with their desktop apps, Dispatch, and advanced features. Part 4 toured the creative side: image, video, music, and audio generation. This final part opened the door to the open source world: local models, agent frameworks, and complete privacy.&lt;/p&gt;
&lt;p&gt;The point of this series is not to sell you on any particular tool or approach. It is to give you a map of the landscape so you can choose what fits your needs. The best AI tool is the one you actually use. Start with the free tier. Add a paid subscription when you hit limits. Explore local models if privacy matters to you. Switch between tools depending on the task.&lt;/p&gt;
&lt;p&gt;AI is not magic and it is not sentient, as we covered in Part 1. It is a tool, like a search engine or a spreadsheet, but more versatile than either. The more you understand what it is and what it is not, the better you will use it.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-1-what-ai-is-and-isnt/&quot;&gt;Return to Part 1: What AI Is and Isnt&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-2-getting-started-for-free/&quot;&gt;Return to Part 2: Getting Started for Free&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-3-chatgpt-and-claude-deep-dive/&quot;&gt;Return to Part 3: ChatGPT and Claude Deep Dive&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/ai-for-all-levels-4-specialized-ai-tools/&quot;&gt;Return to Part 4: Specialized AI Tools for Creation&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Data Platform Native AI Agent Tooling in 2026</title><link>https://iceberglakehouse.com/posts/data-platform-ai-agent-tooling/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/data-platform-ai-agent-tooling/</guid><description>
# Data Platform Native AI Agent Tooling in 2026

&lt;!-- Meta Description: A comprehensive comparison of AI agent tooling across Dremio, Snowflake, Data...</description><pubDate>Sun, 31 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;Data Platform Native AI Agent Tooling in 2026&lt;/h1&gt;
&lt;p&gt;&amp;lt;!-- Meta Description: A comprehensive comparison of AI agent tooling across Dremio, Snowflake, Databricks, Microsoft Fabric, AWS, Google Cloud, ClickHouse, VeloDB, SpiceAI, Bauplan, and Qlik in 2026. --&amp;gt;
&amp;lt;!-- Primary Keyword: Data platform AI agent tooling --&amp;gt;
&amp;lt;!-- Secondary Keywords: agentic analytics, data platform agents, MCP server data, lakehouse AI agents --&amp;gt;&lt;/p&gt;
&lt;p&gt;Every data platform vendor now offers some form of AI agent tooling. The approaches vary widely, from full agent authoring frameworks to MCP server endpoints to semantic layers designed for agent consumption. This article walks through eleven platforms and what each one offers for building, deploying, and running AI agents on your data.&lt;/p&gt;
&lt;p&gt;The common thread across all of them is the Model Context Protocol, or MCP. Anthropic released MCP in late 2024 as an open standard for connecting AI models to external tools and data sources. By mid-2026, nearly every major data platform ships an MCP server. MCP has become the default bridge between AI agents and enterprise data, replacing a patchwork of vendor-specific APIs.&lt;/p&gt;
&lt;p&gt;Here is how each platform approaches native AI agent tooling.&lt;/p&gt;
&lt;h2&gt;Dremio&lt;/h2&gt;
&lt;p&gt;Dremio took an early lead on agentic data access by shipping three integration paths in parallel.&lt;/p&gt;
&lt;p&gt;The first path is the built-in AI agent inside the Dremio console. You click a button and get a conversational interface that can discover data, generate visualizations, debug slow queries, and share insights. No setup, no configuration. The agent runs against Dremio&apos;s semantic layer, which means every query respects existing user permissions. Users see only the data they are authorized to access.&lt;/p&gt;
&lt;p&gt;The second path is the MCP server. Any MCP-compatible agent, including Claude Desktop, ChatGPT, or Gemini, connects to your Dremio lakehouse in minutes through a standardized endpoint. The MCP server exposes Dremio&apos;s full catalog, query engine, and semantic layer as tools that agents can discover and call. This is the path for business users and knowledge workers who work through chat-style interfaces.&lt;/p&gt;
&lt;p&gt;The third path is the Dremio CLI. It gives programmatic control to AI coding agents like Claude Code, Codex, or Cursor. The CLI is self-describing, meaning agents can discover Dremio&apos;s full capabilities and compose them into workflows. Use cases include automated data ingestion, table creation, schema transformations, query diagnosis, and end-to-end pipeline automation. The CLI installs via pipx, uv, or npm.&lt;/p&gt;
&lt;p&gt;Dremio also published its Agentic Lakehouse Architecture framework, which defines four technical layers for agent-ready data platforms: object storage, the Iceberg table format, the Polaris catalog for access control, and the query engine layer. The framework is useful for any team designing a data platform where AI agents are primary consumers.&lt;/p&gt;
&lt;h2&gt;Snowflake&lt;/h2&gt;
&lt;p&gt;Snowflake delivers agentic AI through two products: Cortex Agents and Snowflake Intelligence.&lt;/p&gt;
&lt;p&gt;Cortex Agents are Snowflake&apos;s framework for building AI agents that operate on your Snowflake data. A Cortex Agent combines Cortex Analyst for text-to-SQL, Cortex Search for hybrid search over unstructured data, and user-defined functions as callable tools. The agent uses a planner that breaks down complex questions into subtasks, executes them against Snowflake&apos;s compute layer, and reflects on results before responding.&lt;/p&gt;
&lt;p&gt;Snowflake Intelligence sits one level above Cortex Agents. It is the entry point for business users. You ask questions in natural language, and Intelligence routes them to the appropriate Cortex Agent or Cortex Search service. Intelligence also connects to external services through Snowflake&apos;s managed MCP server, which exposes Snowflake objects as tools that external agents can discover and call.&lt;/p&gt;
&lt;p&gt;The managed MCP server supports any MCP-compatible client. Snowflake also demonstrated multi-agent orchestration patterns that combine Snowflake Cortex with Microsoft AI Foundry agents, using MCP as the cross-platform bridge.&lt;/p&gt;
&lt;p&gt;Cortex Code, released in early 2026, extends agentic capabilities to coding workflows. It is an AI coding agent that works with local files and Snowflake data together, understanding context from both sides.&lt;/p&gt;
&lt;h2&gt;Databricks&lt;/h2&gt;
&lt;p&gt;Databricks built its agent tooling around Mosaic AI, which now includes a full agent development lifecycle.&lt;/p&gt;
&lt;p&gt;The starting point is the AI Playground, a no-code interface for prototyping and testing LLMs and agents with prompt engineering and parameter tuning. From there, you can build Knowledge Assistants for domain-specific Q&amp;amp;A chatbots that use Unity Catalog for data discovery and governance.&lt;/p&gt;
&lt;p&gt;For more complex scenarios, Databricks offers Agent Bricks. These are pre-built agent patterns including a Supervisor Agent that orchestrates multiple sub-agents, Genie Spaces, Unity Catalog functions, MCP servers, and custom Python agents. The Supervisor Agent is effectively a multi-agent orchestrator running inside Databricks.&lt;/p&gt;
&lt;p&gt;Custom agents are built with Python using the Databricks Agent Framework. The framework supports tool calling, RAG with Vector Search, and multi-agent coordination. Agents are deployed on scalable inference endpoints with built-in monitoring through MLflow Tracing.&lt;/p&gt;
&lt;p&gt;Databricks also supports MCP natively. You can register MCP servers as tools in Unity Catalog, and agents discover them through the catalog. This means your agent can call internal MCP servers alongside Databricks-native tools using a single discovery mechanism.&lt;/p&gt;
&lt;p&gt;The Unity AI Gateway provides governance across the entire stack, with usage tracking, payload logging, and security controls for LLMs and agents.&lt;/p&gt;
&lt;h2&gt;Microsoft Fabric&lt;/h2&gt;
&lt;p&gt;Microsoft&apos;s approach to data AI agents runs through two channels: Copilot in Fabric and Fabric Data Agents.&lt;/p&gt;
&lt;p&gt;Copilot in Fabric is integrated into every Fabric workload. In notebooks, it generates, refactors, and validates code with awareness of workspace context, schemas, and runtime state. In the data warehouse, it converts natural language to SQL and suggests completions. In Power BI, it builds reports from a topic description and writes DAX queries. In Real-Time Intelligence, it generates KQL queries for log and event data.&lt;/p&gt;
&lt;p&gt;Fabric Data Agents are a separate, more powerful capability released in late 2025. These are agents that operate on Fabric data with access to structured and unstructured content. They integrate with Microsoft 365 Copilot, so an agent working in Fabric can surface insights inside Teams or Outlook.&lt;/p&gt;
&lt;p&gt;Microsoft also provides a managed MCP server endpoint for Fabric. This lets third-party AI assistants query Fabric data through a standardized interface. At Ignite 2025, Microsoft demonstrated Fabric agents that use ontology models to understand business context and relationships across data sources.&lt;/p&gt;
&lt;p&gt;The deeper play is Azure AI Foundry. Foundry lets you build custom agents that orchestrate across Fabric, Azure OpenAI, AI Search, and external MCP servers. Snowflake and Microsoft jointly published a reference architecture for multi-agent orchestration using Cortex MCP with AI Foundry, showing how the platforms interoperate.&lt;/p&gt;
&lt;h2&gt;AWS&lt;/h2&gt;
&lt;p&gt;AWS centers its agent tooling on Amazon Bedrock, which has evolved significantly through 2025 and 2026.&lt;/p&gt;
&lt;p&gt;Amazon Bedrock Agents let you build generative AI applications that automate multi-step tasks. An agent receives a natural language request, breaks it into steps, calls the appropriate tools or APIs, and returns a result. Tools can be Lambda functions, data sources indexed by Knowledge Bases, or external APIs.&lt;/p&gt;
&lt;p&gt;The major 2026 addition is AgentCore. AgentCore is a platform layer for building, connecting, and optimizing AI agents. It is framework-agnostic, meaning you can deploy agents built with any framework and any model. AgentCore handles identity, access control, observability, and cost tracking across all your agents.&lt;/p&gt;
&lt;p&gt;AWS also announced Quick, a new AI assistant targeted at business users that works across AWS services. Quick joins Bedrock in providing both a managed AI assistant experience and a builder framework.&lt;/p&gt;
&lt;p&gt;At What&apos;s Next 2026, AWS pushed hard on vertical agents for healthcare, financial services, and supply chain. These are pre-built agent templates with domain-specific tools and knowledge bases, deployed through Bedrock.&lt;/p&gt;
&lt;p&gt;For data access, AWS provides MCP server implementations for S3, Glue, Athena, and Redshift. Agents discover these through the Bedrock tool registry and call them during execution.&lt;/p&gt;
&lt;h2&gt;Google Cloud&lt;/h2&gt;
&lt;p&gt;Google Cloud&apos;s agent story runs through Gemini Enterprise Agent Platform, the rebranded and expanded version of Vertex AI Agent Builder.&lt;/p&gt;
&lt;p&gt;The platform provides a no-code agent builder, a code-based agent SDK, and a tool governance system. You define agent behavior, connect tools, and deploy on Google&apos;s infrastructure. Tools include BigQuery, Discovery Engine for enterprise search, and third-party APIs registered through the Cloud API Registry.&lt;/p&gt;
&lt;p&gt;Vertex AI Agent Builder was originally focused on customer service and search agents. In 2026, Google expanded it to cover data analytics agents that can query BigQuery, Looker, and Spanner. These agents use Gemini&apos;s native SQL generation capabilities and can chain multiple queries to answer complex analytical questions.&lt;/p&gt;
&lt;p&gt;The Cloud API Registry is the tool governance layer. It lets platform teams register, version, and manage APIs that agents can call. This addresses the operational problem of agent tool sprawl, where agents accumulate dozens of undocumented tool dependencies.&lt;/p&gt;
&lt;p&gt;Google also ships an MCP server for BigQuery. It exposes BigQuery&apos;s full SQL interface, table metadata, and job management as MCP tools. Combined with Gemini&apos;s long-context window, this creates a usable pattern for analytical agents that iterate on SQL queries based on results.&lt;/p&gt;
&lt;p&gt;For security, Google enforces access controls through its IAM system. Agent tool calls respect the same permissions as direct user calls.&lt;/p&gt;
&lt;h2&gt;ClickHouse&lt;/h2&gt;
&lt;p&gt;ClickHouse took a pragmatic approach. Rather than building a proprietary agent framework, it shipped an open-source MCP server and a comprehensive set of integration guides.&lt;/p&gt;
&lt;p&gt;The ClickHouse MCP server, mcp-clickhouse, exposes three core tools: run_select_query, list_databases, and list_tables. These are intentionally few. ClickHouse&apos;s philosophy is that agents should interact with the database through SQL, not through a layer of abstraction. The MCP server gives agents schema discovery and query execution, and the agents decide what to do with that capability.&lt;/p&gt;
&lt;p&gt;ClickHouse published integration guides for 17 different AI agent frameworks, including LangChain, LlamaIndex, CrewAI, DSPy, OpenAI Agents, Microsoft Agent Framework, PydanticAI, and Chainlit. Each guide shows how to connect the framework to ClickHouse through MCP. This breadth of coverage reflects ClickHouse&apos;s position as an infrastructure layer that teams access through whatever agent framework they prefer.&lt;/p&gt;
&lt;p&gt;The ClickHouse Agent Skills repository provides packaged instructions for AI coding agents. These are markdown files that extend Claude Code, Cursor, and Copilot with ClickHouse domain knowledge covering schema design, query optimization, and data ingestion patterns.&lt;/p&gt;
&lt;p&gt;In ClickHouse Cloud, the remote MCP server is managed by ClickHouse. You do not need to host it yourself. Cloud customers connect their agents directly.&lt;/p&gt;
&lt;h2&gt;VeloDB and Apache Doris&lt;/h2&gt;
&lt;p&gt;VeloDB, the managed cloud platform powered by Apache Doris, positions itself as the analytics database for the agentic AI era.&lt;/p&gt;
&lt;p&gt;Doris provides the Apache Doris MCP Server, built with Python and FastAPI. The MCP server supports three transport modes: Server-Sent Events for real-time bidirectional communication, Streamable HTTP for large streaming queries, and Stdio for low-latency IDE integration with tools like Cursor.&lt;/p&gt;
&lt;p&gt;The Doris MCP server exposes tools including exec_query for core SQL execution, get_table_schema and get_table_column_comments for metadata discovery, and get_recent_audit_logs for compliance. The server includes smart connection pooling, automatic SQL safety checks, intelligent LIMIT enforcement, and query timeout controls.&lt;/p&gt;
&lt;p&gt;Doris also supports vector search through native ANN vector indexes. This enables hybrid search patterns where agents combine structured SQL filters with semantic similarity. Doris has LLM SQL functions for summarization, sentiment analysis, classification, and entity extraction, all callable from standard SQL.&lt;/p&gt;
&lt;p&gt;The Doris multi-catalog feature lets agents query data across MySQL, PostgreSQL, Iceberg, Hive, and S3 through a single MCP endpoint. This federated query capability is important for agents that need to reason across data silos without moving data.&lt;/p&gt;
&lt;p&gt;VeloDB&apos;s AI observability story is also strong. The same engine that serves agent queries can ingest and analyze agent logs, traces, and performance metrics. Teams use Doris as the backend for agent monitoring dashboards.&lt;/p&gt;
&lt;h2&gt;Spice AI&lt;/h2&gt;
&lt;p&gt;Spice AI takes a different approach from the large cloud vendors. Spice is an open-source SQL query and hybrid search engine, written in Rust, designed specifically as a data sidecar for AI applications and agents.&lt;/p&gt;
&lt;p&gt;Spice runs as a local or containerized process next to your agent. It connects to your data sources, accelerates queries through caching and materialization, and exposes a SQL interface plus vector search. The key design decision is that Spice sits close to the agent, not in the cloud. This gives sub-millisecond query latencies and offline capability.&lt;/p&gt;
&lt;p&gt;Spice provides its own MCP server, making it discoverable by any MCP-compatible agent. It also acts as a federated MCP client, meaning it can connect to external MCP servers and expose their tools to your agent through a single interface. This is useful when your agent needs data from multiple sources that each provide their own MCP server.&lt;/p&gt;
&lt;p&gt;The hybrid search capability combines SQL filtering with full-text search and vector similarity. Agents can ask questions that require both structured conditions and semantic matching in a single query.&lt;/p&gt;
&lt;p&gt;Spice AI&apos;s data acceleration layer caches query results and pre-computes aggregations. For agent workloads with repeated access patterns, this avoids hitting the source database for every query. Spice detects access patterns and optimizes the cache automatically.&lt;/p&gt;
&lt;h2&gt;Bauplan Labs&lt;/h2&gt;
&lt;p&gt;Bauplan is the most unusual entry on this list. It does not provide an AI agent. It provides infrastructure where AI agents build data pipelines safely.&lt;/p&gt;
&lt;p&gt;The core concept is Git for data. Bauplan gives you isolated branches of your Iceberg tables, each with its own commit history. An AI coding agent like Claude Code creates a branch, runs Python or SQL pipelines against it, verifies the results, and merges back to main. If something goes wrong, you roll back or branch from a historical commit.&lt;/p&gt;
&lt;p&gt;Bauplan Skills are the mechanism that makes this work. A Skill is a markdown file that describes a workflow: ingestion, data quality checks, pipeline creation, or debugging. When Claude Code starts in a Bauplan repo, it discovers the Skills and loads them as instructions. The Skills tell the agent how to interact with Bauplan&apos;s branch-based environment safely.&lt;/p&gt;
&lt;p&gt;The five core Skills are data assessment, safe ingestion with audit checks, data pipeline scaffolding, data quality checks with anomaly detection, and debug and fix pipeline. Each Skill enforces the branch-run-verify-merge loop.&lt;/p&gt;
&lt;p&gt;Bauplan&apos;s approach addresses a real problem. Coding agents are good at writing code. But data work is different from code work. Data is shared, does not have Git semantics, and downstream systems depend on it. Bauplan gives agents the safety rails to work on production data without breaking things.&lt;/p&gt;
&lt;h2&gt;Qlik&lt;/h2&gt;
&lt;p&gt;Qlik entered the agent space with a focus on the analytics workflow from discovery through action.&lt;/p&gt;
&lt;p&gt;Qlik Answers is the entry point. It combines structured analytics with unstructured content to provide contextual answers with follow-up reasoning. Below Answers sits the Semantic Layer, a shared set of business definitions that ensures consistent meaning across Qlik Answers, apps, and third-party assistants.&lt;/p&gt;
&lt;p&gt;Qlik ships four specialized agents. The Discovery Agent monitors key data areas and surfaces anomalies and changes. The Predict Agent builds ML models, generates predictions, and interprets results for forward-looking questions. The Automate Agent triggers workflows in downstream systems through natural language. The Analytics Agent supports analytics development tasks.&lt;/p&gt;
&lt;p&gt;Qlik also provides an MCP server that exposes its analytics capabilities to third-party AI assistants. This lets existing assistants use Qlik&apos;s calculations and data models without migrating to Qlik&apos;s native interface.&lt;/p&gt;
&lt;p&gt;The flow Qlik promotes is detect, investigate, predict, act. Data changes trigger the Discovery Agent, which passes signals to the Predict and Automate agents for action. This is a full agentic analytics pipeline inside a BI platform.&lt;/p&gt;
&lt;h2&gt;How to Choose&lt;/h2&gt;
&lt;p&gt;The platforms break into three rough categories.&lt;/p&gt;
&lt;p&gt;First are the full agent authoring platforms. Databricks with Mosaic AI, Snowflake with Cortex Agents, and AWS with Bedrock AgentCore let you build, deploy, and monitor custom agents that weave together data access and external tool calls. These are for teams that need to ship production agents with governance and observability out of the box.&lt;/p&gt;
&lt;p&gt;Second are the data access layer platforms. Dremio, Spice AI, ClickHouse, and VeloDB focus on making their query engines agent-accessible. They ship MCP servers and CLIs that let any agent discover and query data. These are for teams that already have an agent framework and need a reliable data connection layer.&lt;/p&gt;
&lt;p&gt;Third are the workflow and analytics platforms. Microsoft Fabric, Qlik, Bauplan, and Google Cloud occupy different parts of this space. Fabric and Qlik embed agents into existing analytics workflows. Google provides agent tooling alongside BigQuery. Bauplan provides safety infrastructure for agent-driven data engineering.&lt;/p&gt;
&lt;p&gt;Most teams will end up using platforms from multiple categories. A typical stack in 2026 might be Databricks for agent authoring, Dremio or Spice for multi-source data access, and Qlik for the analytics front end. The interoperability through MCP makes this practical in a way it was not two years ago.&lt;/p&gt;
&lt;h2&gt;Tradeoffs and Limitations&lt;/h2&gt;
&lt;p&gt;Agentic data tooling is still immature in several areas.&lt;/p&gt;
&lt;p&gt;Cost control is the biggest open problem. Agent queries are not like dashboard queries. An agent can issue dozens of exploratory queries before arriving at an answer, and each one consumes compute. Few platforms offer budget-aware query routing or cost caps on agent sessions.&lt;/p&gt;
&lt;p&gt;Evaluation is another gap. Databricks and Snowflake provide basic agent evaluation through MLflow and Cortex tracing respectively. But there is no standard way to measure whether an agent answered a data question correctly. Most teams rely on manual spot checks.&lt;/p&gt;
&lt;p&gt;Data quality is a hidden risk. An agent is only as good as the data it finds. If the catalog is poorly named, the schema is undocumented, or the data has known quality issues, the agent will produce confident but wrong answers. Semantic layers and data contracts help, but most organizations have not invested in these yet.&lt;/p&gt;
&lt;p&gt;MCP interoperability is not as seamless as the marketing suggests. Every MCP server implements the protocol slightly differently. Tool naming conventions vary. Authentication models differ. A LangChain agent that works with ClickHouse MCP out of the box may need custom wiring for Dremio MCP. The protocol standardizes the transport, not the semantics.&lt;/p&gt;
&lt;p&gt;Security models vary by implementation. Snowflake and Dremio enforce data access through their existing permission systems. Databricks uses Unity Catalog for governance. But not all MCP servers implement credential vending or scoped access. If you expose an unrestricted MCP endpoint, an agent can query anything it can reach.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Data platform native AI agent tooling in 2026 means MCP servers, semantic layers, and agent authoring frameworks that understand data. The separation between data platforms and AI platforms is dissolving. If you are designing a data architecture today, you should assume that a significant fraction of queries in the next two years will come from AI agents, not human analysts.&lt;/p&gt;
&lt;p&gt;Start with MCP compatibility as a baseline requirement for any data tool you evaluate. Then decide whether you need a full agent authoring platform or a data access layer for your existing agent framework. The right answer depends on how much control you need over agent behavior and how much you already have invested in your agent stack.&lt;/p&gt;
&lt;p&gt;If you want to go deeper on these patterns and architectures, Alex Merced has written extensively on data platforms, AI agents, and modern data architecture. You can find his books at books.alexmerced.com.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The Complete Guide to Agentic Coding Tools in 2026</title><link>https://iceberglakehouse.com/posts/agentic-coding-tools-may-2026/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/agentic-coding-tools-may-2026/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-05-agentic-coding-tools/).

The ter...</description><pubDate>Sun, 31 May 2026 00:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-05-agentic-coding-tools/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The terminal is back. Not the green phosphor CRT kind, but the ethos. In 2026, the most interesting work in developer tooling happens at a command prompt, inside an IDE panel, or through a chat app you already have open. Agentic coding tools have exploded from a handful of experimental projects into a full ecosystem with hundreds of options, billions of API calls per month, and a pace of change that makes last year&apos;s roundups feel like ancient history.&lt;/p&gt;
&lt;p&gt;I track this space obsessively, across four distinct categories. Each solves a different problem. Each has its own tradeoffs. Here is the breakdown.&lt;/p&gt;
&lt;h2&gt;Coding CLI Agents: The Terminal Renaissance&lt;/h2&gt;
&lt;p&gt;The command line never went away, but it spent a decade playing second fiddle to graphical IDEs. That changed in late 2024 when Claude Code (then Claude Engineer) showed what a terminal-native agent could do. Now CLI coding agents are the fastest-growing segment of the developer tools market.&lt;/p&gt;
&lt;h3&gt;Claude Code&lt;/h3&gt;
&lt;p&gt;Anthropic&apos;s flagship coding agent runs entirely in your terminal. It reads your repo, writes files, runs shell commands, manages git branches, and opens pull requests. The 1 million token context window lets it hold your entire codebase in memory. Claude Code uses roughly 5.5 times fewer tokens than equivalent Cursor sessions, which matters when you are paying per token.&lt;/p&gt;
&lt;p&gt;It scores 80.9 percent on SWE-Bench Verified, the highest of any publicly available agent. The hook and plugin system lets you wire in custom validators, linters, or deployment scripts that fire before every commit. It costs $17-20 a month on Pro or $100-200 on Max. The caveat: you are locked into Claude. You cannot swap in another model.&lt;/p&gt;
&lt;h3&gt;OpenCode&lt;/h3&gt;
&lt;p&gt;With over 140,000 GitHub stars, OpenCode is the open-source alternative that refuses to be ignored. It supports 75-plus LLM providers through a unified adapter layer. Want Claude for reasoning and a local Qwen model for quick edits? OpenCode handles that. It runs multi-session workflows, has a plugin system called &amp;quot;SLIM,&amp;quot; and operates locally so your code never touches a server unless you want it to.&lt;/p&gt;
&lt;p&gt;The project moves fast, and that speed comes with occasional breakage. But for developers who want maximum model flexibility without vendor lock-in, OpenCode is the default choice.&lt;/p&gt;
&lt;h3&gt;OpenAI Codex CLI&lt;/h3&gt;
&lt;p&gt;Codex returned in 2025 as a lightweight, local-first agent tied to your ChatGPT subscription. It authenticates through your existing OpenAI account, so there is no separate billing. The cloud sandbox execution mode runs code in ephemeral environments, and the autonomous agent mode can work through multi-step tasks without handholding.&lt;/p&gt;
&lt;p&gt;Codex has extensions for VS Code, Cursor, and Windsurf, making it a hybrid between pure CLI and IDE integration. Its biggest weakness is model lock-in. You need GPT-5 series models to use it, though that also means you get the latest OpenAI capabilities the day they ship.&lt;/p&gt;
&lt;h3&gt;Aider&lt;/h3&gt;
&lt;p&gt;Aider is the veteran of the category, with 39,000 GitHub stars, 4.1 million installations, and 15 billion tokens processed per week. It auto-commits to git with sensible commit messages, works with over 100 languages, and supports Claude, GPT, DeepSeek, and local models via Ollama.&lt;/p&gt;
&lt;p&gt;The voice-to-code feature is surprisingly useful. Dictating &amp;quot;refactor this function to use async/await&amp;quot; while scrolling through code feels faster than typing it. Aider remains the gold standard for terminal pair programming, and it is completely free and open source.&lt;/p&gt;
&lt;h3&gt;Pi&lt;/h3&gt;
&lt;p&gt;Pi (pi.dev) positions itself as a security-first terminal agent. It runs in a sandboxed environment with granular file system permissions. Every tool call must be explicitly approved unless you configure trust rules. Pi is built for teams that need compliance without sacrificing agent capability.&lt;/p&gt;
&lt;p&gt;It supports multi-turn autonomous sessions, can browse the web, read documentation, and execute code in isolated containers. The tradeoff is speed. Approval toggles add friction compared to fully autonomous agents like Claude Code.&lt;/p&gt;
&lt;h3&gt;Goose&lt;/h3&gt;
&lt;p&gt;Goose started as an internal tool at Block (Square) and open-sourced under Apache 2.0. It transitioned to foundation governance under the Linux Foundation&apos;s Agentic AI initiative in early 2026, which gives it a neutrality that other projects lack.&lt;/p&gt;
&lt;p&gt;Goose is MCP-extensible, meaning any tool that speaks the Model Context Protocol can plug into it. It runs full development workflows — plan, code, test, commit — and is genuinely model-agnostic. The desktop companion app gives you a GUI without losing the CLI&apos;s power.&lt;/p&gt;
&lt;h3&gt;Gemini CLI&lt;/h3&gt;
&lt;p&gt;Google&apos;s entry is open source and offers the most generous free tier in the category: 1,000 requests per day with a Google account. That is effectively unlimited for most developers. The 1 million token context window matches Claude Code, and built-in web search grounding lets the agent pull documentation live.&lt;/p&gt;
&lt;p&gt;Gemini CLI supports conversation checkpointing, so you can pause a session and resume it later. The model router automatically picks Gemini 2.5 Pro for complex reasoning and Gemini 2.5 Flash for quick tasks. If Google keeps this free tier, it will be hard to beat for experimentation and learning.&lt;/p&gt;
&lt;h3&gt;GitHub Copilot CLI&lt;/h3&gt;
&lt;p&gt;The GitHub Copilot CLI emerged from public preview in 2026 and integrates deeply with the GitHub ecosystem. It references issues, browses pull requests, manages repos, and supports MCP tools. The default model is Claude Sonnet 4.5, but you can switch to GPT-5.&lt;/p&gt;
&lt;p&gt;The free tier gives 50 premium requests per month. Full access requires a Copilot subscription at $10-39 per seat. For teams already living inside GitHub, the integration is unmatched. For everyone else, the model flexibility of OpenCode or the cost of Gemini CLI looks better.&lt;/p&gt;
&lt;h3&gt;Amp&lt;/h3&gt;
&lt;p&gt;Sourcegraph&apos;s Amp offers a &amp;quot;deep mode&amp;quot; that uses GPT-5.2-Codex for extended autonomous research and implementation. It has composable subagents: Oracle for code analysis, Librarian for external library research, and Painter for image generation.&lt;/p&gt;
&lt;p&gt;The pricing is unusual. Amp is free, ad-supported, with a $10 per day API cost cap. Sourcegraph claims they add no markup on API costs, which makes Amp one of the most transparently priced tools on the market.&lt;/p&gt;
&lt;h3&gt;Warp&lt;/h3&gt;
&lt;p&gt;Warp is a full terminal replacement written in Rust with GPU acceleration. It runs multiple agents simultaneously — you can have Claude Code, Codex, and Gemini CLI all working in split panes. The built-in file editor and code review panel eliminate the need to alt-tab to an IDE.&lt;/p&gt;
&lt;p&gt;Warp claims its agent ships over 50 percent of its own pull requests. The WARP.md project configuration file lets you define project-specific agent behaviors. It is the right tool for developers who basically live in their terminal and want an all-in-one environment.&lt;/p&gt;
&lt;h3&gt;Augment CLI&lt;/h3&gt;
&lt;p&gt;Augment&apos;s enterprise context engine indexes your entire codebase: source code, dependencies, architecture, git history, even Slack threads about the code. The CLI agent uses this context to produce more accurate changes with fewer hallucinated imports.&lt;/p&gt;
&lt;p&gt;Augment scored first on SWE-Bench Pro and counts MongoDB, Spotify, and Webflow as customers. It is the most expensive option in this category, but for large codebases where context quality determines success, the cost is justified.&lt;/p&gt;
&lt;h3&gt;Roo Code / Kilo Code&lt;/h3&gt;
&lt;p&gt;Roo Code (formerly Roo Cline) and Kilo Code (formerly Kilocode) are both VS Code extensions that function as standalone CLI agents. Roo Code has a reputation for reliability on large multi-file changes -- &amp;quot;when other agents break down, use Roo&amp;quot; is a common sentiment.&lt;/p&gt;
&lt;p&gt;Kilo Code supports 500-plus models across 60-plus providers, has an orchestrator mode that breaks complex tasks into subagent workflows, and offers full transparency by showing every token and cost in real time. Both operate on pay-as-you-go pricing.&lt;/p&gt;
&lt;h3&gt;Crush&lt;/h3&gt;
&lt;p&gt;Crush runs on the Charm license and differentiates itself through cross-platform support that includes Android. You can run a coding agent on your phone. Mid-session model switching lets you start with an expensive reasoning model and swap to a cheaper execution model for the mechanical parts of the task. Granular permissions control which files and commands each session can access.&lt;/p&gt;
&lt;h3&gt;Kimi Code CLI&lt;/h3&gt;
&lt;p&gt;Moonshot AI&apos;s entry into the CLI agent category uses the Kimi K2.5 model, which achieves 84.34 percent on MMMU (beating Claude Opus 4.6 on multimodal reasoning). The CLI supports 100-agent swarm capability, meaning you can spin up a hundred agents to work on different parts of a codebase in parallel. This is overkill for most projects, but for massive refactors, it is something no other CLI agent offers.&lt;/p&gt;
&lt;h3&gt;Forge Code&lt;/h3&gt;
&lt;p&gt;Forge Code is a relative newcomer that focuses on agentic CI/CD pipelines. It generates code directly inside your GitHub Actions or GitLab CI workflows. When a test fails, Forge Code analyzes the failure, writes a fix, runs tests again, and commits the fix if everything passes. It is the only CLI agent designed to run inside CI rather than on your local machine.&lt;/p&gt;
&lt;h3&gt;Qwen Code&lt;/h3&gt;
&lt;p&gt;Alibaba&apos;s Qwen Code offers a completely free API, which is remarkable for a tool that scores around 70.6 percent on SWE-Bench. The 1 million token context window matches Claude Code. The catch is availability -- the free API has rate limits, and while Alibaba is clearly subsidizing it for market share, nobody knows how long that will last. For experimentation and learning, it is unbeatable value.&lt;/p&gt;
&lt;h3&gt;T3 Code&lt;/h3&gt;
&lt;p&gt;T3 Code is the free, open-source agent built on the T3 stack philosophy. It is designed for developers who want a working agent without paying for API keys or subscriptions. The tradeoff is that it defaults to local models, which means slower responses and lower capability compared to cloud-backed agents. For solo developers on a budget, T3 Code is worth a look.&lt;/p&gt;
&lt;h3&gt;iFlow&lt;/h3&gt;
&lt;p&gt;iFlow is a CLI agent built around the concept of SubAgents with controlled file permissions. You define which parts of your filesystem each subagent can read and write. This makes it suitable for monorepos where you want agents working on different packages to stay in their lanes. The permission system is more granular than anything in the category except Pi.&lt;/p&gt;
&lt;h3&gt;Amazon Q Developer CLI&lt;/h3&gt;
&lt;p&gt;Amazon Q Developer offers a free tier that is generous for AWS-heavy workflows. The CLI agent understands AWS services natively and can generate infrastructure code, debug Lambda functions, and query CloudWatch logs without you needing to context-switch. Outside of AWS, it is competent but not best-in-class.&lt;/p&gt;
&lt;h2&gt;UI-Based Tools: Desktop IDEs and Apps&lt;/h2&gt;
&lt;p&gt;Not everyone wants to live in the terminal. The desktop IDE category has evolved from autocomplete copilots into full agentic platforms that can build features from scratch, run tests, deploy, and even debug production issues.&lt;/p&gt;
&lt;h3&gt;Cursor&lt;/h3&gt;
&lt;p&gt;Cursor remains the most popular AI-first IDE. Its tab completion quality is still the best in the industry, and the February 2026 update added Computer Use, letting agents control the desktop and browser for GUI testing. The background agent mode spins up an isolated Ubuntu VM, clones your repo, and works on a dedicated branch.&lt;/p&gt;
&lt;p&gt;A typical pull request costs around $4-5 in background agent compute. Cursor priced at $16 per month for the base plan. The community is enormous, which means more tutorials, more extensions, and more people to ask when something breaks.&lt;/p&gt;
&lt;h3&gt;Windsurf&lt;/h3&gt;
&lt;p&gt;Windsurf introduced &amp;quot;Flows,&amp;quot; a persistent context mechanism that keeps the agent aware of your work across sessions. Unlike Cursor, which starts fresh each time, Windsurf remembers what you were working on, what decisions you made, and why you made them.&lt;/p&gt;
&lt;p&gt;The price increased from $15 to $20 per month in March 2026, which caused some grumbling. Windsurf still offers the best continuous context experience, and its multi-model support lets you pick the best model for each task.&lt;/p&gt;
&lt;h3&gt;Antigravity&lt;/h3&gt;
&lt;p&gt;Google&apos;s Antigravity IDE takes a different approach. Instead of a single agent, it spawns parallel agents that work on different parts of the codebase simultaneously. One agent implements the API endpoint while another writes the tests and a third updates the documentation.&lt;/p&gt;
&lt;p&gt;Antigravity includes a built-in Chrome instance for testing, which means the agent can visually verify UI changes without human intervention. The Pro tier costs $20 per month, and Ultra with unlimited parallel agents runs $250. It is the most ambitious IDE in the market, and it shows.&lt;/p&gt;
&lt;h3&gt;Claude Desktop&lt;/h3&gt;
&lt;p&gt;Anthropic&apos;s desktop app wraps Claude Code in a graphical interface. You get the same 1 million token context, the same agent capabilities, and the same model, but with a GUI that shows file diffs, session history, and tool outputs in a readable format.&lt;/p&gt;
&lt;p&gt;Claude Desktop integrates with your local file system and runs code directly on your machine. It is simpler than Cursor or Windsurf, but that simplicity is the point. You do not need to learn a new IDE to use it.&lt;/p&gt;
&lt;p&gt;Claude Desktop includes &lt;strong&gt;Dispatch&lt;/strong&gt;, a feature that lets you hand off long-running tasks to run in the background while you keep working. You tell Dispatch what needs done, and Claude picks up from where it left off whenever you reopen the app. It is not quite a 24/7 agent, but it is the closest thing to one that runs on your local machine. Close the laptop, reopen it later, and Dispatch resumes the task without you needing to re-explain anything.&lt;/p&gt;
&lt;h3&gt;Codex Desktop&lt;/h3&gt;
&lt;p&gt;OpenAI&apos;s desktop application mirrors Claude Desktop but for the GPT-5 series models. It runs on macOS and Windows and lets non-engineers dispatch coding tasks through a chat interface. The cloud sandbox executes code remotely, so you do not need a development environment.&lt;/p&gt;
&lt;p&gt;Codex Desktop has its own version of background execution. You can kick off a task -- refactor a module, add tests, update documentation -- and switch to other work while the agent keeps running in the cloud. The results appear as a pull request when done. Combined with the ChatGPT Pro subscription, this makes Codex Desktop a strong contender for teams that want async coding without managing infrastructure.&lt;/p&gt;
&lt;h3&gt;GitHub Copilot in VS Code&lt;/h3&gt;
&lt;p&gt;Microsoft&apos;s Copilot evolved from autocomplete into a full coding agent inside VS Code. The &amp;quot;Agent Mode&amp;quot; can create files, edit code, run terminal commands, and fix linter errors without switching context. It supports multiple models including Claude Sonnet 4.5 and GPT-5.&lt;/p&gt;
&lt;p&gt;Copilot is the default choice for millions of VS Code users because it ships with the editor. No separate install, no new IDE to learn. The weakness is that it trails purpose-built tools like Cursor on complex multi-file refactors.&lt;/p&gt;
&lt;h3&gt;Continue.dev&lt;/h3&gt;
&lt;p&gt;Continue is the open-source IDE extension that works with both VS Code and JetBrains. With 26,000 GitHub stars, it is the only tool in this category with full cross-editor support. You bring your own models — local via Ollama, cloud via any provider, or a mix of both.&lt;/p&gt;
&lt;p&gt;The tab completion quality is improving, and the slash command system lets you define custom workflows. Continue is not as polished as Cursor, but it is the most flexible option for developers who refuse to switch editors.&lt;/p&gt;
&lt;h3&gt;Cline (VS Code Extension)&lt;/h3&gt;
&lt;p&gt;Cline is the most installed open-source coding extension with 5 million downloads. It operates on a human-in-the-loop model: every file change, terminal command, or browser action requires explicit approval. This sounds slow, but for production codebases, the safety net is worth the friction.&lt;/p&gt;
&lt;p&gt;Cline supports browser automation, checkpoint rollback (undo any agent action), and MCP tools. The checkpoints feature alone has saved me from regenerating files that an overeager agent mangles.&lt;/p&gt;
&lt;h3&gt;Kiro (Amazon)&lt;/h3&gt;
&lt;p&gt;Amazon&apos;s Kiro takes a spec-driven development approach. Before it writes any code, it converts your prompt into EARS notation requirements. The agent then implements against those requirements, creating an auditable trail from request to implementation.&lt;/p&gt;
&lt;p&gt;Kiro has agent hooks that automate follow-ups — run tests on save, deploy on green, rollback on red. The free tier is generous, and the per-prompt credit pricing means you only pay for what you use.&lt;/p&gt;
&lt;h3&gt;Zed&lt;/h3&gt;
&lt;p&gt;Zed is a Rust-native editor that prioritizes speed above everything else. It launches instantly, renders at 120 frames per second, and its AI features are woven into the editor rather than bolted on as an extension. The inline diffs and multi-cursor editing are the best in the business.&lt;/p&gt;
&lt;p&gt;Zed supports Claude, GPT, and local models. It is the fastest editor in the category, but its smaller community means fewer plugins and integrations. If raw speed matters more than ecosystem size, Zed wins.&lt;/p&gt;
&lt;h3&gt;Replit Agent&lt;/h3&gt;
&lt;p&gt;Replit&apos;s agent works entirely in the browser. You describe what you want to build, and the agent creates files, installs dependencies, configures hosting, and deploys. It is the only tool on this list that does not require a local development environment.&lt;/p&gt;
&lt;p&gt;The agent handles deployment automatically, which makes it the best option for prototyping and MVP building. It is less suited for complex production codebases where you need fine-grained control over infrastructure.&lt;/p&gt;
&lt;h3&gt;Mistral Vibe&lt;/h3&gt;
&lt;p&gt;Mistral&apos;s entry into the desktop IDE category uses their Devstral 2 model, which scored 77 percent on SWE-Bench when running autonomously. The source code is Apache 2.0 licensed, so you can inspect and modify it. Paid plans start at $15 per month through Le Chat Pro.&lt;/p&gt;
&lt;p&gt;Devstral 2 is a 123-billion-parameter dense transformer specialized for agentic coding. It is one of the few coding models that performs as well in local deployment as in cloud, which matters for teams with privacy requirements.&lt;/p&gt;
&lt;h3&gt;Tabnine&lt;/h3&gt;
&lt;p&gt;Tabnine predates the current agentic coding wave and has evolved from a completion engine into a full agent. It supports context-aware code generation across your entire project, not just the file you are editing. Tabnine can run fully offline if you use its self-hosted models, and enterprise deployments get code that never leaves your infrastructure.&lt;/p&gt;
&lt;p&gt;The completions are fast, often faster than Cursor&apos;s, but the agent mode is less capable than newer tools. For teams that value privacy above all else, Tabnine is still the strongest option.&lt;/p&gt;
&lt;h3&gt;Codeium (Windsurf base)&lt;/h3&gt;
&lt;p&gt;Codeium was the company behind Windsurf before rebranding, but the core Codeium platform persists as a separate product for teams that want AI-powered completions without switching IDEs. It supports over 40 IDEs and editors, which is more than any competitor.&lt;/p&gt;
&lt;p&gt;The agent mode is less autonomous than Windsurf or Cursor, but the multi-IDE support makes it the default choice for polyglot teams that use a mix of editors.&lt;/p&gt;
&lt;h3&gt;PearAI&lt;/h3&gt;
&lt;p&gt;PearAI is a fork of VS Code with AI features baked in. It wraps multiple agent backends (Claude Code, Codex, OpenAI) behind a single interface. You pick the backend for each task. The philosophy is that no single model is best for everything, so the tool should let you choose without switching editors.&lt;/p&gt;
&lt;p&gt;The setup is more involved than Cursor because you need API keys for each backend. For developers who already have multiple model subscriptions, PearAI consolidates them without forcing you to pick one.&lt;/p&gt;
&lt;h3&gt;Lovable&lt;/h3&gt;
&lt;p&gt;Lovable (formerly GPT Engineer) targets a different audience. It is designed for non-developers who want to build web applications by describing them in natural language. The agent generates the full application, deploys it, and gives you a URL to share.&lt;/p&gt;
&lt;p&gt;Lovable handles the entire lifecycle from idea to deployment. The generated code is production-quality but generic. You get a working app fast, and customizing it later requires understanding the codebase Lovable generated.&lt;/p&gt;
&lt;h3&gt;Bolt.new&lt;/h3&gt;
&lt;p&gt;StackBlitz&apos;s Bolt.new runs entirely in the browser. You describe an application, and Bolt.new creates files, installs dependencies, and deploys to a preview URL, all inside a web container. No local setup, no IDE download.&lt;/p&gt;
&lt;p&gt;Bolt.new is the fastest way to go from idea to running prototype. It is not designed for existing codebases or enterprise projects, but for validating an idea in minutes, nothing else comes close.&lt;/p&gt;
&lt;h3&gt;v0 by Vercel&lt;/h3&gt;
&lt;p&gt;Vercel&apos;s v0 started as a UI generation tool and expanded into full-stack application generation. You describe a component or page, and v0 generates React/Next.js code with Tailwind styling. The agent mode can create multi-page applications with routing and data fetching.&lt;/p&gt;
&lt;p&gt;v0 is optimized for the Vercel ecosystem. If you deploy on Vercel and use Next.js, the generated code integrates naturally. Outside that stack, some features break.&lt;/p&gt;
&lt;h3&gt;Galileo&lt;/h3&gt;
&lt;p&gt;Galileo is unique in this category because it is built for data scientists and ML engineers rather than application developers. It generates Python data pipelines, visualization code, and ML training scripts. The agent understands pandas, NumPy, scikit-learn, PyTorch, and Jupyter notebooks.&lt;/p&gt;
&lt;p&gt;Galileo can execute code inline and display charts and tables in the chat interface. For data teams, it fills a gap that general-purpose coding agents handle poorly.&lt;/p&gt;
&lt;h2&gt;24/7 Autonomous Agents: Your Codebase Never Sleeps&lt;/h2&gt;
&lt;p&gt;The most interesting shift in 2026 is the move from interactive pair programming to asynchronous delegation. These agents live in your chat apps, accept tasks while you are away, and deliver results when you check back.&lt;/p&gt;
&lt;h3&gt;OpenClaw&lt;/h3&gt;
&lt;p&gt;OpenClaw is the largest open-source agent runtime by adoption with 369,000 GitHub stars and 3.2 million active users. It runs on Node.js, bridges 7-plus messaging platforms (Telegram, Discord, Slack, WhatsApp, Signal, WeChat), and routes tasks to any LLM backend.&lt;/p&gt;
&lt;p&gt;The sub-agent orchestration via the Agent Client Protocol (ACP) lets OpenClaw dispatch coding work to Claude Code, Codex CLI, or Cursor as sub-agents. The ClawHub marketplace has 44,000 community skills. Need an agent that monitors your AWS bill and DMs you when costs spike? There is a skill for that.&lt;/p&gt;
&lt;p&gt;OpenClaw runs on a single &lt;code&gt;npx openclaw&lt;/code&gt; command or a DigitalOcean one-click droplet for about $24 per month. The ecosystem includes KiloClaw ($49 per month managed hosting), NemoClaw (NVIDIA enterprise container), and ZeroClaw (Rust reimplementation for performance).&lt;/p&gt;
&lt;p&gt;The weakness is that self-hosting carries operational burden, and skill quality in the marketplace varies widely. For a non-profit project with no corporate backer (the creator joined OpenAI in February 2026), the momentum is remarkable.&lt;/p&gt;
&lt;h3&gt;Hermes Agent&lt;/h3&gt;
&lt;p&gt;Hermes Agent from Nous Research launched in February 2026 and grew to 64,000 GitHub stars in three months. It is a Python-based, self-improving agent harness. Every time it solves a problem, it generates a skill document so it can reuse that approach later without being told.&lt;/p&gt;
&lt;p&gt;The persistent cross-session memory uses FTS5 session search and LLM-curated memory with periodic nudges. Hermes connects to Telegram, Discord, Slack, WhatsApp, and Signal. It runs on local, Docker, SSH, Singularity, Modal, Daytona, and Vercel Sandbox.&lt;/p&gt;
&lt;p&gt;What sets Hermes apart is the learning loop. It builds a deep profile of your preferences and work patterns using Honcho dialectic user modeling. Over time, it gets better at predicting what you want before you ask. The built-in &lt;code&gt;hermes claw migrate&lt;/code&gt; tool lets you import configs from OpenClaw, which has made the two projects more complementary than competitive.&lt;/p&gt;
&lt;h3&gt;NemoClaw&lt;/h3&gt;
&lt;p&gt;NVIDIA&apos;s enterprise variant of OpenClaw wraps the agent runtime in a hardened container with TensorRT-LLM optimized inference. Multi-GPU support distributes inference across NVIDIA hardware for larger models. Data never leaves your infrastructure.&lt;/p&gt;
&lt;p&gt;NemoClaw is the only option on this list with automatic quantization, batching, and caching built in. It requires NVIDIA GPUs, which limits adoption, but for organizations that already run on NVIDIA hardware, the inference performance is unmatched.&lt;/p&gt;
&lt;h3&gt;KiloClaw&lt;/h3&gt;
&lt;p&gt;KiloClaw is the managed hosting layer for OpenClaw at $49 per month. It handles the deployment, monitoring, and updates so you do not have to maintain the infrastructure yourself. The value proposition is simple: OpenClaw&apos;s capabilities without the operations overhead.&lt;/p&gt;
&lt;p&gt;For teams that want OpenClaw&apos;s integration breadth but lack the DevOps bandwidth, KiloClaw is the bridge. Fifty dollars per month for a fully managed agent gateway is cheap compared to the engineering time needed to self-host.&lt;/p&gt;
&lt;h3&gt;AutoGen (Microsoft)&lt;/h3&gt;
&lt;p&gt;Microsoft&apos;s AutoGen framework takes a different approach. Instead of a single agent runtime, it is a multi-agent conversation framework where specialized agents collaborate on tasks. You define agents with different roles, tools, and models, and AutoGen manages the conversation flow between them.&lt;/p&gt;
&lt;p&gt;AutoGen is less turnkey than OpenClaw or Hermes. You write code to define agent behavior. But for complex workflows where different agents need different capabilities, it offers the most flexibility. The ecosystem includes templates for common patterns: code generation agent plus review agent plus test agent.&lt;/p&gt;
&lt;h3&gt;CrewAI&lt;/h3&gt;
&lt;p&gt;CrewAI is similar to AutoGen but opinionated toward role-based agent crews. You define a crew with a manager and workers, each with specific responsibilities and tools. The manager agent decomposes tasks and assigns them to workers.&lt;/p&gt;
&lt;p&gt;CrewAI is easier to get started with than AutoGen because the role abstraction maps naturally to how teams think about work. The tradeoff is less control over conversation dynamics. For straightforward delegation patterns, CrewAI is the better choice.&lt;/p&gt;
&lt;h3&gt;LangGraph Agents&lt;/h3&gt;
&lt;p&gt;LangChain&apos;s LangGraph framework adds structured workflow graphs to autonomous agents. Instead of letting the agent figure out the sequence of steps, you define a graph of nodes (tasks) and edges (transitions). The agent navigates the graph, executing nodes and deciding which path to take based on results.&lt;/p&gt;
&lt;p&gt;LangGraph shines for workflows where certain steps must happen in order. A code generation workflow might have: plan, implement, test, review, deploy. Each phase has different tools and success criteria. The graph structure enforces the sequence without hardcoding logic.&lt;/p&gt;
&lt;h3&gt;Paperclip Agent&lt;/h3&gt;
&lt;p&gt;Paperclip is a newer entrant focused on single-purpose autonomous agents. Instead of building a general-purpose agent that can do anything, Paperclip lets you spawn specialized agents for specific tasks: a PR reviewer agent, a dependency update agent, a documentation sync agent.&lt;/p&gt;
&lt;p&gt;Each Paperclip agent runs on its own schedule, monitors its trigger conditions, and executes only its designated function. The architecture keeps agents simple and reliable. If a PR reviewer agent breaks, the dependency updater keeps running. Paperclip is the microservices approach to agent architecture.&lt;/p&gt;
&lt;h3&gt;Claude Code Channels&lt;/h3&gt;
&lt;p&gt;Anthropic&apos;s research preview extends Claude Code into messaging platforms via MCP plugins. Your Claude Code agent lives in Telegram, Discord, or iMessage and executes code on your local development machine. It inherits all Claude Code features: skills, agents, MCP tools, and the full 1 million token context.&lt;/p&gt;
&lt;p&gt;Code Channels requires Anthropic Max ($100-200 per month). The agent stops if Claude Code stops, so it is session-bound rather than truly 24/7. But for developers who already pay for Claude and want mobile access to their coding agent, it fills a specific gap.&lt;/p&gt;
&lt;h3&gt;Devin&lt;/h3&gt;
&lt;p&gt;Cognition&apos;s Devin was the first &amp;quot;AI software engineer&amp;quot; to capture mainstream attention, and it has matured into a production tool used by Goldman Sachs in a hybrid workforce model of 12,000 human developers plus agents.&lt;/p&gt;
&lt;p&gt;Devin spins up a full cloud VM with browser, terminal, and editor. You assign tasks via Slack or web UI, and Devin delivers a pull request with tests and documentation. The pricing is $20 per month for Core plus ACU compute at $9 per hour of active work. The team plan runs $500 per month with 250 ACUs.&lt;/p&gt;
&lt;p&gt;Devin is the most polished cloud agent, but it is also the most expensive for heavy usage. The code leaves your infrastructure, which is a blocker for some enterprises.&lt;/p&gt;
&lt;h3&gt;Cursor Background Agents&lt;/h3&gt;
&lt;p&gt;Cursor&apos;s background agent mode uses an isolated Ubuntu VM that clones your repo and works on an &lt;code&gt;agent/&lt;/code&gt; branch. The February 2026 upgrade added Computer Use, letting the agent test GUI changes by controlling a desktop environment.&lt;/p&gt;
&lt;p&gt;Multiple agents can work in parallel, and a typical pull request costs around $4-5 in compute. The downside is that it is tied to Cursor IDE, so you need to run Cursor for background agents to function.&lt;/p&gt;
&lt;h3&gt;GitHub Copilot Coding Agent&lt;/h3&gt;
&lt;p&gt;The Copilot Coding Agent works directly from GitHub issues. You assign an issue, and the agent creates a branch, implements the feature, writes tests, and opens a pull request. No context switching, no explanation needed.&lt;/p&gt;
&lt;p&gt;Pricing runs $10-39 per seat per month depending on the plan. GitHub is switching to usage-based billing in June 2026, which will change the cost calculus. The agent works best for well-scoped issues like bug fixes, tests, and documentation. Complex architectural changes still need human guidance.&lt;/p&gt;
&lt;h3&gt;Jules (Google)&lt;/h3&gt;
&lt;p&gt;Google&apos;s Jules runs on Gemini 2.5 Pro and integrates with GitHub. It clones your repository into Google Cloud VMs, implements changes, and opens pull requests. While in free preview, it has no production dependency guarantee yet.&lt;/p&gt;
&lt;p&gt;Jules is the most generous cloud agent in terms of cost, but it is also the least mature. The Gemini-powered reasoning is strong, and the free tier makes it worth trying. Relying on it for production work is premature.&lt;/p&gt;
&lt;h3&gt;OpenAI Codex Cloud Agents&lt;/h3&gt;
&lt;p&gt;Beyond the CLI version, OpenAI runs cloud-hosted agents inside sandboxed environments via ChatGPT or the API. Token-based pricing at $1.50 per million input tokens and $6 per million output tokens through the &lt;code&gt;codex-mini-latest&lt;/code&gt; model.&lt;/p&gt;
&lt;p&gt;Codex cloud agents support multi-agent runs and can handle long autonomous sessions. The desktop app (macOS and Windows) wraps these capabilities in a GUI. For teams already in the OpenAI ecosystem, this is the most natural extension of their existing workflow.&lt;/p&gt;
&lt;h3&gt;OpenHands&lt;/h3&gt;
&lt;p&gt;OpenHands (formerly OpenDevin) is an open-source platform for autonomous coding agents. It operates in a Docker sandbox with a web interface, terminal, and file explorer. Agents can write code, run commands, browse the web, and interact with APIs.&lt;/p&gt;
&lt;p&gt;The project focuses on reproducibility and safety. Every agent action is logged, containerized, and auditable. It does not have the polish of Devin or the scale of OpenClaw, but for teams that want full control over agent behavior and data, OpenHands is a strong choice.&lt;/p&gt;
&lt;h2&gt;Model Routers: The Plumbing Layer&lt;/h2&gt;
&lt;p&gt;Every agent needs a brain, and the model router is the switchboard that connects agents to the right model at the right time. This category has grown from simple API proxies into intelligent routing systems that optimize for cost, latency, and capability simultaneously.&lt;/p&gt;
&lt;h3&gt;OpenRouter&lt;/h3&gt;
&lt;p&gt;OpenRouter is the most widely used model router with the largest model catalog. It provides one unified API for every major model provider and many smaller ones. You send a request using the OpenAI SDK format, and OpenRouter routes it to the model you specify.&lt;/p&gt;
&lt;p&gt;The v2 &amp;quot;Smart Routing&amp;quot; feature automatically picks the cheapest model that meets your requirements based on capability tags. Semantic caching reuses responses for similar queries, reducing costs by up to 60 percent. OpenRouter handles fallback logic, so if one provider is down, traffic routes to another.&lt;/p&gt;
&lt;p&gt;OpenRouter processed billions of tokens per day as of early 2026. It is the default model router for most open-source agent projects including OpenCode, Hermes, and Cline. The free tier includes access to 27 models with no credit card required.&lt;/p&gt;
&lt;h3&gt;Nous Portal&lt;/h3&gt;
&lt;p&gt;Nous Research&apos;s model gateway is integrated into Hermes Agent and provides access to 200-plus models. It optimizes for agentic workflows specifically: chain-of-thought traces, tool call formatting, and structured output are first-class concerns, not afterthoughts.&lt;/p&gt;
&lt;p&gt;The Portal supports custom endpoint configuration and OpenRouter as a fallback. It is designed for developers who want fine-grained control over model selection for different task types. Complex reasoning routes to expensive models, while file operations use cheaper local models.&lt;/p&gt;
&lt;p&gt;Nous Portal is younger than OpenRouter but growing fast because it ships with Hermes Agent by default. If you run Hermes, you are already using it.&lt;/p&gt;
&lt;h3&gt;OpenCode Zen&lt;/h3&gt;
&lt;p&gt;OpenCode Zen is the model routing layer within the OpenCode ecosystem. It abstracts model selection behind capability profiles. You define what you need: &amp;quot;fast edit&amp;quot; or &amp;quot;deep reasoning&amp;quot; or &amp;quot;code review.&amp;quot; Zen picks the cheapest model that satisfies the profile.&lt;/p&gt;
&lt;p&gt;The SLIM plugin system lets you define custom routing rules. OpenCode Zen also supports multi-model conversations where different turns go to different models. The first turn uses Sonnet for planning, and subsequent turns use a local Qwen model for execution.&lt;/p&gt;
&lt;h3&gt;OpenRouter Smart Routing&lt;/h3&gt;
&lt;p&gt;A separate mention because Smart Routing in OpenRouter v2 deserves its own spotlight. This feature tags models by capability (reasoning, coding, vision, tool use, structured output, long context) and prices. Your request specifies requirements; OpenRouter finds the cheapest combination.&lt;/p&gt;
&lt;p&gt;Smart Routing cuts costs by 30 to 50 percent compared to manual model selection. The tradeoff is predictable latency. The cheapest model for a task is not always the fastest.&lt;/p&gt;
&lt;h3&gt;Portkey&lt;/h3&gt;
&lt;p&gt;Portkey started as an observability layer for LLMs and evolved into a full gateway. It offers caching, fallbacks, rate limiting, and guardrails alongside routing. The observability features include cost tracking, latency monitoring, and failure analysis.&lt;/p&gt;
&lt;p&gt;Portkey is more enterprise-oriented than OpenRouter. It is built for teams that need audit trails, compliance controls, and detailed analytics. The open-source self-hosted version gives you full data control.&lt;/p&gt;
&lt;h3&gt;LiteLLM&lt;/h3&gt;
&lt;p&gt;LiteLLM is the Python-native gateway that supports 100-plus providers through a consistent interface. It is lightweight by design, running as a single Python package or Docker container. The SDK translates between provider-specific formats automatically.&lt;/p&gt;
&lt;p&gt;LiteLLM is the default choice for Python projects that need model routing without adding a dependency on a cloud service. It handles rate limiting, retries, and fallback out of the box.&lt;/p&gt;
&lt;h3&gt;Helix (Kilo Code)&lt;/h3&gt;
&lt;p&gt;Kilo Code&apos;s built-in router, Helix, optimizes for coding agent workflows specifically. It understands which models excel at which coding tasks — code generation, refactoring, debugging, test writing — and routes accordingly.&lt;/p&gt;
&lt;p&gt;Helix supports 500-plus models across 60-plus providers. The real-time cost display shows exactly what each model choice costs per turn, which builds intuition about model economics over time.&lt;/p&gt;
&lt;h3&gt;Amazon Bedrock / Google Vertex AI&lt;/h3&gt;
&lt;p&gt;The cloud provider gateways are not the most exciting routers, but they are the most important for enterprise deployments. Bedrock and Vertex AI provide access to multiple models through a single API with enterprise security, compliance certifications, and SLA guarantees.&lt;/p&gt;
&lt;p&gt;Bedrock supports Anthropic, Meta, Mistral, Cohere, and Amazon&apos;s own models. Vertex AI supports Gemini, Claude, and select open models. They charge no markup on model calls, only infrastructure and gateway fees.&lt;/p&gt;
&lt;h3&gt;Gateway Providers (Kong, Azure API Management, Apigee)&lt;/h3&gt;
&lt;p&gt;For organizations that already use API gateways for their microservices, extending them to LLM routing is a natural step. Kong&apos;s AI Gateway, Azure API Management&apos;s model routing, and Google Apigee all support LLM request routing with the same governance controls applied to regular APIs.&lt;/p&gt;
&lt;p&gt;These tools are not designed for individual developers. They are for platform teams that need to centralize LLM access controls, cost allocation, and compliance across their organization.&lt;/p&gt;
&lt;h3&gt;Custom Routing with LangChain / LlamaIndex&lt;/h3&gt;
&lt;p&gt;Some teams build their own routers using LangChain or LlamaIndex. The advantage is complete control over routing logic. You can implement priority queues, multi-model voting, or progressive escalation where a cheaper model handles the first pass and a more expensive one reviews the output.&lt;/p&gt;
&lt;p&gt;The disadvantage is operational complexity. Running your own router means maintaining your own provider integrations, fallback logic, and cost tracking. For most teams, OpenRouter or LiteLLM is the better starting point.&lt;/p&gt;
&lt;h3&gt;AI Gateway by Portkey&lt;/h3&gt;
&lt;p&gt;Portkey&apos;s AI Gateway deserves a second look because it goes beyond routing into full lifecycle management. It offers caching at multiple levels (semantic, exact, prefix), request-level guardrails that block harmful or off-topic prompts before they reach the model, and usage-based billing controls that prevent budget overruns.&lt;/p&gt;
&lt;p&gt;The enterprise version adds SOC 2 compliance, audit logs, and role-based access control. Portkey is the right choice when your organization needs to govern, not just route, model usage.&lt;/p&gt;
&lt;h3&gt;Helicone&lt;/h3&gt;
&lt;p&gt;Helicone focuses on observability for model routers. It captures every request and response, builds usage dashboards, and alerts on cost spikes or latency degradation. It integrates with OpenRouter, LiteLLM, and custom endpoints through a proxy layer.&lt;/p&gt;
&lt;p&gt;Helicone does not route traffic itself. It sits alongside your router and makes the data visible. For teams that want to understand their model spend before optimizing it, Helicone provides the baseline.&lt;/p&gt;
&lt;h3&gt;OpenRouter Model Rankings&lt;/h3&gt;
&lt;p&gt;OpenRouter publishes monthly model rankings based on actual usage data across its platform. The April 2026 rankings showed MiMo V2 Pro at number one with 4.65 trillion tokens processed, followed by Qwen 3.6 Plus at number three. Xiaomi held 22.3 percent of total market share by model count.&lt;/p&gt;
&lt;p&gt;These rankings matter because they reveal what developers actually use, not what benchmarks say. A model that scores high on SWE-Bench but costs five times the runner-up will not see as much production traffic. The rankings are a reality check against benchmark hype.&lt;/p&gt;
&lt;h3&gt;Multi-Model Routing Strategies&lt;/h3&gt;
&lt;p&gt;Beyond specific tools, the routing strategies themselves deserve attention. The most common pattern in 2026 is tiered routing: a cheap local model handles syntax corrections and quick completions, a mid-tier cloud model handles code generation and refactoring, and an expensive reasoning model only activates for architecture decisions and complex bug diagnosis.&lt;/p&gt;
&lt;p&gt;Another pattern gaining traction is ensemble routing, where two models independently solve the same problem and a third model evaluates both solutions. This catches hallucinations by cross-checking outputs. The token cost doubles or triples, but for safety-critical code, the redundancy is worth it.&lt;/p&gt;
&lt;p&gt;Some teams use router-as-judge patterns where the router itself is a lightweight model that evaluates task complexity and routes accordingly. The router model costs pennies per request and prevents expensive models from being wasted on trivial tasks.&lt;/p&gt;
&lt;h2&gt;Choosing the Right Stack&lt;/h2&gt;
&lt;p&gt;There is no single best agentic coding setup. The right combination depends on your workflow, budget, and tolerance for complexity.&lt;/p&gt;
&lt;p&gt;For terminal purists who want maximum capability per dollar, Claude Code with OpenRouter fallback covers most scenarios. Add Hermes Agent for async background tasks, and you have a setup that handles both interactive coding and unattended maintenance.&lt;/p&gt;
&lt;p&gt;For IDE-first developers, Cursor or Windsurf with Claude Code as the background agent gives you the polished editing experience with Cursor&apos;s tab completions and Claude Code&apos;s reasoning capability when you need deep context.&lt;/p&gt;
&lt;p&gt;For teams that want to delegate entirely, OpenClaw or Hermes Agent connected to Slack or Discord, backed by OpenRouter for model routing, lets your team assign tasks through chat and review pull requests when agents finish.&lt;/p&gt;
&lt;p&gt;The model router matters more than most developers think. The difference between paying full retail for Claude Opus and using OpenRouter&apos;s smart routing is often 40 to 60 percent savings. For heavy users, that savings pays for a router subscription several times over.&lt;/p&gt;
&lt;h2&gt;Tradeoffs and Limitations&lt;/h2&gt;
&lt;p&gt;Every tool in this list has blind spots.&lt;/p&gt;
&lt;p&gt;CLI agents are powerful but remove visual feedback. You cannot easily verify UI changes from a terminal.&lt;/p&gt;
&lt;p&gt;Desktop IDEs offer the best integration but lock you into their ecosystem. Moving from Cursor to Windsurf to Antigravity means learning new workflows each time.&lt;/p&gt;
&lt;p&gt;24/7 agents are asynchronous by nature. You give them a task and come back later. For quick edits, the round trip time is worse than just making the change yourself.&lt;/p&gt;
&lt;p&gt;Model routers add a layer of abstraction that can fail. When OpenRouter is down, every tool downstream stops working. Self-hosted routers like LiteLLM avoid this but add operational overhead.&lt;/p&gt;
&lt;p&gt;None of these tools understand your business context. They can generate syntactically correct code that solves the wrong problem. Code review by a human who understands the domain is not optional.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The agentic coding tool landscape in 2026 is defined by diversity and choice. Four years ago, you had GitHub Copilot completions and not much else. Now you have specialized CLI agents, integrated IDEs, autonomous background workers, and intelligent routing that optimizes every API call.&lt;/p&gt;
&lt;p&gt;Start with one category. If you live in the terminal, try Claude Code or OpenCode. If you prefer a GUI, Cursor or Windsurf. If you want to delegate background work, OpenClaw or Hermes Agent. Connect everything through OpenRouter or LiteLLM for model routing.&lt;/p&gt;
&lt;p&gt;Stick with that stack for a month. See what works, what frustrates you, and what you wish the tools did differently. The ecosystem is moving fast enough that a gap today might be a feature next month. That pace is exciting, but it also means the best setup is the one you actually use.&lt;/p&gt;
&lt;p&gt;If this deep dive got you thinking about how agentic systems fit into the bigger picture of data architecture and AI workflows, I have written extensively on both topics. Check out my books on data architecture and agentic AI at &lt;a href=&quot;http://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Active Monitoring: How Agentic AI Auto-Heals and Protects Enterprise Data Pipelines</title><link>https://iceberglakehouse.com/posts/active-monitoring-agentic-ai-pipelines/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/active-monitoring-agentic-ai-pipelines/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-05-active-monitoring-agentic-ai-pip...</description><pubDate>Thu, 28 May 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-05-active-monitoring-agentic-ai-pipelines/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1&gt;Active Monitoring: How Agentic AI Auto-Heals and Protects Enterprise Data Pipelines&lt;/h1&gt;
&lt;p&gt;Static alert thresholds work until they don&apos;t. You configure a row count alert for your daily orders table: fire if today&apos;s count is more than 20% below yesterday&apos;s count. The threshold is reasonable on average, but on Mondays after long weekends, Tuesday after a sales spike, and the first of every month when batch reprocessing runs, it fires false positives. After three months of false alarms, the team stops responding to alerts promptly. Then a real failure goes undetected for six hours.&lt;/p&gt;
&lt;p&gt;The problem with static alerts is that they can&apos;t distinguish expected variation from genuine failure. Agentic monitoring systems can : because they reason about the cause of the deviation, not just the deviation itself.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/agentic-pipeline-monitoring.png&quot; alt=&quot;Agentic data pipeline monitoring architecture&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Why Static Alerts Fail at Scale&lt;/h2&gt;
&lt;p&gt;Enterprise data platforms with hundreds of pipelines generate thousands of alert candidates per day. Static threshold alerts : row count drops, latency spikes, error rate increases , are cheap to configure and cheap to ignore.&lt;/p&gt;
&lt;p&gt;The limitations:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No context awareness:&lt;/strong&gt; A 15% row count drop on the orders table might be an upstream data loss, or it might be Monday. The alert fires either way. The on-call engineer has to investigate to distinguish them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No correlation:&lt;/strong&gt; A broken Salesforce API affects the CRM pipeline, which affects the customer model, which affects the revenue forecast, which affects the executive dashboard. Static alerts fire on each affected table independently. The engineer receives five separate alerts without knowing they have a single root cause.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No self-correction:&lt;/strong&gt; When a static alert fires, it waits for a human response. If the human is unavailable, the problem persists. Static systems can detect failures but can&apos;t act on them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Alert fatigue:&lt;/strong&gt; When alert noise is high, signal detection degrades. Teams tune thresholds loosely to reduce noise and miss genuine failures.&lt;/p&gt;
&lt;h2&gt;The Agentic Monitoring Architecture&lt;/h2&gt;
&lt;p&gt;An agentic monitoring system replaces static threshold checks with a reasoning agent that investigates deviations autonomously.&lt;/p&gt;
&lt;p&gt;The architecture has three components:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Metric collection:&lt;/strong&gt; Regular execution of health check queries against each monitored table or pipeline. Row counts, null rates, maximum values, record freshness, processing latency. These run on a schedule : every 15 minutes for critical pipelines, hourly for less critical ones.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Deviation detection:&lt;/strong&gt; Statistical comparison of current metrics against rolling baselines. Rather than fixed thresholds, the system uses learned baselines: the expected range for this hour of this day of the week, adjusted for recent trends. A 15% drop on Monday morning is different from a 15% drop on a Tuesday afternoon.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Agentic investigation:&lt;/strong&gt; When the deviation detector flags an anomaly, an agent is invoked with the anomaly description and access to query tools. The agent investigates: it queries upstream tables to check whether source data arrived, checks error logs through SQL queries against a logging table, traces the lineage to identify which pipeline stages have run successfully, and determines the root cause.&lt;/p&gt;
&lt;p&gt;The agent&apos;s investigation output is structured: root cause identified, severity assessment, affected downstream pipelines, and recommended action.&lt;/p&gt;
&lt;h2&gt;Anomalous Trace Analysis&lt;/h2&gt;
&lt;p&gt;When an agentic monitoring system detects a pipeline failure, the investigation follows a trace pattern from the symptom backward to the source.&lt;/p&gt;
&lt;p&gt;For a broken daily orders table, the agent traces:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Check the staging table that feeds the orders table : did it receive records today?&lt;/li&gt;
&lt;li&gt;Check the source API extraction job log : did it complete successfully?&lt;/li&gt;
&lt;li&gt;Check the raw landing zone : are files present with today&apos;s timestamp?&lt;/li&gt;
&lt;li&gt;Check the extraction job error table : did any extraction attempts fail?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each check is a SQL query against the relevant metadata or log table. The agent runs them in sequence, stopping when it finds where the chain broke.&lt;/p&gt;
&lt;p&gt;This trace analysis takes 30–60 seconds in a well-configured system. A human engineer doing the same investigation manually typically takes 15–30 minutes, assuming they know the lineage well enough to know which logs to check.&lt;/p&gt;
&lt;h2&gt;Automated Rollback and Recovery&lt;/h2&gt;
&lt;p&gt;When the agentic system identifies a fixable failure : a stuck job, a missing file that needs to be re-fetched, a pipeline that needs a re-run from a checkpoint , it can take action without human intervention.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Automatic re-runs:&lt;/strong&gt; If the investigation confirms that the pipeline failed due to a transient network error and the source data is still available, the agent triggers a re-run. It monitors the re-run and confirms success.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Checkpoint rollback:&lt;/strong&gt; If a pipeline produced incorrect output (detected through data quality checks), the agent can roll back the Iceberg table to the last valid snapshot, remove the bad data, and trigger a corrective re-run. Iceberg&apos;s time travel capability makes the rollback operation safe : the previous valid state is available as a snapshot.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Rollback an Iceberg table to the last valid snapshot
ROLLBACK TABLE my_catalog.analytics.orders_daily
TO SNAPSHOT 7234567890123456789;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Notification with context:&lt;/strong&gt; For failures that require human intervention (the source system is down, access credentials have expired), the agent generates a structured notification with the full investigation trace, the root cause diagnosis, the affected pipelines, and the recommended human action. The on-call engineer receives a complete diagnosis, not just an alert number.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/pipeline-failure-trace-rollback.png&quot; alt=&quot;Data pipeline failure trace and rollback workflow&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Integrating with Dremio&lt;/h2&gt;
&lt;p&gt;Dremio tables managed through the Open Catalog support Iceberg time travel and snapshot management, which are the mechanisms agentic rollback relies on. An agentic monitoring agent connected to Dremio via the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;MCP server&lt;/a&gt; can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Query Dremio&apos;s metadata tables to check table health&lt;/li&gt;
&lt;li&gt;Examine snapshot history to identify when a table&apos;s content changed unexpectedly&lt;/li&gt;
&lt;li&gt;Execute rollback statements through the SQL interface&lt;/li&gt;
&lt;li&gt;Query pipeline log tables for error analysis&lt;/li&gt;
&lt;li&gt;Trigger Dremio&apos;s automatic table optimization jobs to clean up after a corrupted write&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The combination of Dremio&apos;s semantic layer (for understanding what the data should look like) and Iceberg&apos;s snapshot management (for reverting to a known-good state) makes agentic auto-healing practical rather than theoretical.&lt;/p&gt;
&lt;h2&gt;The Limits of Autonomous Recovery&lt;/h2&gt;
&lt;p&gt;Agentic auto-healing is not appropriate for all failure modes.&lt;/p&gt;
&lt;p&gt;When the root cause is external : a source system that changed its schema, a third-party API that returned corrupted data, a credential that expired , automatic recovery risks hiding the problem rather than fixing it. The pipeline re-runs, fails again, re-runs, and the cycle continues until human intervention.&lt;/p&gt;
&lt;p&gt;Configure your agentic monitoring to auto-heal only for specific, well-defined failure patterns where automatic recovery is safe. For everything else, use the agent for investigation and notification, but require human approval before executing recovery actions.&lt;/p&gt;
&lt;p&gt;Set a maximum auto-heal attempts counter (3 is a common limit) and escalate to pager alert after exhausting auto-heal retries. This prevents silent infinite loops where the agent keeps re-running a broken pipeline.&lt;/p&gt;
&lt;h2&gt;Building the Monitoring Knowledge Base&lt;/h2&gt;
&lt;p&gt;An agentic monitoring system gets better as it accumulates experience with your specific pipelines. Build a knowledge base that the agent can reference during investigations.&lt;/p&gt;
&lt;p&gt;The knowledge base should contain:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Known patterns:&lt;/strong&gt; &amp;quot;Orders table always drops 30% on Mondays after long weekends. This is expected.&amp;quot; The agent should recognize this pattern before flagging it as an anomaly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dependency maps:&lt;/strong&gt; Which pipelines feed which tables. If the agent knows that the revenue_forecast table depends on orders_daily, which depends on the Salesforce extraction job, it can jump directly to checking the extraction job when revenue_forecast fails.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Recovery playbooks:&lt;/strong&gt; &amp;quot;If the Salesforce API returns 429 rate limit errors, wait 2 hours and retry. Do not trigger an immediate re-run.&amp;quot; Documented playbooks prevent the agent from taking counterproductive recovery actions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Escalation contacts:&lt;/strong&gt; For each pipeline, who should be notified when auto-heal exhausts its retry limit.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Store this knowledge base in structured form : a table in Dremio or a YAML file the agent loads at startup. The agent reads the knowledge base before investigating any anomaly and uses it to contextualize its analysis.&lt;/p&gt;
&lt;h2&gt;Measuring Monitoring System Quality&lt;/h2&gt;
&lt;p&gt;Track these metrics to evaluate whether your agentic monitoring system is working:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mean time to detect (MTTD):&lt;/strong&gt; How long after a failure occurs does the system identify it? Target under 15 minutes for critical pipelines.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;False positive rate:&lt;/strong&gt; What percentage of alerts require no human action because the system flagged expected variation? If your false positive rate is over 20%, review and update your deviation detection baselines.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Auto-heal success rate:&lt;/strong&gt; Of the failures where auto-heal was attempted, what percentage resolved without human intervention? A rate above 60% suggests good pattern coverage.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mean time to resolution (MTTR):&lt;/strong&gt; For failures that required human intervention, how long did resolution take? Agentic investigation should reduce MTTR by giving engineers a complete diagnosis immediately, rather than requiring them to build the investigation from scratch.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and build your first agentic monitoring workflow on top of your Iceberg data pipelines.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Anatomy of an Agentic Analytics System: Inside the Multi-Step Reasoning Loop</title><link>https://iceberglakehouse.com/posts/anatomy-agentic-analytics-system/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/anatomy-agentic-analytics-system/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-05-anatomy-agentic-analytics-system...</description><pubDate>Thu, 28 May 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-05-anatomy-agentic-analytics-system/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1&gt;Anatomy of an Agentic Analytics System: Inside the Multi-Step Reasoning Loop&lt;/h1&gt;
&lt;p&gt;When someone asks &amp;quot;how does an agentic analytics system work,&amp;quot; the usual answer is &amp;quot;it uses AI to answer questions.&amp;quot; That&apos;s accurate in the way that &amp;quot;a jet engine uses combustion to fly&amp;quot; is accurate : technically true, completely insufficient for understanding what&apos;s actually happening.&lt;/p&gt;
&lt;p&gt;This post opens the hood on the agentic analytics architecture: what the reasoning loop does, how the LLM uses tools to execute queries, how the agent handles errors and refines its approach, and what makes the difference between an agent that produces useful answers and one that confidently produces wrong ones.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/agentic-system-architecture.png&quot; alt=&quot;Agentic analytics system architecture diagram ReAct loop&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The ReAct Loop: Reason, Act, Observe&lt;/h2&gt;
&lt;p&gt;The core pattern in modern agentic analytics systems is ReAct: Reasoning + Acting. The loop runs in three phases that repeat until the agent satisfies the goal or determines it can&apos;t.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reasoning:&lt;/strong&gt; The LLM analyzes the current state : the original goal, any results from previous query steps, any errors encountered , and determines what to do next. This produces a structured &amp;quot;thought&amp;quot;: a description of the hypothesis to test, what data is needed, and which tool to use.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Acting:&lt;/strong&gt; The agent invokes a tool. For analytics, the primary tools are SQL execution and schema exploration. The agent generates a SQL query and sends it to the query engine through a structured function call. The function call is typed and validated : not free-form text that gets parsed, but a defined schema with query text, target catalog, and execution parameters.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Observing:&lt;/strong&gt; The system captures the tool&apos;s output and returns it to the agent. For a successful SQL query, this is the result set. For a failed query, this is the error message and any database context available. The agent reads the observation, updates its understanding of the problem, and decides whether to continue, retry with a corrected approach, or declare the investigation complete.&lt;/p&gt;
&lt;p&gt;The loop runs sequentially: one thought, one action, one observation. No parallel reasoning within a single loop iteration. Complex investigations may run 10–20 iterations before reaching a conclusion.&lt;/p&gt;
&lt;h2&gt;Tool Access Design: What the Agent Can Do&lt;/h2&gt;
&lt;p&gt;The quality of an agentic analytics system depends heavily on which tools the agent has access to and how they&apos;re defined.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Schema exploration tools&lt;/strong&gt; let the agent discover tables, columns, data types, and relationships without prior knowledge of the schema. The agent can call &lt;code&gt;list_schemas()&lt;/code&gt;, &lt;code&gt;describe_table(table_name)&lt;/code&gt;, or &lt;code&gt;get_column_statistics(table, column)&lt;/code&gt; before writing any SQL. This is essential for schemas the agent hasn&apos;t encountered before.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;SQL execution tools&lt;/strong&gt; run queries against the data platform and return results. The tool definition should include the max result size (to prevent returning millions of rows to the agent&apos;s context), timeout limits, and which catalog and schema to query.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Metadata tools&lt;/strong&gt; access documentation in the semantic layer: wiki descriptions, column labels, metric definitions. When the agent is confused about what a column means, it can query the metadata tool for the documented definition before guessing. Dremio&apos;s &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/&quot;&gt;semantic layer&lt;/a&gt; exposes this metadata through its MCP server, making it available to any agent that connects.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Calculation tools&lt;/strong&gt; handle statistical operations that SQL handles poorly : rolling averages, percentile calculations, contribution analysis, variance decomposition. The agent can call these for analysis steps that don&apos;t map cleanly to standard SQL.&lt;/p&gt;
&lt;h2&gt;Schema Exploration in Practice&lt;/h2&gt;
&lt;p&gt;When an agentic system encounters a new question about a schema it hasn&apos;t analyzed before, it starts with exploration rather than querying.&lt;/p&gt;
&lt;p&gt;A well-designed agent begins by listing available schemas, then listing tables in the relevant schema, then describing the tables most likely to contain the relevant data, then examining column statistics to understand data ranges and null rates. This typically takes 3–5 tool calls before the agent writes its first analytical query.&lt;/p&gt;
&lt;p&gt;This exploration step is expensive : it adds 5–10 seconds to the response time. But it prevents the agent from writing SQL that references non-existent columns or joins on mismatched types. The alternative , skipping exploration and trusting the model&apos;s general knowledge : produces errors on real schemas that don&apos;t match the training distribution.&lt;/p&gt;
&lt;p&gt;Semantic layers reduce the exploration cost significantly. When table and column documentation is rich and accurate, the agent can often go straight to the analytical query without full schema exploration, because the metadata provides the context it would otherwise need to discover empirically.&lt;/p&gt;
&lt;h2&gt;Self-Correction: How the Agent Handles Errors&lt;/h2&gt;
&lt;p&gt;The most practically important capability of a multi-step reasoning loop is self-correction. Real analytical environments are messy: data types are inconsistent, column names contain encoding errors, some tables have null primary keys, referenced metrics haven&apos;t been computed for the current period yet.&lt;/p&gt;
&lt;p&gt;When a SQL query returns an error, the agent&apos;s observation is the error message and any database context included in the response. A well-designed agent reads the error carefully and generates a corrected query on the retry.&lt;/p&gt;
&lt;p&gt;Common error patterns and their corrections:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Agent Corrective Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Column not found&lt;/td&gt;
&lt;td&gt;Re-explore table schema, find correct column name&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type mismatch in JOIN&lt;/td&gt;
&lt;td&gt;Add CAST to the appropriate column&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Division by zero&lt;/td&gt;
&lt;td&gt;Add a CASE statement or filter before dividing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No results returned&lt;/td&gt;
&lt;td&gt;Relax filter conditions, verify date range&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Result too large&lt;/td&gt;
&lt;td&gt;Add LIMIT, aggregate, or narrow the filter&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Self-correction doesn&apos;t work indefinitely. Agents should have a maximum retry count per step (typically 3) and a global iteration limit (typically 20). If the agent can&apos;t execute a valid query after 3 retries, it should return a specific &amp;quot;unable to complete&amp;quot; response with the last error message, not hallucinate an answer.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/agentic-self-correction-flow.png&quot; alt=&quot;Agentic analytics self-correction error handling flow&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Multi-Agent Systems for Complex Analysis&lt;/h2&gt;
&lt;p&gt;Single-agent loops handle most analytical investigations. For complex, multi-domain analyses that require different types of expertise, multi-agent architectures distribute the work.&lt;/p&gt;
&lt;p&gt;A common pattern: an orchestrator agent breaks the high-level goal into sub-goals and delegates each to a specialized sub-agent. A &amp;quot;Data Retrieval&amp;quot; agent handles schema exploration and raw data extraction. A &amp;quot;SQL Writer&amp;quot; agent generates queries based on specifications from the Data Retrieval agent. A &amp;quot;Report Synthesizer&amp;quot; agent assembles the findings into a structured narrative.&lt;/p&gt;
&lt;p&gt;Each agent&apos;s output becomes input for the next. The orchestrator manages sequencing, handles failures, and assembles the final result.&lt;/p&gt;
&lt;p&gt;The downside of multi-agent architectures is latency. Each agent handoff adds network overhead and reasoning time. For questions that need an answer in under 30 seconds, a single well-designed agent with good tool access is usually faster than a multi-agent pipeline.&lt;/p&gt;
&lt;h2&gt;What Determines Agent Quality&lt;/h2&gt;
&lt;p&gt;The accuracy and reliability of an agentic analytics system depend on three things, in roughly equal measure:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Semantic context quality:&lt;/strong&gt; An agent working with rich, accurate metadata produces correct SQL. An agent guessing at column meanings produces plausible-but-wrong SQL. The investment in documentation pays directly in agent accuracy.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LLM capability:&lt;/strong&gt; Larger, more capable models handle complex multi-step reasoning better than smaller models. The model needs to understand SQL syntax, database concepts, and business logic simultaneously.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tool design:&lt;/strong&gt; Tools that return structured, informative outputs (including error context) give the agent what it needs to self-correct. Tools that return raw errors without context force the agent to guess what went wrong.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;MCP server&lt;/a&gt; exposes the full semantic layer to external agents, providing the context that makes agent outputs reliable.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your own AI agent to a production-grade agentic analytics platform.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Mastering Apache Iceberg v3: What&apos;s New and How to Plan Your Upgrade</title><link>https://iceberglakehouse.com/posts/apache-iceberg-v3-upgrade/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/apache-iceberg-v3-upgrade/</guid><description>
# Apache Iceberg v3: What Changed and How to Upgrade Safely

Apache Iceberg v3 became production-ready with the release of Apache Iceberg 1.11.0 on M...</description><pubDate>Thu, 28 May 2026 09:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;Apache Iceberg v3: What Changed and How to Upgrade Safely&lt;/h1&gt;
&lt;p&gt;Apache Iceberg v3 became production-ready with the release of Apache Iceberg 1.11.0 on May 19, 2026. The specification had been in development for over a year, and 1.11.0 is the version that locks it in as stable for production workloads. If you run Iceberg tables, you need to understand what changed, what it costs to upgrade, and when to wait.&lt;/p&gt;
&lt;p&gt;This guide covers all six major features in Apache Iceberg v3 and walks through the upgrade path, including which engines are ready and what to test before you flip the switch.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/iceberg-v3-architecture.png&quot; alt=&quot;Apache Iceberg v3 table format architecture diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;What Is Apache Iceberg v3?&lt;/h2&gt;
&lt;p&gt;Apache Iceberg uses a &lt;code&gt;format-version&lt;/code&gt; property stored in each table&apos;s metadata to define which spec features that table uses. Version 1 was the original spec. Version 2 added row-level deletes (merge-on-read) and sequence numbers. Version 3 goes further, adding new data types, encryption, improved deletion performance, and row-level governance.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;format-version&lt;/code&gt; is a table-level property, not a cluster-level setting. Different tables in the same catalog can run different format versions. That means you can upgrade incrementally rather than committing your entire lakehouse at once.&lt;/p&gt;
&lt;p&gt;You need a v3-capable engine to write to a v3 table. You can still read v3 tables with older engines in some cases, but new v3 features like deletion vectors require an engine that understands the format.&lt;/p&gt;
&lt;h2&gt;The Six Features That Matter in Apache Iceberg v3&lt;/h2&gt;
&lt;h3&gt;Binary Deletion Vectors Replace Positional Delete Files&lt;/h3&gt;
&lt;p&gt;This is the biggest performance change in v3 for teams doing row-level updates.&lt;/p&gt;
&lt;p&gt;In Iceberg v2, row-level deletes used positional delete files: separate files listing which rows in a data file had been deleted. As a table accumulated many small delete files, the engine had to merge all of them at read time before returning results. On active tables with frequent updates, this merge overhead added up.&lt;/p&gt;
&lt;p&gt;Iceberg v3 replaces positional delete files with binary deletion vectors. Instead of a separate file per batch of deletes, the engine maintains a compact bitmap per data file. Each bit position represents a row. A set bit means that row is deleted. Reading the bitmap is an order-of-magnitude faster than merging hundreds of small delete files.&lt;/p&gt;
&lt;p&gt;The practical impact: tables with CDC pipelines or frequent UPDATE/DELETE operations read significantly faster after upgrading to v3 and running a compaction pass.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/iceberg-deletion-vectors-comparison.png&quot; alt=&quot;Iceberg deletion vectors vs positional delete files comparison&quot;&gt;&lt;/p&gt;
&lt;h3&gt;VARIANT, GEOMETRY, and Nanosecond Timestamps&lt;/h3&gt;
&lt;p&gt;Iceberg v3 adds three new first-class data types that address gaps in the previous spec.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;VARIANT&lt;/strong&gt; stores semi-structured or JSON data natively without schema flattening. In v1 and v2, teams stored JSON as a string column and parsed it at query time or pre-flattened it into hundreds of typed columns. VARIANT lets the engine store and access nested structures directly, and query engines can push predicates into the nested data. This is particularly useful for event streams, API response logs, and ML feature stores.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;GEOMETRY and GEOGRAPHY&lt;/strong&gt; are first-class spatial types. You can store points, lines, and polygons in Iceberg tables and run spatial joins natively. Before v3, teams stored spatial data as WKT strings and depended on engine-specific extensions for spatial queries.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Nanosecond timestamps&lt;/strong&gt; (timestamp_ns and timestamptz_ns) meet requirements for high-frequency financial and IoT data. Microsecond precision was sufficient for most workloads, but trading systems and sensor networks generating more than a million events per second need nanosecond fidelity.&lt;/p&gt;
&lt;h3&gt;Default Column Values Eliminate Backfills&lt;/h3&gt;
&lt;p&gt;Adding a new column to a large Iceberg table in v1 and v2 required a choice: leave all historical rows null and handle that downstream, or run a full data rewrite to populate the new column. For tables with terabytes of data, the rewrite was expensive.&lt;/p&gt;
&lt;p&gt;Iceberg v3 lets you define a default value for a new column at the schema level. When you run &lt;code&gt;ALTER TABLE ... ADD COLUMN&lt;/code&gt;, existing data files stay untouched. The engine reads the default value from table metadata and applies it to rows from old files. The operation completes in seconds, not hours.&lt;/p&gt;
&lt;p&gt;This is useful for any team adding metadata columns, rolling out new feature flags, or introducing surrogate keys to existing tables without a maintenance window.&lt;/p&gt;
&lt;h3&gt;Row Lineage Tracking&lt;/h3&gt;
&lt;p&gt;Iceberg v3 formalizes row-level lineage by assigning every row an identity and a modification sequence number. The spec now tracks which operation created or last modified each row and in which snapshot.&lt;/p&gt;
&lt;p&gt;For most analytics workloads this runs transparently in the background. Where it pays off is in compliance and auditing. Regulatory reporting often requires demonstrating exactly which rows changed, when, and through which pipeline. With v3 row lineage, that information is part of the table&apos;s metadata, not a side-channel audit log that has to be maintained separately.&lt;/p&gt;
&lt;p&gt;It also simplifies CDC pipelines. Instead of comparing full row snapshots to detect changes, downstream consumers can read the sequence numbers to identify what changed since the last checkpoint.&lt;/p&gt;
&lt;h3&gt;Multi-Argument Transforms and Table Encryption&lt;/h3&gt;
&lt;p&gt;Two smaller but important additions round out v3.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Multi-argument transforms&lt;/strong&gt; extend the partitioning and sorting spec to accept multiple input columns. In v1 and v2, each transform operated on a single column. v3 lets you express composite partitioning strategies, such as partitioning by both region and truncated date, without workarounds.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table encryption keys&lt;/strong&gt; add built-in support for KMS-backed encryption at the table level. Previous encryption approaches required external tooling or relied on object storage bucket policies. v3 makes encryption a first-class property of the Iceberg table, with per-table keys managed through standard KMS integrations.&lt;/p&gt;
&lt;h2&gt;How to Plan Your Apache Iceberg v3 Upgrade&lt;/h2&gt;
&lt;p&gt;Before running any upgrade command, confirm that every engine writing to the table supports Iceberg v3.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Engine support as of mid-2026:&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;v3 Read&lt;/th&gt;
&lt;th&gt;v3 Write&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Apache Spark (with Iceberg 1.11.0)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trino&lt;/td&gt;
&lt;td&gt;Check your version&lt;/td&gt;
&lt;td&gt;Check your version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flink&lt;/td&gt;
&lt;td&gt;Partial : verify 1.11.0 support&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dremio&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (via Open Catalog)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snowflake&lt;/td&gt;
&lt;td&gt;Interop via REST catalog&lt;/td&gt;
&lt;td&gt;Check release notes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If any engine in your pipeline does not support v3 writes, hold off. A v3 table that a Flink job writes partial records to incorrectly will cause data consistency problems.&lt;/p&gt;
&lt;p&gt;The upgrade command itself is simple:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE my_catalog.my_schema.my_table
SET TBLPROPERTIES (&apos;format-version&apos; = &apos;3&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is a metadata-only operation. Your existing data files stay in their current format. The engine will write new files in v3 format going forward. Old files remain readable : they just don&apos;t use v3 features like deletion vectors for their existing data.&lt;/p&gt;
&lt;p&gt;After upgrading, run these validations:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Confirm the &lt;code&gt;format-version&lt;/code&gt; property reads back as &lt;code&gt;3&lt;/code&gt; from your metadata tables.&lt;/li&gt;
&lt;li&gt;Run a representative read query and compare results against a pre-upgrade snapshot.&lt;/li&gt;
&lt;li&gt;Execute an UPDATE or DELETE statement and verify the engine writes a deletion vector instead of a positional delete file.&lt;/li&gt;
&lt;li&gt;Check that all engines in your pipeline can still read the table without errors.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;When to wait:&lt;/strong&gt; If your primary query engine hasn&apos;t shipped v3 support in a tested release, or if you&apos;re running Trino versions below the supported range, wait for the version upgrade first. A failed partial write from an incompatible engine is harder to recover from than simply waiting a release cycle.&lt;/p&gt;
&lt;h2&gt;What v3 Means for Your Dremio Lakehouse&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s &lt;a href=&quot;https://www.dremio.com/blog/the-brain-of-the-agentic-lakehouse-inside-dremios-open-catalog-architecture/&quot;&gt;Open Catalog&lt;/a&gt;, built on Apache Polaris, manages Iceberg tables with automatic table optimization. When your tables run v3, Dremio&apos;s background compaction jobs will write deletion vectors instead of positional delete files as part of the normal maintenance cycle. You don&apos;t need to change your compaction strategy.&lt;/p&gt;
&lt;p&gt;The VARIANT type integrates directly with Dremio&apos;s AI SQL functions. &lt;code&gt;AI_GENERATE&lt;/code&gt; can extract structured schemas from VARIANT columns, letting you run LLM-powered analysis on semi-structured data without first flattening it. That closes a gap that previously required a separate transformation step before AI queries could run.&lt;/p&gt;
&lt;p&gt;Row lineage tracking aligns with Dremio&apos;s fine-grained access control (FGAC). Compliance teams running on Dremio can combine row-level sequence numbers from the Iceberg metadata with Dremio&apos;s audit logs to produce end-to-end data lineage reports. The &lt;a href=&quot;https://www.dremio.com/blog/5-powerful-dremio-ai-features-you-should-be-using/&quot;&gt;AI features in Dremio&lt;/a&gt; include metadata generation that can annotate VARIANT and spatial columns automatically.&lt;/p&gt;
&lt;h2&gt;Start with One Table&lt;/h2&gt;
&lt;p&gt;Run the format upgrade on a development or staging table first. Pick a table with active UPDATE workloads so you can measure the deletion vector behavior directly. Compare read times before and after, check deletion vector file sizes against your old positional delete files, and confirm your entire engine stack handles the new format cleanly.&lt;/p&gt;
&lt;p&gt;After that validation passes, roll the upgrade to production in batches, starting with the tables where deletion vector performance gains matter most.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and run your Iceberg v3 tables against a production-grade query engine from day one.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Building a Custom Agentic Analytics System: Python, LangChain, and SQL Data Lakes</title><link>https://iceberglakehouse.com/posts/building-custom-agentic-analytics-python/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/building-custom-agentic-analytics-python/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-05-building-custom-agentic-analytic...</description><pubDate>Thu, 28 May 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-05-building-custom-agentic-analytics-python/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1&gt;Building a Custom Agentic Analytics System: Python, LangChain, and SQL Data Lakes&lt;/h1&gt;
&lt;p&gt;Building your own agentic analytics system is a reasonable choice if you need custom investigation logic, specific tool integrations, or control over how the agent reasons about your schema. The open-source tooling is mature enough in 2026 that you can have a working prototype in an afternoon, and a production-grade system in a few weeks.&lt;/p&gt;
&lt;p&gt;This tutorial walks through building a SQL analytics agent using Python, LangChain, and Dremio as the data layer. The agent can explore schemas, write SQL, correct its own errors, and return structured analytical results.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/python-langchain-analytics-agent.png&quot; alt=&quot;Custom agentic analytics system Python LangChain architecture&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;p&gt;You&apos;ll need Python 3.11+, a Dremio Cloud account (the free trial works), and the following packages:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;pip install langchain langchain-openai sqlalchemy pydremio python-dotenv
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You&apos;ll also need an OpenAI API key for the LLM, and your Dremio connection credentials: host, PAT (personal access token), and the target catalog and schema.&lt;/p&gt;
&lt;h2&gt;Connecting to Dremio&lt;/h2&gt;
&lt;p&gt;Dremio exposes a JDBC-compatible SQL interface. LangChain&apos;s SQL toolkit wraps SQLAlchemy, which connects to Dremio through the Arrow Flight SQL protocol.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from sqlalchemy import create_engine
from langchain.sql_database import SQLDatabase
import os

DREMIO_HOST = os.getenv(&amp;quot;DREMIO_HOST&amp;quot;)
DREMIO_TOKEN = os.getenv(&amp;quot;DREMIO_TOKEN&amp;quot;)

# Dremio JDBC connection string via Arrow Flight SQL
engine = create_engine(
    f&amp;quot;dremio+flight://{DREMIO_HOST}:32010/dremio&amp;quot;,
    connect_args={
        &amp;quot;token&amp;quot;: DREMIO_TOKEN,
        &amp;quot;disableCertificateVerification&amp;quot;: False,
    }
)

db = SQLDatabase(
    engine,
    schema=&amp;quot;my_catalog.analytics&amp;quot;,  # Limit to specific schema for safety
    include_tables=[&amp;quot;orders&amp;quot;, &amp;quot;customers&amp;quot;, &amp;quot;revenue_daily&amp;quot;, &amp;quot;product_catalog&amp;quot;]
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Limiting &lt;code&gt;include_tables&lt;/code&gt; to your relevant tables serves two purposes: it reduces the schema context the agent loads (improving reasoning speed), and it prevents the agent from exploring tables it shouldn&apos;t access.&lt;/p&gt;
&lt;h2&gt;Building the SQL Agent&lt;/h2&gt;
&lt;p&gt;LangChain&apos;s &lt;code&gt;create_sql_agent&lt;/code&gt; wraps the ReAct loop with SQL-specific tools: schema inspection, sample query generation, and query execution.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain_openai import ChatOpenAI
from langchain.agents.agent_types import AgentType

llm = ChatOpenAI(
    model=&amp;quot;gpt-4o&amp;quot;,
    temperature=0,  # Zero temperature for deterministic SQL generation
    openai_api_key=os.getenv(&amp;quot;OPENAI_API_KEY&amp;quot;)
)

toolkit = SQLDatabaseToolkit(db=db, llm=llm)

agent = create_sql_agent(
    llm=llm,
    toolkit=toolkit,
    agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,  # Set to False in production
    max_iterations=15,
    max_execution_time=60,  # Hard timeout in seconds
    handle_parsing_errors=True
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Setting &lt;code&gt;temperature=0&lt;/code&gt; is important for SQL generation. Higher temperatures introduce randomness that produces creative but often invalid SQL. For schema exploration and analytical reasoning, determinism is preferable.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;max_iterations=15&lt;/code&gt; prevents runaway loops. An investigation that hasn&apos;t converged in 15 steps likely won&apos;t converge at all : either the question can&apos;t be answered with available data, or the agent is stuck in an error cycle.&lt;/p&gt;
&lt;h2&gt;Prompt Configuration for Your Schema&lt;/h2&gt;
&lt;p&gt;The default LangChain SQL agent prompt gives the LLM generic instructions for SQL databases. For production use, add schema-specific context in the system prompt.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.prompts import PromptTemplate

SYSTEM_PROMPT = &amp;quot;&amp;quot;&amp;quot;You are an analytical assistant for Acme Corp.
You have access to the following tables in the analytics schema:

- orders: Transaction records with order_id, customer_id, product_id,
  amount_usd, order_date, status
- customers: Customer master with customer_id, region, segment,
  acquisition_date
- revenue_daily: Pre-aggregated daily revenue by region and product line
- product_catalog: Product metadata with product_id, category,
  unit_cost, launch_date

Important business definitions:
- &amp;quot;Active customer&amp;quot;: customer with at least one order in the last 30 days
- &amp;quot;Revenue&amp;quot;: sum of amount_usd where status = &apos;completed&apos;
- &amp;quot;This quarter&amp;quot;: current calendar quarter based on order_date

Always verify your results make sense against expected scale.
Monthly revenue should be in the range $2M-$15M.
If a query returns a value outside that range, check your WHERE clause.&amp;quot;&amp;quot;&amp;quot;

agent = create_sql_agent(
    llm=llm,
    toolkit=toolkit,
    agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    prefix=SYSTEM_PROMPT,
    max_iterations=15,
    max_execution_time=60,
    handle_parsing_errors=True
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The business definitions in the system prompt are critical. Without them, the agent interprets &amp;quot;revenue&amp;quot; using generic SQL patterns and may count cancelled orders or use gross rather than net amounts. The range check (&amp;quot;Monthly revenue should be in $2M-$15M&amp;quot;) catches gross calculation errors before they reach the user.&lt;/p&gt;
&lt;h2&gt;Secure Execution Sandboxes&lt;/h2&gt;
&lt;p&gt;A SQL agent connected to production data needs guardrails. The most important: read-only access.&lt;/p&gt;
&lt;p&gt;Configure your Dremio PAT with SELECT permissions only, on the schemas you&apos;ve included in the agent. No CREATE, INSERT, UPDATE, or DELETE. This prevents the agent from accidentally modifying data through a malformed write query.&lt;/p&gt;
&lt;p&gt;Add a query result size limit:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;db = SQLDatabase(
    engine,
    schema=&amp;quot;my_catalog.analytics&amp;quot;,
    include_tables=[&amp;quot;orders&amp;quot;, &amp;quot;customers&amp;quot;, &amp;quot;revenue_daily&amp;quot;, &amp;quot;product_catalog&amp;quot;],
    sample_rows_in_table_info=3,  # Sample rows for context, not full scan
    max_string_length=300  # Truncate long string columns in results
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For production deployments, log every query the agent executes:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.callbacks import FileCallbackHandler
import logging

logging.basicConfig(filename=&amp;quot;agent_queries.log&amp;quot;, level=logging.INFO)

def log_query(query: str, result: str):
    logging.info(f&amp;quot;QUERY: {query}&amp;quot;)
    logging.info(f&amp;quot;RESULT_ROWS: {len(result.split(chr(10)))}&amp;quot;)

# Add to your agent invocation wrapper
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/langchain-dremio-execution-flow.png&quot; alt=&quot;LangChain SQL agent execution flow with Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Running Your First Investigation&lt;/h2&gt;
&lt;p&gt;With the agent configured, run an analytical question:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;result = agent.invoke({
    &amp;quot;input&amp;quot;: &amp;quot;What were the top 3 product categories by revenue last month, &amp;quot;
             &amp;quot;and how did each compare to the same period last year?&amp;quot;
})

print(result[&amp;quot;output&amp;quot;])
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The agent will:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Inspect the available tables&lt;/li&gt;
&lt;li&gt;Write a SQL query joining orders and product_catalog, filtered to last month&lt;/li&gt;
&lt;li&gt;Run the query and read results&lt;/li&gt;
&lt;li&gt;Write a second query for the same period last year&lt;/li&gt;
&lt;li&gt;Compare the results and produce a structured narrative&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The full trace (visible with &lt;code&gt;verbose=True&lt;/code&gt;) shows every thought, action, and observation. Review the trace during development to understand where the agent makes assumptions and whether those assumptions match your business logic.&lt;/p&gt;
&lt;h2&gt;What This Approach Doesn&apos;t Cover&lt;/h2&gt;
&lt;p&gt;A Python/LangChain prototype is useful for learning the architecture and testing hypotheses. It&apos;s not a production-ready agentic analytics system.&lt;/p&gt;
&lt;p&gt;Production systems need: multi-user session isolation, rate limiting per user, more sophisticated error handling than &lt;code&gt;handle_parsing_errors=True&lt;/code&gt;, persistent conversation history, and governance-aware tool access that respects your catalog&apos;s access control.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s built-in AI agent handles these requirements. Its &lt;a href=&quot;https://www.dremio.com/ai-agent/&quot;&gt;built-in AI Agent&lt;/a&gt; runs within the platform&apos;s governance model, respects the same access controls as human analysts, and logs every query to the audit trail. The Python approach gives you control and customizability; the platform approach gives you governance and production reliability.&lt;/p&gt;
&lt;h2&gt;Extending the Agent with Custom Tools&lt;/h2&gt;
&lt;p&gt;The LangChain &lt;code&gt;create_sql_agent&lt;/code&gt; function accepts a &lt;code&gt;extra_tools&lt;/code&gt; parameter for adding tools beyond the default SQL toolkit. Custom tools let your agent do things the default toolkit can&apos;t.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.tools import Tool
import requests

def get_exchange_rate(currency_pair: str) -&amp;gt; str:
    &amp;quot;&amp;quot;&amp;quot;Fetch current exchange rate for enriching financial analysis.&amp;quot;&amp;quot;&amp;quot;
    base, quote = currency_pair.split(&amp;quot;/&amp;quot;)
    resp = requests.get(f&amp;quot;https://api.exchangerate.host/convert?from={base}&amp;amp;to={quote}&amp;quot;)
    rate = resp.json().get(&amp;quot;result&amp;quot;, &amp;quot;unavailable&amp;quot;)
    return f&amp;quot;Current {base}/{quote} rate: {rate}&amp;quot;

exchange_tool = Tool(
    name=&amp;quot;get_exchange_rate&amp;quot;,
    description=&amp;quot;Use when the user asks about revenue in foreign currencies. Input: currency pair like USD/EUR&amp;quot;,
    func=get_exchange_rate
)

agent = create_sql_agent(
    llm=llm,
    toolkit=toolkit,
    extra_tools=[exchange_tool],
    max_iterations=15
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The agent now knows it can convert currencies when analyzing cross-regional revenue. It will use this tool when the investigation requires it and fall back to SQL-only analysis when it doesn&apos;t.&lt;/p&gt;
&lt;p&gt;This extensibility is where custom agents earn their complexity premium over platform-native solutions. You can add tools that fetch context from your CRM, look up product catalog metadata from a REST API, or retrieve historical benchmark data from a time-series database. The agent reasons about when to use each tool and chains them together in its investigation.&lt;/p&gt;
&lt;h2&gt;Evaluating Agent Output Quality&lt;/h2&gt;
&lt;p&gt;Before putting any SQL agent in front of real users, establish a way to measure whether its outputs are correct.&lt;/p&gt;
&lt;p&gt;Build a test suite of questions with known correct SQL and expected result ranges:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;TEST_CASES = [
    {
        &amp;quot;question&amp;quot;: &amp;quot;What was total revenue last month?&amp;quot;,
        &amp;quot;expected_sql_contains&amp;quot;: [&amp;quot;SUM(amount_usd)&amp;quot;, &amp;quot;status = &apos;completed&apos;&amp;quot;],
        &amp;quot;result_range&amp;quot;: (2_000_000, 15_000_000)
    },
    {
        &amp;quot;question&amp;quot;: &amp;quot;How many active customers do we have?&amp;quot;,
        &amp;quot;expected_sql_contains&amp;quot;: [&amp;quot;COUNT&amp;quot;, &amp;quot;order_date&amp;quot;],
        &amp;quot;result_range&amp;quot;: (10_000, 500_000)
    }
]

def evaluate_agent(agent, test_cases):
    results = []
    for case in test_cases:
        response = agent.invoke({&amp;quot;input&amp;quot;: case[&amp;quot;question&amp;quot;]})
        results.append({
            &amp;quot;question&amp;quot;: case[&amp;quot;question&amp;quot;],
            &amp;quot;passed&amp;quot;: True  # Implement your validation logic
        })
    return results
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run this evaluation after any schema change, any system prompt update, or any LLM model upgrade. The test suite tells you whether the agent still produces correct outputs for the cases you&apos;ve verified.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; to compare your custom agent&apos;s outputs against the platform&apos;s built-in agent on the same data.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The Death of the Data Swamp: Establishing Governance in Your 2026 Data Lakehouse</title><link>https://iceberglakehouse.com/posts/data-governance-lakehouse-2026/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/data-governance-lakehouse-2026/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-05-data-governance-lakehouse-2026/)...</description><pubDate>Thu, 28 May 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-05-data-governance-lakehouse-2026/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1&gt;The Death of the Data Swamp: Establishing Governance in Your 2026 Data Lakehouse&lt;/h1&gt;
&lt;p&gt;A data lake becomes a data swamp when teams stop trusting it. Tables accumulate with no clear owners. Column names mean different things in different tables. Schema changes break downstream jobs silently. No one knows which version of &amp;quot;revenue&amp;quot; is correct.&lt;/p&gt;
&lt;p&gt;Lakehouses solve many of the technical problems that created swamps : ACID transactions, schema evolution controls, time travel , but they don&apos;t solve the governance problem automatically. You still need active stewardship, clear metadata standards, and the tooling to enforce them.&lt;/p&gt;
&lt;p&gt;This post covers the practical governance model for a modern data lakehouse in 2026.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/lakehouse-governance-architecture.png&quot; alt=&quot;Data lakehouse governance architecture&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Why Lakehouses Still Become Swamps&lt;/h2&gt;
&lt;p&gt;The file-based nature of data lakehouses makes it easy to land data without structure. An engineer drops a Parquet file in a directory, registers it as an Iceberg table, and it&apos;s queryable. The data is there, but without documentation, access controls, or schema ownership, no one knows if it&apos;s correct or who maintains it.&lt;/p&gt;
&lt;p&gt;The three failure modes:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No schema ownership:&lt;/strong&gt; When a table&apos;s schema changes : a column renamed, a type widened, a partition scheme updated , there&apos;s no one accountable for notifying downstream consumers. Broken pipelines are discovered by business users who find blank cells in their dashboards, not by the data team that made the change.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Metric inconsistency:&lt;/strong&gt; Multiple teams define the same concept differently. Finance calculates &amp;quot;monthly active users&amp;quot; as users with at least one session in the month. Marketing calculates it as users who opened an email or logged in. Both are plausible. Neither is documented. Executives see different numbers and lose confidence in the platform.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Access without accountability:&lt;/strong&gt; Data lands in the lake, and access permissions are permissive by default. Analysts find useful tables and start building reports on them. No one notices that some of those tables contain unmasked PII or that the data is from an unvalidated source.&lt;/p&gt;
&lt;h2&gt;The Three Components of Active Governance&lt;/h2&gt;
&lt;p&gt;Active governance requires three things working together: metadata stewardship, schema evolution safety, and data drift detection.&lt;/p&gt;
&lt;h3&gt;Metadata Stewardship&lt;/h3&gt;
&lt;p&gt;Every table in your lakehouse should have a documented owner, a description of its contents, classifications for sensitive columns, and a record of which downstream processes depend on it.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s semantic layer makes this executable. Wikis attach human-written (or AI-generated) documentation directly to datasets and columns. Labels classify columns as PII, financial, operational, or other categories. This metadata lives in the catalog alongside the schema : it&apos;s not in a separate documentation system that goes out of date.&lt;/p&gt;
&lt;p&gt;The AI-generated metadata feature in Dremio samples a table&apos;s schema and contents, then generates wiki descriptions for each column. A human steward reviews and approves. This reduces the labor cost of documentation enough that teams actually do it, rather than treating metadata as optional.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Assign clear ownership before a table reaches production.&lt;/strong&gt; A table without an owner is a table without accountability. Ownership should be at the team level, not the individual level : individual owners leave organizations; teams don&apos;t.&lt;/p&gt;
&lt;h3&gt;Schema Evolution Safety&lt;/h3&gt;
&lt;p&gt;Iceberg&apos;s schema evolution rules prevent the most destructive changes from happening silently. Adding a column is always safe. Dropping a column requires that no downstream manifest references it. Widening a type (int to long) is safe. Narrowing a type is not allowed.&lt;/p&gt;
&lt;p&gt;But schema evolution rules only prevent accidental damage at the format level. They don&apos;t prevent business-logic breaks : a column that&apos;s renamed from &lt;code&gt;revenue_usd&lt;/code&gt; to &lt;code&gt;gross_revenue_usd&lt;/code&gt; is technically valid but breaks every downstream query that references the old name.&lt;/p&gt;
&lt;p&gt;Use Dremio&apos;s virtual datasets as the stable contract layer. Build views on top of raw Iceberg tables. Downstream consumers query the views, not the tables. When a column changes in the underlying table, update the view to preserve the old name as an alias. Consumers are unaffected.&lt;/p&gt;
&lt;p&gt;The medallion architecture formalizes this: Bronze tables match raw source data and change frequently. Silver views translate Bronze schemas into stable business terms. Gold views serve specific applications. Changes in Bronze propagate through Silver only after explicit review.&lt;/p&gt;
&lt;h3&gt;Data Drift Detection&lt;/h3&gt;
&lt;p&gt;Data drift is when the actual data in a table starts diverging from what the table is supposed to contain. A sensor stops sending readings and the column fills with nulls. A source system changes its encoding and string fields start arriving with unexpected characters. A calculation pipeline changes its logic and historical aggregates shift.&lt;/p&gt;
&lt;p&gt;Drift doesn&apos;t trigger errors : it produces wrong results silently.&lt;/p&gt;
&lt;p&gt;The minimum viable drift detection setup:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Monitor null rates per column in each daily batch. Alert when a column&apos;s null rate increases by more than 5% from its 30-day average.&lt;/li&gt;
&lt;li&gt;Monitor record counts per partition. Alert when a partition receives significantly fewer records than historical average.&lt;/li&gt;
&lt;li&gt;Run automated reconciliation queries that compare key aggregates (row counts, sums of financial columns) against the previous period.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dremio&apos;s AI SQL functions can automate some of this. &lt;code&gt;AI_CLASSIFY&lt;/code&gt; can flag rows that don&apos;t match expected patterns. &lt;code&gt;AI_COMPLETE&lt;/code&gt; can summarize anomaly reports into human-readable alerts. These aren&apos;t replacements for purpose-built data quality monitoring tools, but they can catch obvious drift quickly.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/medallion-stewardship-flow.png&quot; alt=&quot;Data governance medallion architecture stewardship flow&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Catalog-Level Enforcement Points&lt;/h2&gt;
&lt;p&gt;Governance needs enforcement, not just documentation. The three enforcement points:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Catalog registration requirement:&lt;/strong&gt; Tables can&apos;t be queried until they have a registered owner and a non-empty description. This can be enforced through catalog policies in Dremio&apos;s Open Catalog : tables without metadata fail the registration check.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Access request workflow:&lt;/strong&gt; Default access to new tables is read-restricted. Access grants require approval from the table owner. This prevents the &amp;quot;permissive by default&amp;quot; pattern that turns lakehouses into swamps.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Breaking change review:&lt;/strong&gt; Schema changes to production tables go through a review process. The review includes an impact analysis: which virtual datasets, reports, and pipelines reference the changed table or column.&lt;/p&gt;
&lt;p&gt;These processes require organizational discipline in addition to tooling. The tooling provides the audit log and the enforcement mechanism. The organizational process provides the review and approval step.&lt;/p&gt;
&lt;h2&gt;Where to Start&lt;/h2&gt;
&lt;p&gt;If you&apos;re running an existing lakehouse that has accumulated ungoverned tables, the recovery path is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Audit: Generate a list of all tables, their last modified date, their query frequency, and whether they have a documented owner.&lt;/li&gt;
&lt;li&gt;Triage: Mark tables that have no owner and no queries in the last 90 days as candidates for deprecation. Tables with active queries get an owner assigned retroactively.&lt;/li&gt;
&lt;li&gt;Document: Use Dremio&apos;s AI metadata generation to create wiki drafts for active tables. Have a human review and approve each one.&lt;/li&gt;
&lt;li&gt;Enforce: Set the catalog registration requirement and access control policies going forward. Grandfather existing tables with a deadline for compliance.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Building governance from scratch into a new lakehouse is simpler than fixing an existing one. Set the standards before the first table lands.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and start with a governed, documented catalog from day one.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How Apache Iceberg Resolves the Hybrid-Cloud Challenge in Heavily Regulated Markets</title><link>https://iceberglakehouse.com/posts/iceberg-hybrid-cloud-regulated-markets/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-hybrid-cloud-regulated-markets/</guid><description>
# How Apache Iceberg Resolves the Hybrid-Cloud Challenge in Heavily Regulated Markets

Financial institutions in Japan, Germany, and similar regulate...</description><pubDate>Thu, 28 May 2026 09:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;How Apache Iceberg Resolves the Hybrid-Cloud Challenge in Heavily Regulated Markets&lt;/h1&gt;
&lt;p&gt;Financial institutions in Japan, Germany, and similar regulated markets face a specific architectural problem. Their regulators require sensitive data to stay on-premises or within a defined geographic boundary. Their data teams want cloud-scale analytics. Those two requirements pull in opposite directions, and proprietary cloud warehouses make the conflict worse.&lt;/p&gt;
&lt;p&gt;Apache Iceberg resolves this by separating what the data is stored as from where it is stored and which engine queries it. That separation gives regulated enterprises a path to hybrid-cloud analytics that doesn&apos;t compromise data residency compliance.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/iceberg-hybrid-cloud-architecture.png&quot; alt=&quot;Apache Iceberg hybrid cloud architecture for regulated markets&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The Data Residency Problem in Regulated Markets&lt;/h2&gt;
&lt;p&gt;Japan&apos;s Act on the Protection of Personal Information (APPI) and the Financial Services Agency (FSA) guidelines restrict where personal and financial data can be processed. Germany&apos;s BAFIN guidelines and the EU&apos;s GDPR impose similar constraints. These regulations don&apos;t just limit storage : in some cases, they constrain which compute resources can access sensitive rows.&lt;/p&gt;
&lt;p&gt;Proprietary cloud warehouses create a compliance problem because they bundle storage and compute into a single hosted system. Your data goes into their infrastructure. You may have some region selection options, but the catalog, access control, and audit logs all run in the vendor&apos;s cloud. For regulated Japanese institutions, that means customer data flowing through infrastructure they don&apos;t control.&lt;/p&gt;
&lt;p&gt;The result is a two-system architecture that most regulated enterprises default to: an on-premises system for sensitive data and a cloud warehouse for analytical workloads on non-sensitive data. Those systems require ETL to synchronize them, which adds cost, latency, and yet another point where data moves across boundaries.&lt;/p&gt;
&lt;h2&gt;How Apache Iceberg Changes the Architecture&lt;/h2&gt;
&lt;p&gt;Iceberg tables store their data in Parquet files on object storage. That storage can be on-premises (using S3-compatible systems like MinIO or Ceph), in a private cloud, or in a public cloud region that meets residency requirements. The Iceberg table format itself doesn&apos;t dictate where the storage lives.&lt;/p&gt;
&lt;p&gt;The Iceberg catalog :  which tracks table metadata, schema, and file locations ,  is also storage-agnostic. You can run an open-source Iceberg REST catalog entirely within your own data center. Compute engines connect to that catalog to discover tables and get file locations. No data ever leaves your controlled environment unless you explicitly configure an engine to move it.&lt;/p&gt;
&lt;p&gt;This creates an architecture where:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Sensitive data stays on-premises in your own object storage&lt;/li&gt;
&lt;li&gt;An on-premises or private-cloud Iceberg catalog manages metadata&lt;/li&gt;
&lt;li&gt;Analytical engines can be on-premises or in a private cloud connected by VPN&lt;/li&gt;
&lt;li&gt;Non-sensitive workloads can run on public cloud compute engines&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All of these engines read the same Iceberg table format. You don&apos;t need separate data models or ETL jobs to maintain different copies.&lt;/p&gt;
&lt;h2&gt;Apache Polaris as the Cross-Environment Catalog&lt;/h2&gt;
&lt;p&gt;Apache Polaris is an open-source implementation of the Iceberg REST catalog specification. Because it follows an open standard, any Iceberg-compatible engine (Spark, Trino, Flink, Dremio) can connect to it without vendor-specific connectors.&lt;/p&gt;
&lt;p&gt;For regulated environments, Polaris matters for two reasons.&lt;/p&gt;
&lt;p&gt;First, you can run it yourself. Unlike vendor-managed catalogs that live in the vendor&apos;s cloud, a self-hosted Polaris instance stays within your infrastructure boundary. You control the authentication, the access logs, and the retention of catalog metadata.&lt;/p&gt;
&lt;p&gt;Second, Polaris provides role-based access control (RBAC) at the catalog level with credential vending. When an engine needs to read a table, Polaris issues short-lived, scoped credentials. The engine only gets access to the specific storage paths its role permits. Even if a compute cluster in a less-restricted zone connects to the catalog, Polaris controls which data files it can actually read.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s Open Catalog extends this model by combining Apache Polaris with federated source connections. A Dremio deployment in a controlled environment can serve as the authorized access point for regulated Iceberg tables, enforcing fine-grained access control (FGAC) including row-level filtering and column masking through user-defined functions (UDFs). External engines in less-sensitive zones connect through Dremio, which enforces the governance policies consistently.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/polaris-hybrid-catalog-model.png&quot; alt=&quot;Apache Polaris open catalog hybrid deployment model&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Cross-Catalog Synchronization for Mixed Workloads&lt;/h2&gt;
&lt;p&gt;Not all data in a regulated institution is sensitive. General ledger aggregates, anonymized customer segments, and macroeconomic indicators often carry no residency restrictions. Those datasets can live in a public cloud Iceberg catalog and be queried by cloud-based compute without any compliance concern.&lt;/p&gt;
&lt;p&gt;The challenge is joining sensitive on-premises tables with non-sensitive cloud tables in a single query. Iceberg&apos;s open catalog standard makes this possible through catalog federation. A query engine connected to both catalogs can reference tables from each in the same SQL statement. The engine plans the query, reads from each location according to the credentials it holds, and assembles the result.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s query federation handles this pattern directly. It connects to multiple catalogs :  one on-premises, one in the cloud ,  and presents them in a unified namespace. An analyst writes a single SQL query. Dremio handles the cross-environment execution, applying access control from each catalog at the appropriate step.&lt;/p&gt;
&lt;p&gt;The on-premises data never moves to the cloud. The cloud data never gets pulled into the on-premises system unnecessarily. Predicate pushdown filters data at the source before it crosses the network boundary.&lt;/p&gt;
&lt;h2&gt;Practical Implementation Considerations&lt;/h2&gt;
&lt;p&gt;Running a hybrid Iceberg deployment in a regulated environment requires attention to a few operational details.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Network segmentation:&lt;/strong&gt; The connection between on-premises infrastructure and cloud compute environments should go through a dedicated private link or VPN, not the public internet. Iceberg metadata and credential vending traffic is small, but it carries authorization information that must be protected in transit.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Audit trails:&lt;/strong&gt; Catalog-level audit logs from Polaris and engine-level query logs must both be retained within the compliance boundary. If your regulation requires a 7-year audit trail, your catalog logs need the same retention policy as your transaction records.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Encryption:&lt;/strong&gt; Iceberg v3 adds built-in table-level encryption with KMS-backed keys. For regulated data at rest, use this combined with your on-premises KMS.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cross-engine schema enforcement:&lt;/strong&gt; When multiple engines write to the same Iceberg table, they all go through the same catalog. The catalog enforces schema evolution rules, preventing any single engine from accidentally dropping a column that another engine depends on.&lt;/p&gt;
&lt;p&gt;Start with one regulated dataset, deploy an on-premises Polaris instance, and validate that your compliance team can trace every access back through the catalog audit log. Once that proof of concept holds up, expand to additional tables.&lt;/p&gt;
&lt;h2&gt;Building the Audit Trail for Regulators&lt;/h2&gt;
&lt;p&gt;Regulated institutions need to produce complete access records for auditors on demand. &amp;quot;Complete&amp;quot; means: who accessed which data, from which system, at what time, and what SQL they ran.&lt;/p&gt;
&lt;p&gt;An Iceberg deployment built on Apache Polaris and Dremio gives you this at two levels. The Polaris catalog log captures every metadata request : table schema lookups, file location requests, and access control checks. The Dremio audit log captures every SQL query, the user who ran it, the tables accessed, and the result row count.&lt;/p&gt;
&lt;p&gt;Combined, these logs provide a chain of custody from the user request through catalog authorization to storage access. That&apos;s the audit trail BAFIN auditors and FSA examiners expect.&lt;/p&gt;
&lt;p&gt;Store both log streams in your compliance-approved log management system, not just in the vendor&apos;s infrastructure. If your regulation requires a 7-year retention window, configure your log export before you go to production.&lt;/p&gt;
&lt;h2&gt;Handling Engine Upgrades Without Data Migration&lt;/h2&gt;
&lt;p&gt;One benefit of the Iceberg format that&apos;s easy to overlook: you can upgrade your query engine without migrating your data.&lt;/p&gt;
&lt;p&gt;With a proprietary warehouse, switching vendors means exporting every table, converting formats, and reloading. With Iceberg, your data stays in the same Parquet files on the same object storage. Upgrading from an older Trino version to a newer one, or switching from Spark to Dremio for interactive queries, requires only a catalog connection change : not a data migration.&lt;/p&gt;
&lt;p&gt;For regulated institutions that move slowly on infrastructure changes (and most do), this matters. You can evaluate a new query engine against your production data before committing to a cutover. If the new engine produces different results on the same SQL, you investigate the discrepancy before you rely on the new system.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s &lt;a href=&quot;https://www.dremio.com/blog/the-brain-of-the-agentic-lakehouse-inside-dremios-open-catalog-architecture/&quot;&gt;Open Catalog architecture&lt;/a&gt; gives you a production-grade foundation for this model. &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; to explore how the federated catalog model works.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Securing Apache Iceberg Tables with Fine-Grained Row and Column Level Access Control</title><link>https://iceberglakehouse.com/posts/iceberg-row-column-access-control/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-row-column-access-control/</guid><description>
# Securing Apache Iceberg Tables with Fine-Grained Row and Column Level Access Control

Apache Iceberg handles table format, schema evolution, and me...</description><pubDate>Thu, 28 May 2026 09:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;Securing Apache Iceberg Tables with Fine-Grained Row and Column Level Access Control&lt;/h1&gt;
&lt;p&gt;Apache Iceberg handles table format, schema evolution, and metadata management. What it doesn&apos;t handle is access control. The spec defines how data is structured and stored, not who can see which rows or whether a phone number column gets masked for certain users.&lt;/p&gt;
&lt;p&gt;That gap isn&apos;t a flaw : it&apos;s a design choice. Security belongs in the catalog and query engine layer, not in the file format. But it means you need to understand which layer does which job before you assume your Iceberg tables are actually secured.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/iceberg-governance-stack.png&quot; alt=&quot;Apache Iceberg governance stack showing catalog and engine layers&quot;&gt;&lt;/p&gt;
&lt;h2&gt;What Apache Iceberg Controls (and What It Doesn&apos;t)&lt;/h2&gt;
&lt;p&gt;Iceberg manages everything about the physical table: file layout, schema history, partition structure, snapshot history, and encryption at rest (in v3). It does not control which users can read which rows, and it cannot mask column values based on user identity.&lt;/p&gt;
&lt;p&gt;If you give a compute engine direct read access to the Parquet files in your S3 bucket, that engine reads all rows and all columns. The Iceberg metadata tells it where the files are; nothing in the spec prevents it from reading the full content.&lt;/p&gt;
&lt;p&gt;Row-level security and column masking must be enforced at the catalog level, the engine level, or both. The catalog-level approach is more reliable because it applies consistently regardless of which engine connects. Engine-level policies work but are engine-specific : a policy you configure in Trino doesn&apos;t automatically apply when the same table is queried through Spark.&lt;/p&gt;
&lt;h2&gt;Apache Polaris RBAC: The Catalog Layer Foundation&lt;/h2&gt;
&lt;p&gt;Apache Polaris provides role-based access control (RBAC) for Iceberg tables through a hierarchy of service principals, principal roles, and catalog roles. Service principals represent identities (users, applications, or engines). Principal roles group service principals. Catalog roles define what a group of principals can do with specific namespaces or tables.&lt;/p&gt;
&lt;p&gt;The key security feature in Polaris is credential vending. When a compute engine requests access to an Iceberg table, Polaris doesn&apos;t give it a long-lived storage key. Instead, it issues short-lived, scoped credentials that give the engine access only to the specific storage paths it&apos;s authorized to read. The engine can&apos;t access files outside those paths, even if it knows the bucket structure.&lt;/p&gt;
&lt;p&gt;This means even if a compromised compute engine tries to scan your full S3 bucket, it gets credentials that only cover the paths Polaris has authorized. The storage policy is enforced at the catalog level, not just at the bucket IAM level.&lt;/p&gt;
&lt;p&gt;What Polaris doesn&apos;t do natively is row-level filtering or column masking. A catalog role either grants access to a table or it doesn&apos;t. For finer-grained control :  different rows visible to different roles, or SSN columns masked for analysts ,  you need the query engine layer.&lt;/p&gt;
&lt;h2&gt;Row-Level Security Through Query Engine Policies&lt;/h2&gt;
&lt;p&gt;Row-level security (RLS) filters rows at query time based on the user or role running the query. An analyst with a regional role sees only the rows for their region. A compliance officer sees all rows. The same table serves both.&lt;/p&gt;
&lt;p&gt;Most Iceberg-compatible engines implement RLS through a policy layer that rewrites queries at execution time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AWS Lake Formation&lt;/strong&gt; integrates with Iceberg tables on S3 and supports cell-level security: filtering by row conditions and masking specific columns. It works at the storage access level, which means the policy applies regardless of whether the engine is Athena, EMR Spark, or Glue.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dremio&lt;/strong&gt; implements RLS and column masking through user-defined functions (UDFs) applied to virtual datasets. A virtual dataset (VDS) is a SQL view defined in Dremio&apos;s semantic layer. You define the masking or filtering logic once in the VDS, and every query against that virtual dataset goes through the access control logic. Users querying through Dremio can&apos;t bypass the VDS to reach the raw table unless they have direct table permissions.&lt;/p&gt;
&lt;p&gt;The UDF-based approach in Dremio is flexible. You can write masking functions that partially expose data :  showing the last four digits of a credit card number, or replacing an email domain with &lt;code&gt;***.***&lt;/code&gt; ,  rather than fully hiding the column. The function gets the user&apos;s role from the session context and applies the appropriate transformation.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Example Dremio column masking UDF
CREATE FUNCTION mask_email(email VARCHAR, user_role VARCHAR)
RETURNS VARCHAR
AS IF user_role IN (&apos;admin&apos;, &apos;compliance&apos;) THEN email
   ELSE REGEXP_REPLACE(email, &apos;@.*&apos;, &apos;@[redacted]&apos;)
   END;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/polaris-access-control-flow.png&quot; alt=&quot;Apache Polaris credential vending and access control flow&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Column-Level Masking for PII Compliance&lt;/h2&gt;
&lt;p&gt;PII masking at the column level requires the engine to substitute values based on user identity before returning results. The mask applies transparently : the analyst runs a normal SELECT and gets masked values without needing to know the masking rule exists.&lt;/p&gt;
&lt;p&gt;Effective PII masking in an Iceberg environment requires:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A catalog with column-level metadata&lt;/strong&gt; that identifies which columns contain PII. Dremio&apos;s semantic layer supports wiki annotations and labels on columns. You can label a column as &lt;code&gt;PII: Email&lt;/code&gt; and use that metadata to drive masking policies programmatically.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A consistent enforcement point&lt;/strong&gt; that every user or application must pass through. If analysts can connect directly to the Iceberg catalog via Spark without going through Dremio, the Dremio masking policy doesn&apos;t protect them.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;An audit trail&lt;/strong&gt; that records who accessed which columns and when. Compliance frameworks require demonstrating that PII was not accessed by unauthorized parties, not just that you had a masking policy in place.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s &lt;a href=&quot;https://www.dremio.com/blog/the-brain-of-the-agentic-lakehouse-inside-dremios-open-catalog-architecture/&quot;&gt;fine-grained access control&lt;/a&gt; covers all three requirements through its Open Catalog. Tables cataloged in Dremio have column-level labels. Access through Dremio enforces the masking policy. Every query is logged with the user identity, query text, and accessed columns.&lt;/p&gt;
&lt;h2&gt;The Governance Stack in Practice&lt;/h2&gt;
&lt;p&gt;For most enterprise Iceberg deployments, the realistic access control stack looks like this:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What It Controls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;S3 IAM or Azure RBAC&lt;/td&gt;
&lt;td&gt;Who can access the bucket at all&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalog&lt;/td&gt;
&lt;td&gt;Apache Polaris / Dremio Open Catalog&lt;/td&gt;
&lt;td&gt;Which tables each engine can discover and access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine&lt;/td&gt;
&lt;td&gt;Dremio VDS + UDFs&lt;/td&gt;
&lt;td&gt;Row filtering and column masking per user role&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metadata&lt;/td&gt;
&lt;td&gt;Dremio Wikis and Labels&lt;/td&gt;
&lt;td&gt;PII classification and policy metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The catalog layer enforces the coarse boundary. The engine layer enforces the fine-grained access. The metadata layer provides the classification that drives policies.&lt;/p&gt;
&lt;p&gt;Start by centralizing access through a single query engine for internal users. Expose raw Iceberg files only to trusted, audited processes. Build the VDS masking layer for any table containing regulated data, and verify that the audit log captures every query before you call the governance model complete.&lt;/p&gt;
&lt;h2&gt;AI Agents and Access Control&lt;/h2&gt;
&lt;p&gt;AI agents querying your Iceberg tables are subject to the same access control requirements as human analysts : they need to be scoped to the data they&apos;re authorized to see, and their queries need to appear in your audit log.&lt;/p&gt;
&lt;p&gt;The critical question for any AI agent deployment: what identity does the agent run as? An agent that runs with admin-level credentials is a governance gap, not a governed tool. Give AI agents their own service principal in Polaris, assign that principal to a role with specific table permissions, and review what that role can access before deploying the agent to production.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;MCP server&lt;/a&gt; issues queries on behalf of the connected agent using the session credentials the agent presents. An agent connected with an analyst-role PAT gets analyst-level access through the same masking policies that apply to human analysts. The agent literally cannot see more data than an analyst in the same role.&lt;/p&gt;
&lt;p&gt;This is the architecture that makes AI-driven analytics safe for regulated data environments : not because the AI is inherently trustworthy, but because the access control system enforces the same policies regardless of whether the requester is human or automated.&lt;/p&gt;
&lt;h2&gt;Testing Your Access Control Setup&lt;/h2&gt;
&lt;p&gt;Access control policies need testing before production. The most common governance gap: a policy is configured but not actually enforced because the engine connects with a credential that bypasses the masking layer.&lt;/p&gt;
&lt;p&gt;Test your setup with a low-privilege service account that should be masked from PII columns. Run a query that would normally return sensitive data. Verify the masking output is what you expect. Then test with a privileged account that should see unmasked data, and verify that works too.&lt;/p&gt;
&lt;p&gt;For row-level filtering, test with two identities that should see different rows, and confirm neither identity sees rows from the other&apos;s scope. Document the test results and keep them as evidence that the policy was verified before data went into production.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and explore the full governance stack for your Iceberg lakehouse.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Designing an Immutable Data Lakehouse: Best Practices for Iceberg Snapshot Expiration</title><link>https://iceberglakehouse.com/posts/iceberg-snapshot-expiration/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-snapshot-expiration/</guid><description>
# Designing an Immutable Data Lakehouse: Best Practices for Iceberg Snapshot Expiration

Iceberg tables accumulate snapshots by design. Every write :...</description><pubDate>Thu, 28 May 2026 09:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;Designing an Immutable Data Lakehouse: Best Practices for Iceberg Snapshot Expiration&lt;/h1&gt;
&lt;p&gt;Iceberg tables accumulate snapshots by design. Every write :  every INSERT, UPDATE, DELETE, or compaction ,  creates a new snapshot. That&apos;s how Iceberg provides time travel, rollback, and concurrent reads without locks. It&apos;s a good feature, until you never clean it up.&lt;/p&gt;
&lt;p&gt;A production Iceberg table that takes 100 writes per day accumulates 36,500 snapshots in a year. Each snapshot points to manifest files, which point to data files. The metadata scan that precedes every query has to process all of that history unless you expire the snapshots that fall outside your retention window.&lt;/p&gt;
&lt;p&gt;This guide covers how to design a snapshot expiration policy that keeps tables clean without breaking active queries or compliance requirements.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/iceberg-snapshot-expiration-lifecycle.png&quot; alt=&quot;Iceberg snapshot lifecycle and expiration workflow&quot;&gt;&lt;/p&gt;
&lt;h2&gt;What Snapshot Accumulation Actually Costs You&lt;/h2&gt;
&lt;p&gt;The cost of snapshot accumulation is not storage : it&apos;s query planning time.&lt;/p&gt;
&lt;p&gt;When a query engine reads an Iceberg table, it starts by reading the metadata: the table metadata JSON, then the manifest list for the current snapshot, then the manifests that list the relevant data files. If your table has thousands of manifests from thousands of historical snapshots, the planner has to navigate that graph even though it only needs the current snapshot&apos;s files.&lt;/p&gt;
&lt;p&gt;The practical symptom: queries on the same data volume get slower over months as the metadata layer grows. A table that returns results in 2 seconds when first deployed may take 10 seconds a year later, with the same data, the same query, and the same compute.&lt;/p&gt;
&lt;p&gt;Storage is a secondary cost. Old snapshot manifests don&apos;t compress well and don&apos;t share file references efficiently. A table with 12 months of unexpired snapshots typically carries 3–5x the metadata size of a well-maintained table with a 7-day retention window.&lt;/p&gt;
&lt;h2&gt;Building a Snapshot Retention Policy&lt;/h2&gt;
&lt;p&gt;A useful retention policy answers three questions: how old can a snapshot be, how many snapshots do you always keep, and how long do your longest-running queries run?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Time-based expiration&lt;/strong&gt; sets the maximum age for snapshots. A 7-day window covers most analytical rollback needs. If you need to recover from a bad ETL job that ran 5 days ago, 7 days gives you that option. Going longer increases metadata overhead proportionally.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Count-based floor&lt;/strong&gt; ensures you always keep a minimum number of recent snapshots regardless of how quickly they were generated. A high-frequency streaming table might generate 1,000 snapshots in a single day. A 7-day window without a count floor would expire all of them by day 8. Setting &lt;code&gt;retainLast = 10&lt;/code&gt; guarantees at least 10 snapshots survive, giving you recent rollback options even during high-write periods.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Query duration safety buffer&lt;/strong&gt; is the constraint people miss. If a read query starts at timestamp T and a maintenance job expires the snapshot the query is reading at timestamp T+30 minutes, the query fails. Your snapshot retention window must be longer than your longest-running query. If your p99 query takes 4 hours, expire nothing newer than 6 hours ago.&lt;/p&gt;
&lt;p&gt;The table properties that enforce these rules:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE my_catalog.my_schema.my_table
SET TBLPROPERTIES (
  &apos;history.expire.min-snapshots-to-keep&apos; = &apos;10&apos;,
  &apos;history.expire.max-snapshot-age-ms&apos; = &apos;604800000&apos;  -- 7 days in ms
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Full Maintenance Sequence&lt;/h2&gt;
&lt;p&gt;Snapshot expiration is step one of a four-step maintenance sequence. Running them in order matters.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 1: Expire Snapshots&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Remove snapshot references outside your retention window. This doesn&apos;t delete physical files yet : it just removes the metadata pointers.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CALL iceberg.system.expire_snapshots(
  table =&amp;gt; &apos;my_catalog.my_schema.my_table&apos;,
  older_than =&amp;gt; TIMESTAMP &apos;2026-05-21 00:00:00&apos;,
  retain_last =&amp;gt; 10
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Step 2: Remove Orphan Files&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;After snapshot expiration, some physical data files may no longer be referenced by any remaining snapshot. These are orphan files. Remove them with a safety buffer : the default is 3 days, which prevents deleting files that an active write job just created.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CALL iceberg.system.remove_orphan_files(
  table =&amp;gt; &apos;my_catalog.my_schema.my_table&apos;,
  older_than =&amp;gt; TIMESTAMP &apos;2026-05-25 00:00:00&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Step 3: Rewrite Manifests&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;As snapshots are added and removed, manifest files fragment. Many manifests end up with only a few file references each. The planner has to open more manifest files to find the same number of data files, which slows scan planning.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CALL iceberg.system.rewrite_manifests(
  table =&amp;gt; &apos;my_catalog.my_schema.my_table&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Step 4: Compact Data Files&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Small file proliferation :  common in streaming ingestion ,  forces the engine to open thousands of files to scan the same amount of data. Compaction merges them.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CALL iceberg.system.rewrite_data_files(
  table =&amp;gt; &apos;my_catalog.my_schema.my_table&apos;,
  options =&amp;gt; map(&apos;target-file-size-bytes&apos;, &apos;134217728&apos;)
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run all four steps together in a scheduled job, at the frequency your table&apos;s write rate requires. High-frequency streaming tables may need nightly maintenance. Batch tables written once a week may only need monthly cleanup.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/iceberg-table-health-monitoring.png&quot; alt=&quot;Iceberg table health metrics monitoring&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Monitoring Table Health&lt;/h2&gt;
&lt;p&gt;You can check snapshot and manifest counts directly through Iceberg metadata tables. These queries don&apos;t require any external monitoring tool : they run in any SQL engine connected to your Iceberg catalog.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Check snapshot count
SELECT COUNT(*) AS snapshot_count
FROM my_catalog.my_schema.my_table.snapshots;

-- Check manifest fragmentation
SELECT COUNT(*) AS manifest_count
FROM my_catalog.my_schema.my_table.manifests;

-- Check file size distribution
SELECT
  COUNT(*) AS file_count,
  AVG(file_size_in_bytes) AS avg_file_size,
  MIN(file_size_in_bytes) AS min_file_size,
  MAX(file_size_in_bytes) AS max_file_size
FROM my_catalog.my_schema.my_table.files;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Set an alert if &lt;code&gt;snapshot_count&lt;/code&gt; exceeds your expected range, or if &lt;code&gt;avg_file_size&lt;/code&gt; drops below 50 MB (a sign of small file accumulation).&lt;/p&gt;
&lt;h2&gt;Automating Maintenance with Dremio&lt;/h2&gt;
&lt;p&gt;Running maintenance manually is unsustainable at scale. Dremio&apos;s Automatic Table Optimization runs compaction, manifest rewriting, and orphan file removal as background jobs on tables managed through its &lt;a href=&quot;https://www.dremio.com/blog/5-ways-dremio-delivers-an-apache-iceberg-lakehouse-without-the-headaches/&quot;&gt;Open Catalog&lt;/a&gt;. You configure the policy once per table, and the platform handles execution.&lt;/p&gt;
&lt;p&gt;For tables outside Dremio&apos;s managed catalog, you can run the maintenance procedures above through Dremio&apos;s SQL interface as scheduled queries, or orchestrate them through Airflow or similar schedulers.&lt;/p&gt;
&lt;p&gt;The tradeoff with external orchestration: you take responsibility for sequencing the steps correctly and monitoring for failures. A maintenance job that crashes halfway through :  after expiring snapshots but before removing orphan files ,  leaves the table in a partially cleaned state. Make sure your orchestrator retries failed steps safely.&lt;/p&gt;
&lt;h2&gt;Compliance and Retention: Navigating the Conflict&lt;/h2&gt;
&lt;p&gt;Snapshot expiration conflicts directly with some compliance frameworks. GDPR&apos;s right-to-erasure requirement says user data must be deleted within 30 days of a valid request. If your 90-day time travel window includes records with PII, you can&apos;t delete the snapshot that contains them without also deleting your time travel history.&lt;/p&gt;
&lt;p&gt;There are two approaches to this conflict:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PII-separate tables:&lt;/strong&gt; Store PII in a separate Iceberg table with a short retention window (7 days). The main analytical table contains only anonymized or tokenized identifiers. Deletion requests affect only the PII table.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Short retention windows for PII tables:&lt;/strong&gt; If PII data must co-exist with analytical data in the same table, set your retention window to the minimum that satisfies your operational rollback needs : often 48–72 hours. This means you can process deletion requests within 72 hours and the snapshot containing the deleted record expires within the retention window.&lt;/p&gt;
&lt;p&gt;Document your retention decisions in your data catalog. Auditors reviewing your GDPR compliance will want to see that snapshot retention windows were chosen deliberately, with explicit consideration of the deletion timelines they enable.&lt;/p&gt;
&lt;h2&gt;Scheduling Maintenance Without Impacting Query Performance&lt;/h2&gt;
&lt;p&gt;Snapshot expiration and compaction are write operations that temporarily lock table metadata. On a busy table with continuous reads, scheduling maintenance during peak query hours will cause planning delays.&lt;/p&gt;
&lt;p&gt;Schedule maintenance jobs during your low-traffic window : typically early morning for business-hours workloads, or midday for overnight batch workloads. For Dremio&apos;s Automatic Table Optimization, you can configure the maintenance window per table. The platform respects active queries and queues maintenance work rather than interrupting running reads.&lt;/p&gt;
&lt;p&gt;Start with a 7-day retention policy, a count floor of 10, and weekly maintenance runs. Adjust the frequency based on how fast your &lt;code&gt;snapshot_count&lt;/code&gt; and &lt;code&gt;manifest_count&lt;/code&gt; metrics grow.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; to run Iceberg tables with automated maintenance built in.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Decoupling Storage and Compute in Apache Iceberg: A Deep Dive into Cost Optimization</title><link>https://iceberglakehouse.com/posts/iceberg-storage-compute-decoupling/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-storage-compute-decoupling/</guid><description>
# Decoupling Storage and Compute in Apache Iceberg: A Cost Optimization Deep Dive

Most proprietary data warehouses bundle their storage and compute ...</description><pubDate>Thu, 28 May 2026 09:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;Decoupling Storage and Compute in Apache Iceberg: A Cost Optimization Deep Dive&lt;/h1&gt;
&lt;p&gt;Most proprietary data warehouses bundle their storage and compute into a single product. You buy the system, and you get both : at a price the vendor sets. Apache Iceberg breaks that model by treating storage and compute as separate, independently scalable concerns. That separation is the technical foundation for most of the cost advantages people attribute to data lakehouses.&lt;/p&gt;
&lt;p&gt;This post explains exactly how Iceberg achieves that decoupling, what it costs to maintain (because there are real operational requirements), and how to route workloads across engines to get the best cost-to-performance ratio.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/iceberg-storage-compute-decoupling.png&quot; alt=&quot;Apache Iceberg storage compute decoupling architecture&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Apache Iceberg Storage Compute Decoupling: The Core Mechanism&lt;/h2&gt;
&lt;p&gt;Traditional warehouses store data in proprietary formats tied to their internal engine. If you want to run Spark, you copy the data. If you want Snowflake and BigQuery on the same dataset, you maintain two copies. That&apos;s expensive, and keeping them in sync requires pipelines that add latency.&lt;/p&gt;
&lt;p&gt;Iceberg stores data in open file formats :  primarily Parquet ,  on commodity object storage (S3, GCS, Azure Data Lake Storage). The Iceberg spec defines a metadata layer on top of those files. Every engine that reads the metadata understands the table structure, partition layout, schema history, and file locations. Spark, Trino, Flink, Dremio, and Snowflake can all read the same Iceberg tables without any data movement.&lt;/p&gt;
&lt;p&gt;The metadata layer is what makes this work. It tracks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Which data files exist and where they are&lt;/li&gt;
&lt;li&gt;Min/max statistics for each column in each file (used for file pruning)&lt;/li&gt;
&lt;li&gt;Partition boundaries&lt;/li&gt;
&lt;li&gt;Schema history, including column additions and type changes&lt;/li&gt;
&lt;li&gt;Snapshot history for time travel&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When a query engine plans a query, it reads the metadata to identify exactly which files it needs to scan. If your query filters on &lt;code&gt;region = &apos;US&apos;&lt;/code&gt; and the table is partitioned by region, the planner skips every non-US file before it reads a single byte of actual data. That predicate pushdown saves compute time, which saves money.&lt;/p&gt;
&lt;h2&gt;The Multi-Engine Routing Advantage&lt;/h2&gt;
&lt;p&gt;With Iceberg decoupling storage from compute, you can route different workloads to the engine that processes them most cost-effectively.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Heavy batch ELT:&lt;/strong&gt; Use Apache Spark on spot instances. Spot pricing runs 70–90% cheaper than on-demand for batch workloads that can tolerate interruption. Spark on object storage with Iceberg is a standard pattern for this.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Interactive SQL and dashboards:&lt;/strong&gt; Use a high-performance engine like Dremio with its Columnar Cloud Cache (C3) and Reflections. Sub-second queries on the same Parquet files that Spark wrote. No copying.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Streaming ingestion:&lt;/strong&gt; Use Apache Flink to write Iceberg tables in real time. The same tables are then queryable by interactive engines without a separate serving layer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data science:&lt;/strong&gt; Python notebooks via PyIceberg read the same tables directly. No exports to CSV or separate data marts.&lt;/p&gt;
&lt;p&gt;Every engine reads from the same underlying files in your S3 bucket. You pay object storage rates (roughly $0.02–$0.025 per GB/month), not the compute markup that proprietary warehouses build into their storage tiers.&lt;/p&gt;
&lt;p&gt;The tradeoff: you&apos;re now responsible for choosing and configuring multiple engines. That operational overhead is real. If your team has 10 people and needs one SQL tool that works, a fully managed warehouse might be simpler. If you&apos;re running petabytes with diverse workload types, the cost savings from multi-engine routing are substantial.&lt;/p&gt;
&lt;h2&gt;Where the Hidden Costs Live&lt;/h2&gt;
&lt;p&gt;Decoupling isn&apos;t free. The storage layer requires active maintenance to avoid &amp;quot;metadata traps&amp;quot; that gradually erode performance and increase costs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Small files:&lt;/strong&gt; Every streaming micro-batch write generates small Parquet files. Reading thousands of 10 MB files is slower than reading dozens of 1 GB files, and each file adds metadata overhead. Left unaddressed, small file accumulation causes query planning time to grow even on the same data volume. Run periodic compaction to merge small files.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Snapshot bloat:&lt;/strong&gt; Every write to an Iceberg table creates a new snapshot. Snapshots let you time travel and roll back, but they accumulate. A table that takes 100 writes per day has 36,500 snapshots after a year. Expire snapshots older than your retention window. A 7-day window with a floor of 10 retained snapshots is a common starting point.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Orphan files:&lt;/strong&gt; Compaction rewrites files but the old files aren&apos;t deleted until you run orphan file cleanup. Run &lt;code&gt;remove_orphan_files&lt;/code&gt; weekly with a 3-day safety buffer to avoid deleting files currently being written.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Manifest fragmentation:&lt;/strong&gt; As snapshots accumulate, manifest files fragment. &lt;code&gt;rewrite_manifests&lt;/code&gt; consolidates them and speeds up scan planning.&lt;/p&gt;
&lt;p&gt;If you don&apos;t run these maintenance jobs, your storage costs grow and your query planning time increases. Platforms like Dremio automate this through Automatic Table Optimization, which runs compaction, manifest rewriting, and snapshot expiration as background jobs on tables in the Open Catalog.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/iceberg-cost-optimization-maintenance.png&quot; alt=&quot;Iceberg cost optimization maintenance workflow&quot;&gt;&lt;/p&gt;
&lt;h2&gt;A TCO Framework for the Iceberg Approach&lt;/h2&gt;
&lt;p&gt;To compare an Iceberg-based lakehouse against a proprietary warehouse, measure these four components:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Storage cost:&lt;/strong&gt; Object storage at market rates vs. the vendor&apos;s per-TB storage price. Most proprietary warehouses charge 3–5x the raw S3 rate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Compute cost:&lt;/strong&gt; Engine-specific compute rates for your workload mix. Interactive queries, batch jobs, and streaming have different compute profiles. Route each to the cheapest engine that meets the SLA.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Operational cost:&lt;/strong&gt; Maintenance automation reduces this, but it&apos;s never zero. Factor in the cost of running and monitoring maintenance jobs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Engineering cost:&lt;/strong&gt; Multiple engines mean multiple areas of expertise. A team that already knows Spark and SQL has lower engineering overhead than a team learning three new systems.&lt;/p&gt;
&lt;p&gt;For most teams running at over 5 TB of data with a mix of batch, streaming, and interactive workloads, the Iceberg-based approach is cheaper over a 3-year period. The breakeven depends heavily on how much of your workload is interactive : interactive queries are where purpose-built engines like Dremio earn their cost through speed, not raw compute cheapness.&lt;/p&gt;
&lt;h2&gt;Using Dremio as the Interactive Layer&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s &lt;a href=&quot;https://www.dremio.com/blog/why-agentic-analytics-requires-federation-virtualization-and-the-lakehouse-how-dremio-delivers/&quot;&gt;query federation&lt;/a&gt; connects to your Iceberg tables without copying data. Its Reflections feature creates pre-computed, optimized subsets of your most-queried data. Autonomous Reflections learns from query patterns over the last 7 days and creates those optimizations automatically.&lt;/p&gt;
&lt;p&gt;The result is sub-second query response on data stored in your own S3 bucket at S3 rates. The query engine is separate from the storage. You pay for compute only when queries run.&lt;/p&gt;
&lt;h2&gt;Understanding When Decoupling Doesn&apos;t Pay Off&lt;/h2&gt;
&lt;p&gt;Storage-compute decoupling delivers real cost advantages at scale, but there are scenarios where the tradeoffs don&apos;t work in your favor.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Small datasets under 500 GB:&lt;/strong&gt; The operational overhead of running Iceberg maintenance, configuring multiple engines, and managing catalog infrastructure is a fixed cost. At small data volumes, a managed cloud warehouse often costs less in total : especially when engineering time is factored in. Iceberg decoupling starts showing ROI at the terabyte scale.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Single-engine shops:&lt;/strong&gt; If your entire workload is interactive SQL queries run by business analysts, you don&apos;t need multi-engine routing. You pay for one engine, and that engine handles everything. The decoupling benefit :  routing different workloads to different engines ,  doesn&apos;t apply. In this case, evaluate whether the Iceberg format still makes sense for future flexibility, but don&apos;t architect for multi-engine routing you won&apos;t use.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Teams without operational capacity:&lt;/strong&gt; Running Iceberg without automated maintenance requires someone who understands the metadata model and can monitor table health. If no one on your team has Iceberg expertise, factor in the learning curve and operational risk before committing to the architecture.&lt;/p&gt;
&lt;p&gt;The honest summary: Iceberg storage-compute decoupling is a powerful cost tool at 5+ TB with diverse workloads. Below that threshold, evaluate the total operational cost carefully before abandoning a managed warehouse.&lt;/p&gt;
&lt;h2&gt;Governance and Access Control Across Engines&lt;/h2&gt;
&lt;p&gt;One practical complication of multi-engine Iceberg deployments: different engines may enforce access control differently. Spark reads the Iceberg catalog but applies its own security model. Trino has its own authentication layer. Dremio enforces RBAC through the Open Catalog.&lt;/p&gt;
&lt;p&gt;If you run multiple engines on the same tables, verify that your access control is enforced at the catalog level : not just at the engine level. Catalog-level RBAC through Apache Polaris means that no engine can read a restricted table, regardless of which tool the user connects with.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s credential vending model integrates tightly with Polaris RBAC: when any engine requests file locations for a table, the catalog checks the caller&apos;s permissions before returning signed access credentials. The data files themselves are inaccessible without those credentials, so even a direct S3 API call won&apos;t work for unauthorized users.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and run your Iceberg tables through a high-performance query engine without moving your data.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Legacy Warehouses to Open Lakehouses: A Step-by-Step Migration Playbook</title><link>https://iceberglakehouse.com/posts/legacy-warehouse-to-lakehouse-migration/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/legacy-warehouse-to-lakehouse-migration/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-05-legacy-warehouse-to-lakehouse-mi...</description><pubDate>Thu, 28 May 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-05-legacy-warehouse-to-lakehouse-migration/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1&gt;Legacy Warehouses to Open Lakehouses: A Step-by-Step Migration Playbook&lt;/h1&gt;
&lt;p&gt;Most teams that start a warehouse-to-lakehouse migration underestimate one thing: the actual problem is trust, not technology. Your stakeholders have dashboards that have been running the same numbers for years. The moment those numbers change : even correctly , you&apos;ve got a political problem.&lt;/p&gt;
&lt;p&gt;The technical migration is solvable. The trust migration is harder. This playbook handles both.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/warehouse-to-lakehouse-migration.png&quot; alt=&quot;Data warehouse to open lakehouse migration phases&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Why Teams Migrate Now&lt;/h2&gt;
&lt;p&gt;The economics of staying on a proprietary warehouse have shifted. Storage costs in legacy warehouses run 3–5x what the same data costs on object storage with Iceberg. Compute can&apos;t scale independently when it&apos;s bundled with storage. SQL on Iceberg has reached performance parity with managed warehouses for most analytical workloads, especially with a query engine like Dremio that adds Reflections-based acceleration.&lt;/p&gt;
&lt;p&gt;The second driver is AI. Teams building agentic analytics need open catalogs with semantic metadata. Proprietary warehouses have governance models designed for human analysts : schema-level permissions, not the column-level masking and contextual documentation that AI agents need to generate accurate SQL.&lt;/p&gt;
&lt;p&gt;The migration isn&apos;t about abandoning what works. It&apos;s about building a foundation that doesn&apos;t trap you in one vendor&apos;s pricing model.&lt;/p&gt;
&lt;h2&gt;Phase 1: Inventory Everything&lt;/h2&gt;
&lt;p&gt;Before moving anything, document what you have.&lt;/p&gt;
&lt;p&gt;Catalog every table, view, stored procedure, and ETL job in your current warehouse. For each, record:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Row count and data volume&lt;/li&gt;
&lt;li&gt;Write frequency (real-time, hourly batch, daily batch)&lt;/li&gt;
&lt;li&gt;Query frequency and peak concurrent users&lt;/li&gt;
&lt;li&gt;BI tools that depend on it&lt;/li&gt;
&lt;li&gt;Data owners and compliance classifications (PII, regulated, etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This inventory has two purposes. First, it tells you the scope of the migration. Second, it tells you the order in which to migrate : starting with high-value, lower-risk tables rather than the mission-critical ones that business stakeholders watch daily.&lt;/p&gt;
&lt;p&gt;Sort your inventory into three categories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Migrate first:&lt;/strong&gt; High query volume, non-sensitive, well-documented&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Migrate second:&lt;/strong&gt; Important but complex, either high sensitivity or complex dependencies&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Migrate last or keep:&lt;/strong&gt; Mission-critical financial or compliance reporting where rollback risk is highest&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Phase 2: Design the Lakehouse Architecture&lt;/h2&gt;
&lt;p&gt;Map your current tables to a Medallion architecture before writing any migration code.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Bronze layer:&lt;/strong&gt; Raw, typed views mapping directly to source system data. Minimal transformation. One-to-one with source tables.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Silver layer:&lt;/strong&gt; Joins, business logic, and filter conditions. This is where &amp;quot;active customer&amp;quot; and &amp;quot;churn rate&amp;quot; get their canonical definitions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Gold layer:&lt;/strong&gt; Aggregated, application-specific views for specific users, teams, or AI use cases.&lt;/p&gt;
&lt;p&gt;The Medallion mapping forces you to decide where transformations live. In a legacy warehouse, business logic accumulates in stored procedures, views, and ETL code spread across systems. The migration is your opportunity to consolidate it in the silver layer as SQL-defined virtual datasets.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/&quot;&gt;semantic layer&lt;/a&gt; handles this: virtual datasets are SQL views defined in the platform, versioned, and documented with wikis. Every downstream tool : dashboards, notebooks, AI agents , reads from the same logical definitions.&lt;/p&gt;
&lt;h2&gt;Phase 3: Run a Lighthouse Migration&lt;/h2&gt;
&lt;p&gt;Pick one dataset from your &amp;quot;migrate first&amp;quot; category and run the full migration as a proof of concept.&lt;/p&gt;
&lt;p&gt;Set up your Iceberg catalog (Apache Polaris or Dremio&apos;s Open Catalog). Create the target Iceberg table schema. Run the initial load : either through Spark, Dremio&apos;s ingestion tools, or your ETL framework of choice. Then run both the legacy warehouse and the new lakehouse in parallel for at least two weeks.&lt;/p&gt;
&lt;p&gt;During the parallel run:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Compare query outputs between the two systems for every report that uses this dataset&lt;/li&gt;
&lt;li&gt;Document any discrepancies and trace them to root causes&lt;/li&gt;
&lt;li&gt;Measure query performance in the new system against baseline&lt;/li&gt;
&lt;li&gt;Confirm that all connected BI tools work against the new data source&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The parallel run is where trust gets built. When your finance team sees the same number in both systems for 14 consecutive days, they&apos;ll accept the cutover.&lt;/p&gt;
&lt;h2&gt;Phase 4: Migrate in Waves&lt;/h2&gt;
&lt;p&gt;After the lighthouse migration validates the pattern, apply it in waves across the rest of your inventory.&lt;/p&gt;
&lt;p&gt;Wave 1 covers your &amp;quot;migrate first&amp;quot; category. These go through the same parallel run process, but the pattern is now established and the team moves faster.&lt;/p&gt;
&lt;p&gt;Wave 2 covers complex tables. These often require schema refactoring : long-accumulated technical debt that doesn&apos;t survive the migration unchanged. Plan for extra time on schema cleanup and downstream impact analysis.&lt;/p&gt;
&lt;p&gt;Wave 3 covers mission-critical tables. Run these in parallel for longer : 30 days minimum. Get explicit sign-off from business stakeholders before cutover.&lt;/p&gt;
&lt;p&gt;Don&apos;t decommission legacy tables immediately after cutover. Keep them available (read-only) for 60 days with clear documentation that they are no longer the source of truth. This gives you a rollback path and reduces the urgency pressure on your team.&lt;/p&gt;
&lt;h2&gt;Phase 5: Optimize and Govern&lt;/h2&gt;
&lt;p&gt;Once tables are on Iceberg, set up the maintenance schedule. Compaction, snapshot expiration, and manifest rewriting need to run regularly. Dremio&apos;s Automatic Table Optimization handles this for tables in its managed catalog.&lt;/p&gt;
&lt;p&gt;Build your governance layer in parallel with the migration, not after it. Every migrated table should have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Column-level PII labels&lt;/li&gt;
&lt;li&gt;Access control policies by role&lt;/li&gt;
&lt;li&gt;Wiki documentation describing what the table contains and how it&apos;s used&lt;/li&gt;
&lt;li&gt;Ownership assigned to a data domain team&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The lakehouse governance model can be more complete than what you had in the legacy warehouse, because open catalog systems like Apache Polaris and Dremio&apos;s Open Catalog support metadata that proprietary warehouses don&apos;t : including the semantic annotations that AI agents need to generate accurate queries.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/medallion-architecture-lakehouse.png&quot; alt=&quot;Medallion architecture bronze silver gold layers&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The Biggest Migration Failure Mode&lt;/h2&gt;
&lt;p&gt;The most common migration failure isn&apos;t technical : it&apos;s running both systems too long without a cutover date.&lt;/p&gt;
&lt;p&gt;After 90 days of parallel running, teams start treating the legacy warehouse as the authority again because it&apos;s &amp;quot;proven.&amp;quot; The lakehouse becomes a shadow system. Data engineers maintain both indefinitely.&lt;/p&gt;
&lt;p&gt;Set a hard cutover date for each table during the parallel run. Build consensus with stakeholders before the run starts, not after. If the parallel run surfaces discrepancies, fix them during the run : don&apos;t extend the timeline indefinitely.&lt;/p&gt;
&lt;h2&gt;Handling BI Tool Compatibility&lt;/h2&gt;
&lt;p&gt;Your BI tools : Tableau, Power BI, Looker , all support JDBC and ODBC connections. If your legacy warehouse exposes standard SQL, switching to Dremio as the query layer requires only a connection string change, not a report rebuild.&lt;/p&gt;
&lt;p&gt;The exceptions: if you relied on warehouse-specific SQL functions (Snowflake&apos;s ARRAY_CONSTRUCT, Redshift&apos;s DATEADD, BigQuery&apos;s DATE_DIFF), those queries need rewriting in standard SQL or Dremio equivalents. Run a SQL audit on your most frequently executed dashboard queries before migration to identify any non-standard functions early.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s &lt;a href=&quot;https://docs.dremio.com/current/reference/sql/&quot;&gt;SQL reference&lt;/a&gt; covers the full function set. Most standard analytics functions are supported directly. For the few warehouse-specific functions that don&apos;t have direct equivalents, virtual datasets let you implement them as SQL-defined macros that your BI tools call without modification.&lt;/p&gt;
&lt;p&gt;The good news: most Tableau and Power BI reports that use standard GROUP BY, JOIN, and aggregate functions work against Dremio unchanged. Your dashboard team won&apos;t notice the query engine changed : they&apos;ll just notice that their dashboards run faster.&lt;/p&gt;
&lt;h2&gt;Migrating Incremental Loads and Streaming Sources&lt;/h2&gt;
&lt;p&gt;Historical data is the easy part. Ongoing incremental loads are where migrations get complicated.&lt;/p&gt;
&lt;p&gt;Your legacy warehouse likely has ETL jobs that run on a schedule : nightly batch loads, hourly Kafka consumers, real-time CDC pipelines. Each of these needs a new destination configured before you cut over, and tested in parallel before you rely on it.&lt;/p&gt;
&lt;p&gt;For batch ETL, most frameworks (dbt, Airflow, Spark, Fivetran) support Iceberg as a write target. Swap the target from your legacy warehouse&apos;s connector to the Iceberg catalog connector and test the output against the legacy tables during your parallel run.&lt;/p&gt;
&lt;p&gt;For streaming sources, Apache Iceberg supports streaming writes through Flink and Spark Structured Streaming. The write semantics are different from traditional UPSERT : Iceberg uses append-by-default with merge-on-read for update scenarios. Verify that your streaming source&apos;s at-least-once delivery model interacts correctly with Iceberg&apos;s deduplication patterns before cutover.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and run your first lighthouse migration against a production-grade open lakehouse.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Building the Brain of the Agentic Lakehouse: Designing an Open Catalog Architecture</title><link>https://iceberglakehouse.com/posts/open-catalog-architecture-agentic-lakehouse/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/open-catalog-architecture-agentic-lakehouse/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-05-open-catalog-architecture-agenti...</description><pubDate>Thu, 28 May 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-05-open-catalog-architecture-agentic-lakehouse/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1&gt;Building the Brain of the Agentic Lakehouse: Designing an Open Catalog Architecture&lt;/h1&gt;
&lt;p&gt;An AI agent connected to a data platform needs to know three things before it can answer questions reliably: what data exists, what it means, and who is allowed to see it. In an agentic lakehouse, the catalog provides all three. Without a well-designed catalog, the agent is navigating blind.&lt;/p&gt;
&lt;p&gt;This post covers the architectural components of an open catalog designed for AI agent access, how Apache Polaris implements the open standard, and how Dremio&apos;s Open Catalog extends that foundation with the federation and governance features that production agentic systems require.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/open-catalog-agentic-architecture.png&quot; alt=&quot;Open catalog architecture for agentic lakehouse&quot;&gt;&lt;/p&gt;
&lt;h2&gt;What the Catalog Does&lt;/h2&gt;
&lt;p&gt;The catalog&apos;s primary job in any data lakehouse is table discovery: it tracks which tables exist, where their metadata files live, and what their schemas are. This is the function that Apache Polaris implements through the Iceberg REST catalog specification.&lt;/p&gt;
&lt;p&gt;But for agentic analytics, table discovery is only the foundation. The catalog also needs to provide:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Business context:&lt;/strong&gt; Not just &amp;quot;this table has 12 columns of these types&amp;quot; but &amp;quot;this table contains daily order aggregates, updated at 2 AM UTC, authoritative source is the order management system, relevant for revenue and fulfillment analytics.&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Semantic relationships:&lt;/strong&gt; Which tables should be joined for certain types of questions? Which virtual dataset contains the canonical &amp;quot;active customer&amp;quot; definition? What is the relationship between the &lt;code&gt;orders&lt;/code&gt; table and the &lt;code&gt;customers&lt;/code&gt; table?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Access policies:&lt;/strong&gt; Which roles can see which tables, which columns get masked for which roles, which rows are filtered based on user context?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Lineage:&lt;/strong&gt; Where did this data come from? Which pipelines produce it? Which downstream views and reports depend on it?&lt;/p&gt;
&lt;p&gt;A catalog that provides all four types of information gives AI agents the context they need to generate accurate, governed queries. Most standard Iceberg REST catalogs provide only table discovery. The agentic lakehouse requires all four.&lt;/p&gt;
&lt;h2&gt;Apache Polaris: The Open Standard Catalog&lt;/h2&gt;
&lt;p&gt;Apache Polaris is an open-source implementation of the Iceberg REST catalog specification, incubating in the Apache Software Foundation. It provides the standardized interface that Iceberg-compatible engines use to discover tables and get storage credentials.&lt;/p&gt;
&lt;p&gt;The key features Polaris provides:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;RBAC with credential vending:&lt;/strong&gt; Service principals (engines, users, agents) authenticate to Polaris and receive scoped, short-lived credentials that allow access to specific table paths. An agent with a read-only role gets credentials that cover only the files in the tables it&apos;s authorized to access.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Namespace management:&lt;/strong&gt; Catalogs are organized into namespaces that map to your organizational structure. A namespace might represent a business domain (e.g., &lt;code&gt;finance&lt;/code&gt;, &lt;code&gt;operations&lt;/code&gt;, &lt;code&gt;customer_success&lt;/code&gt;) or a data tier (bronze, silver, gold).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Multi-engine interoperability:&lt;/strong&gt; Because Polaris follows the open REST spec, any Iceberg-compatible engine can connect. Spark, Trino, Flink, and Dremio all speak the same catalog protocol.&lt;/p&gt;
&lt;p&gt;What Polaris doesn&apos;t provide natively: business context (wikis and documentation), row-level security, column masking, or federation to non-Iceberg sources. These require additional layers.&lt;/p&gt;
&lt;h2&gt;Dremio&apos;s Open Catalog: Polaris Plus&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Open Catalog extends Apache Polaris in two important directions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Federated sources:&lt;/strong&gt; Dremio&apos;s Open Catalog formula is: &lt;code&gt;Open Catalog = 1 Apache Polaris Catalog + Dremio Federated Sources&lt;/code&gt;. A single Dremio catalog namespace can include Iceberg tables in S3, PostgreSQL schemas, Snowflake warehouses, MongoDB collections, and Kafka streams. All of these appear in the same unified namespace, accessible through the same SQL interface.&lt;/p&gt;
&lt;p&gt;For AI agents, this means the catalog is the single authoritative source of what data exists. The agent doesn&apos;t need to know whether &amp;quot;active_customers&amp;quot; is stored in Iceberg or PostgreSQL : it queries the catalog, which routes the query appropriately.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Semantic layer:&lt;/strong&gt; Every table in Dremio&apos;s Open Catalog can have wiki documentation, column labels, and linked virtual datasets. The AI agent can query the semantic layer directly: &amp;quot;What does this column mean? What business metric does this table support? Is this the authoritative source?&amp;quot;&lt;/p&gt;
&lt;p&gt;Dremio&apos;s AI metadata generation creates initial wiki drafts automatically by sampling table schemas and data. A data steward reviews and approves. The catalog becomes self-documenting over time, with human oversight ensuring accuracy.&lt;/p&gt;
&lt;h2&gt;Catalog-as-Agent-Context&lt;/h2&gt;
&lt;p&gt;In a well-designed agentic lakehouse, the catalog doesn&apos;t just store metadata : it actively serves context to the agent at query time.&lt;/p&gt;
&lt;p&gt;When an agent connects through Dremio&apos;s &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;MCP server&lt;/a&gt;, it receives:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A description of available schemas and tables, with their wiki documentation&lt;/li&gt;
&lt;li&gt;Column-level descriptions and data type information&lt;/li&gt;
&lt;li&gt;Metadata about which virtual datasets represent canonical business metrics&lt;/li&gt;
&lt;li&gt;Access control context that determines which tables and columns the agent can query&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This catalog-as-context pattern means the agent doesn&apos;t need to explore the schema through multiple round-trips : the relevant context is provided upfront. Investigation still happens iteratively, but the agent starts with business-relevant context rather than generic database metadata.&lt;/p&gt;
&lt;p&gt;The quality of this context depends directly on the quality of the catalog documentation. Investing in documentation is investing in agent accuracy.&lt;/p&gt;
&lt;h2&gt;Open vs. Proprietary Catalog Design&lt;/h2&gt;
&lt;p&gt;The open catalog design (Apache Polaris + Dremio&apos;s extensions) contrasts with proprietary catalog approaches where the catalog metadata format is owned by the vendor.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Portability:&lt;/strong&gt; With an open catalog following the Iceberg REST spec, you can switch compute engines without migrating your catalog. Your semantic definitions, access policies, and table metadata stay in the catalog and are compatible with any new engine you add.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Multi-engine agent access:&lt;/strong&gt; Multiple AI agents from different frameworks (LangChain, LlamaIndex, Anthropic tool use, Dremio&apos;s built-in agent) can all connect to the same open catalog and access the same context. You don&apos;t need to maintain separate semantic definitions per agent framework.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Auditability:&lt;/strong&gt; Open catalog access logs can be exported to any SIEM or audit system. Proprietary catalogs may limit log access to their own tooling.&lt;/p&gt;
&lt;p&gt;The tradeoff: building and maintaining an open catalog with rich semantic documentation requires engineering investment. A proprietary, managed catalog may have lower initial setup cost but higher long-term cost through lock-in and reduced portability.&lt;/p&gt;
&lt;h2&gt;Structuring the Catalog for Agent Use&lt;/h2&gt;
&lt;p&gt;When designing your catalog for AI agent access, organize namespaces around business domains rather than technical data layers.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;catalog/
├── finance/
│   ├── bronze/          # Raw financial data
│   ├── silver/          # Cleaned, governed financial metrics
│   └── gold/            # Report-ready financial aggregates
├── operations/
│   ├── bronze/
│   ├── silver/
│   └── gold/
└── customer_success/
    ├── bronze/
    ├── silver/
    └── gold/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each silver-layer virtual dataset should have a full wiki description, column-level labels, and access control policies. The agent is configured to query the silver layer by default, escalating to gold for specific reporting use cases.&lt;/p&gt;
&lt;p&gt;Limit the agent&apos;s table access to the tier appropriate for its role. An agent serving business stakeholder questions should query silver and gold, not bronze. An agent running data quality checks may need bronze access.&lt;/p&gt;
&lt;p&gt;The catalog namespace structure directly affects the agent&apos;s ability to navigate to the right data. Clear, consistent naming within each domain reduces exploration cost and reduces the probability of the agent selecting the wrong table.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; to build and explore your open catalog architecture.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Evaluating the TCO of an Open Lakehouse vs. Proprietary Data Warehouses</title><link>https://iceberglakehouse.com/posts/open-lakehouse-vs-proprietary-warehouse-tco/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/open-lakehouse-vs-proprietary-warehouse-tco/</guid><description>
# Evaluating the TCO of an Open Lakehouse vs. Proprietary Data Warehouses

Before you sign a multiyear warehouse contract or commit to building an op...</description><pubDate>Thu, 28 May 2026 09:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;Evaluating the TCO of an Open Lakehouse vs. Proprietary Data Warehouses&lt;/h1&gt;
&lt;p&gt;Before you sign a multiyear warehouse contract or commit to building an open lakehouse, you need the actual numbers. Not marketing claims : a breakdown of what each architecture costs at different scales, where the hidden charges accumulate, and at what point the economics of one approach overtake the other.&lt;/p&gt;
&lt;p&gt;This post gives you the framework to run that comparison for your specific workload.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/lakehouse-vs-warehouse-tco.png&quot; alt=&quot;Open lakehouse vs proprietary warehouse TCO comparison&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The Components of Total Cost of Ownership&lt;/h2&gt;
&lt;p&gt;TCO comparisons fail when they only include the headline billing metrics. For warehouse vs. lakehouse, you need to account for:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Storage cost&lt;/strong&gt; , where your data lives and what it costs per GB/month&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compute cost&lt;/strong&gt; : query execution, ingestion, and transformation costs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Egress cost&lt;/strong&gt; : fees for moving data out of or between systems&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Engineering cost&lt;/strong&gt; : hours spent building, maintaining, and operating the platform&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Operational overhead&lt;/strong&gt; : governance, security configuration, maintenance automation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lock-in cost&lt;/strong&gt; : the cost of changing vendors or architectures later&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each of these plays out differently in a proprietary warehouse vs. an open lakehouse.&lt;/p&gt;
&lt;h2&gt;Storage: The Clearest Difference&lt;/h2&gt;
&lt;p&gt;Proprietary cloud data warehouses bundle storage into their offering. Snowflake charges approximately $23–40 per TB/month for storage (depending on tier and region). BigQuery charges $20 per TB/month on the flat-rate model. Both are significantly above raw object storage pricing.&lt;/p&gt;
&lt;p&gt;An open lakehouse stores data in Parquet files on S3, Azure Data Lake Storage, or GCS. S3 Standard pricing runs approximately $0.023 per GB/month : roughly $23 per TB/month, similar to warehouse pricing. The difference is what you get for that price.&lt;/p&gt;
&lt;p&gt;With a warehouse, the storage includes compute infrastructure (the warehouse&apos;s internal query engine, indexing, and caching). With object storage, you pay the raw storage rate and separately pay for the compute you use. For workloads where queries run infrequently, object storage is cheaper because you only pay for compute when queries run.&lt;/p&gt;
&lt;p&gt;At 100 TB, the raw storage cost is similar. At 1 PB, the proprietary warehouse often costs 3–5x more because most of that storage includes idle compute capacity that isn&apos;t being used.&lt;/p&gt;
&lt;h2&gt;Compute: The Nuanced Calculation&lt;/h2&gt;
&lt;p&gt;Warehouse compute is typically billed as credit consumption or warehouse-hours. A Snowflake X-Small warehouse (1 compute node) costs 1 credit per hour, with credit prices varying from $2–4 per credit depending on edition. BigQuery charges per byte scanned on the on-demand model.&lt;/p&gt;
&lt;p&gt;Open lakehouse compute is engine-specific. Dremio Cloud uses consumption-based billing : you pay for compute when queries run, not for idle time. Spark on spot instances runs 70–90% cheaper than on-demand for batch workloads. A multi-engine open lakehouse can route each workload to the cheapest engine that meets its SLA.&lt;/p&gt;
&lt;p&gt;The honest comparison: if your workload is primarily interactive BI with a predictable query pattern, a proprietary warehouse&apos;s all-inclusive pricing may be competitive because the vendor has optimized their engine for exactly that use case. If your workload is mixed :  streaming ingestion, batch ETL, interactive BI, ML feature engineering, and AI queries ,  the open lakehouse multi-engine routing saves money because you&apos;re not paying for warehouse-tier compute for batch workloads that don&apos;t need it.&lt;/p&gt;
&lt;h2&gt;The Hidden Costs of Proprietary Warehouses&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Compute markups on storage operations:&lt;/strong&gt; In Snowflake, loading data into your warehouse consumes credits. Moving data between tables, running COPY INTO operations, and automated clustering all consume credits. These operational costs are invisible until your first real production bill.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Egress fees:&lt;/strong&gt; Querying data stored in a different cloud region from your warehouse incurs cloud provider egress charges. Downloading query results to your BI tool over the public internet adds egress. These stack up at scale.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;BI seat costs:&lt;/strong&gt; Some warehouse vendors include BI licensing or charge per-seat for certain analytics features. These are not part of the base compute or storage pricing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Lock-in exit cost:&lt;/strong&gt; When you decide to migrate off a proprietary warehouse, you pay to export your data (egress), rebuild your ETL pipelines (engineering), and rewrite or retool your semantic definitions. This is a real cost that rarely appears in initial TCO calculations but matters significantly over a 5–7 year horizon.&lt;/p&gt;
&lt;h2&gt;The Hidden Costs of an Open Lakehouse&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Engineering time:&lt;/strong&gt; An open lakehouse requires engineering investment to configure, connect, and maintain multiple components. The catalog, the query engine, the ingestion pipeline, and the maintenance jobs all need setup. For a 3-person data team, this overhead is proportionally larger than for a 30-person team.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Maintenance automation:&lt;/strong&gt; Without automated compaction, snapshot expiration, and manifest rewriting, open Iceberg tables degrade in performance over time. Setting up and monitoring these maintenance jobs is an ongoing responsibility.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Multi-engine expertise:&lt;/strong&gt; Using Spark for batch, Dremio for interactive, and Flink for streaming requires familiarity with three systems. That learning curve is real.&lt;/p&gt;
&lt;p&gt;The engineering cost advantage narrows as team size grows. At 5 engineers, the proprietary warehouse&apos;s managed simplicity may be worth the price premium. At 20 engineers, the open lakehouse&apos;s savings on compute and storage typically exceed the engineering overhead cost.&lt;/p&gt;
&lt;h2&gt;The Break-Even Analysis&lt;/h2&gt;
&lt;p&gt;A rough break-even model for a team running 10 TB of data:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cost Component&lt;/th&gt;
&lt;th&gt;Proprietary Warehouse&lt;/th&gt;
&lt;th&gt;Open Lakehouse&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Storage (10 TB)&lt;/td&gt;
&lt;td&gt;~$300/month&lt;/td&gt;
&lt;td&gt;~$230/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interactive compute&lt;/td&gt;
&lt;td&gt;~$1,500/month&lt;/td&gt;
&lt;td&gt;~$800/month (Dremio)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch compute&lt;/td&gt;
&lt;td&gt;Included in compute above&lt;/td&gt;
&lt;td&gt;~$200/month (Spark spot)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engineering overhead&lt;/td&gt;
&lt;td&gt;~20 hrs/month&lt;/td&gt;
&lt;td&gt;~40 hrs/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total at $100/hr&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~$3,800/month&lt;/td&gt;
&lt;td&gt;~$3,230/month&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;At 10 TB with a balanced workload, the open lakehouse is moderately cheaper. At 100 TB, the storage savings alone shift the calculation significantly. The engineering overhead stays roughly fixed regardless of data volume.&lt;/p&gt;
&lt;p&gt;The inflection point for most teams is around 20–50 TB with a mixed workload. Below that, managed warehouse simplicity may win on total cost. Above it, the open lakehouse almost always wins.&lt;/p&gt;
&lt;h2&gt;Starting the Comparison for Your Team&lt;/h2&gt;
&lt;p&gt;Run the break-even analysis with your actual numbers: your current storage volume, your query patterns (interactive hours vs. batch hours), and your team&apos;s hourly cost. Factor in your current egress bill if you&apos;re moving data between systems.&lt;/p&gt;
&lt;p&gt;Then build a 3-year model. Year 1 often favors the warehouse because the engineering investment for the lakehouse shows up in that period. Year 3 almost always favors the lakehouse because the compound savings on storage and compute have accumulated and the engineering investment is amortized.&lt;/p&gt;
&lt;h2&gt;What Happens When You Need to Switch Vendors&lt;/h2&gt;
&lt;p&gt;The lock-in cost of a proprietary warehouse compounds over time because both your data and your business logic are stored in proprietary formats.&lt;/p&gt;
&lt;p&gt;Snowflake stores data in its internal columnar format. Migrating out means exporting every table to Parquet or CSV (an egress event), then reimporting to the new system. At petabyte scale, that export takes weeks and generates substantial egress fees. Your stored procedures, views, and UDFs need rewriting in the new system&apos;s SQL dialect. Your BI tool connections need reconfiguration.&lt;/p&gt;
&lt;p&gt;With an open lakehouse built on Iceberg, switching the query engine is a catalog connection change. Your data stays in Parquet on S3. Your business logic, if defined as SQL virtual datasets in Dremio&apos;s Open Catalog, is standard SQL that any compliant engine can read. The switching cost is hours, not months.&lt;/p&gt;
&lt;p&gt;This portability has real options value even if you never switch vendors. The possibility of switching without a major migration program strengthens your negotiating position. Vendors offer better pricing and terms to customers who demonstrably aren&apos;t locked in.&lt;/p&gt;
&lt;h2&gt;The Organizational Cost That Doesn&apos;t Fit a Spreadsheet&lt;/h2&gt;
&lt;p&gt;One TCO component that&apos;s hard to quantify: the cost of slow data access and high query latency on organizational decisions.&lt;/p&gt;
&lt;p&gt;Teams that wait 20 minutes for a warehouse query to return don&apos;t ask that question again. They work around slow data with approximations, stale reports, or gut instinct. The cost shows up as missed opportunities and decisions made on incomplete information : not as a line item on your cloud bill.&lt;/p&gt;
&lt;p&gt;A data platform that consistently returns answers in under 2 seconds changes how analysts work. They ask more questions, explore more hypotheses, and validate more assumptions. That behavioral change has compounding value that pure infrastructure TCO models don&apos;t capture.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s &lt;a href=&quot;https://www.dremio.com/blog/5-ways-dremio-delivers-an-apache-iceberg-lakehouse-without-the-headaches/&quot;&gt;5 Ways to Deliver an Apache Iceberg Lakehouse&lt;/a&gt; covers the platform-specific costs in more detail.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and run your actual workload against the open lakehouse to measure your real cost profile.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Real-Time BI: Enabling Sub-Second Queries on Apache Iceberg Data Lakehouses</title><link>https://iceberglakehouse.com/posts/real-time-bi-iceberg-lakehouse/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/real-time-bi-iceberg-lakehouse/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-05-real-time-bi-iceberg-lakehouse/)...</description><pubDate>Thu, 28 May 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-05-real-time-bi-iceberg-lakehouse/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1&gt;Real-Time BI: Enabling Sub-Second Queries on Apache Iceberg Data Lakehouses&lt;/h1&gt;
&lt;p&gt;The standard knock on cloud object storage for analytics is latency. S3 GET requests average 20–50 milliseconds each. A dashboard query that scans 10,000 files issues 10,000 of those requests, which means 3–8 minutes of wall time before the analyst sees a result. That&apos;s not a BI experience : it&apos;s a batch report.&lt;/p&gt;
&lt;p&gt;Sub-second interactive BI on Apache Iceberg is achievable, but it requires understanding which parts of the latency problem you&apos;re solving and which tools solve each part.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/real-time-bi-iceberg-architecture.png&quot; alt=&quot;Real-time BI architecture on Apache Iceberg with Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The Three Sources of Latency on Object Storage&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;File scan latency:&lt;/strong&gt; Object storage is optimized for throughput, not random access. Reading many small files is slower than reading fewer large files, because each file requires a separate GET request with its own network round-trip.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Metadata scan overhead:&lt;/strong&gt; Before reading data files, the query engine reads Iceberg metadata : the manifest list, then the manifests, then the file statistics. On a table with many snapshots and fragmented manifests, this overhead is measurable in seconds before any data is read.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data transfer latency:&lt;/strong&gt; Once the engine decides which files to read, it transfers them from the object store to compute memory. The transfer rate depends on the network bandwidth between the object store and compute nodes, and on whether any caching is in place.&lt;/p&gt;
&lt;p&gt;Each source has a different solution.&lt;/p&gt;
&lt;h2&gt;Solution 1: File Layout Optimization&lt;/h2&gt;
&lt;p&gt;The cheapest way to reduce file scan latency is to stop creating small files in the first place, and to compact them when they accumulate.&lt;/p&gt;
&lt;p&gt;Target file size for Iceberg tables optimized for analytical read workloads is 128 MB to 512 MB per file. Files smaller than 10 MB are a performance liability : they add metadata overhead and force the engine to issue more GET requests for the same amount of data.&lt;/p&gt;
&lt;p&gt;Partition your tables by the column most commonly used in query filters. If 80% of your queries filter by &lt;code&gt;region&lt;/code&gt; and &lt;code&gt;date&lt;/code&gt;, partition by region and date. The Iceberg planner will skip files in non-matching partitions without reading them at all. A query against a table with 1 million files might only scan 5,000 files after partition pruning : effectively a 200x reduction in scan work.&lt;/p&gt;
&lt;p&gt;For streaming data sources that write many small files continuously, run daily compaction to merge them. Dremio&apos;s Automatic Table Optimization handles this as a background job.&lt;/p&gt;
&lt;h2&gt;Solution 2: Columnar Cloud Cache (C3)&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Columnar Cloud Cache (C3) addresses data transfer latency by storing frequently-accessed data files on local NVMe SSDs at executor nodes. When a query requests a file that&apos;s in the cache, the engine reads from local disk instead of making an object storage GET request. Local NVMe read latency is measured in microseconds, not milliseconds.&lt;/p&gt;
&lt;p&gt;C3 caches at the sub-file level: it stores the specific column chunks within a Parquet file that queries have accessed, not the entire file. A table with 50 columns where dashboards consistently read 5 columns will cache those 5 columns. Cache efficiency is high because columnar access patterns are predictable.&lt;/p&gt;
&lt;p&gt;Cache hit rate depends on workload consistency. Dashboard queries that run the same report against the same time window every hour have near-100% cache hit rates after the first warm-up period. Ad-hoc queries across arbitrary date ranges have lower hit rates.&lt;/p&gt;
&lt;p&gt;The tradeoff: NVMe storage at executor nodes costs more than object storage. Size the cache based on your active dataset : the data that users actually query in the last 30 days, not the full historical lake.&lt;/p&gt;
&lt;h2&gt;Solution 3: Reflections&lt;/h2&gt;
&lt;p&gt;Reflections are pre-computed, optimized copies of data stored as Iceberg tables. When a user runs a query, Dremio&apos;s optimizer checks whether a Reflection covers the query&apos;s requirements. If it does, the engine substitutes the Reflection transparently : the user&apos;s query against the full table returns results in milliseconds because it&apos;s actually reading from a small, aggregated Reflection.&lt;/p&gt;
&lt;p&gt;There are two types:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Aggregate Reflections&lt;/strong&gt; pre-compute GROUP BY aggregations. A monthly revenue by region query that normally scans 500 GB of transaction data might run against a 2 MB aggregated Reflection instead.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Raw Reflections&lt;/strong&gt; create sorted, columnar copies of tables or views. They&apos;re useful for queries that can&apos;t be satisfied by pre-aggregated data but benefit from better sort order and file layout than the original table.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Autonomous Reflections&lt;/strong&gt; take this further: Dremio monitors query patterns over a 7-day rolling window and automatically creates, refreshes, and drops Reflections based on observed access patterns. You don&apos;t have to identify which queries need acceleration : the system does it.&lt;/p&gt;
&lt;p&gt;The key limitation of Reflections: they reflect data at the time of their last refresh. For near-real-time data, you need Reflections that refresh every few minutes, which adds cost. For data that&apos;s updated daily, a nightly refresh is sufficient. Match Reflection refresh frequency to your data freshness requirements.&lt;/p&gt;
&lt;h2&gt;Putting It Together: A Sub-Second BI Architecture&lt;/h2&gt;
&lt;p&gt;A practical sub-second BI architecture on Iceberg combines all three solutions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Write optimized files:&lt;/strong&gt; Target 128-512 MB file sizes, partition by filter columns, run daily compaction on streaming tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cache hot data:&lt;/strong&gt; Size C3 cache for the active 30-day dataset. Monitor cache hit rates and expand as needed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accelerate dashboards:&lt;/strong&gt; Create Aggregate Reflections for high-frequency dashboard queries. Enable Autonomous Reflections for the broader query workload.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Maintain metadata:&lt;/strong&gt; Run weekly manifest rewrites and snapshot expiration to keep query planning fast.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;With this architecture, Dremio has demonstrated sub-second response times on Iceberg tables containing hundreds of billions of rows, for query patterns that Reflections cover. Cold queries on un-reflected data take longer, but even there, C3 and file layout optimization reduce latency from minutes to seconds.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/dremio-query-acceleration-layers.png&quot; alt=&quot;Dremio query acceleration layers for Iceberg BI&quot;&gt;&lt;/p&gt;
&lt;h2&gt;What This Means for Your BI Stack&lt;/h2&gt;
&lt;p&gt;The biggest benefit of sub-second Iceberg queries isn&apos;t the speed itself : it&apos;s the tool compatibility. BI tools like Tableau, Power BI, Looker, and Superset expect query response times under 2 seconds for interactive use. When your data platform can&apos;t deliver that, analysts work around it by creating data extracts, local caches, and summary tables maintained separately from the canonical dataset.&lt;/p&gt;
&lt;p&gt;Those workarounds are the precursors to data swamps. When the underlying platform delivers sub-second responses, analysts stop maintaining workarounds and work directly against the governed, canonical data.&lt;/p&gt;
&lt;h2&gt;Setting SLAs for Different Query Tiers&lt;/h2&gt;
&lt;p&gt;Not all queries need sub-second response. Building a tiered SLA model aligns your investment in acceleration with actual user expectations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tier 1 : Executive dashboards and operational metrics:&lt;/strong&gt; Target sub-second response (under 1 second). Use Aggregate Reflections that refresh every 5–15 minutes. These are the queries your business depends on checking multiple times per day.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tier 2 : Analyst self-service queries:&lt;/strong&gt; Target under 10 seconds. Use Raw Reflections on commonly-queried tables and rely on C3 cache for warm queries. Analysts can tolerate a brief wait, but anything over 30 seconds breaks the investigation flow.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tier 3 : Ad-hoc and historical analysis:&lt;/strong&gt; Target under 2 minutes. No special acceleration , optimized file layout and partition pruning do the work. These queries run occasionally and analysts don&apos;t expect instant results.&lt;/p&gt;
&lt;p&gt;Document these tiers explicitly and share them with your user community. Unrealistic expectations of sub-second response for complex historical queries create dissatisfaction even when the platform is working correctly.&lt;/p&gt;
&lt;h2&gt;Freshness and the Real-Time Tradeoff&lt;/h2&gt;
&lt;p&gt;Sub-second BI and real-time data freshness are in tension. Reflections that deliver sub-second response are snapshots of data at their last refresh time. A Reflection refreshed every 15 minutes has data that&apos;s up to 15 minutes stale.&lt;/p&gt;
&lt;p&gt;For most BI use cases, 15-minute freshness is acceptable. Intraday revenue dashboards that update every 15 minutes are genuinely useful for business monitoring. Where freshness matters at a finer granularity : transaction monitoring, fraud detection, live operational dashboards , you need either very frequent Reflection refreshes or direct query paths against the most recent data files.&lt;/p&gt;
&lt;p&gt;Dremio handles this through incremental Reflection refresh: instead of recomputing the entire Reflection from scratch, it reads only the new Iceberg snapshots since the last refresh and appends the new aggregates. An incremental refresh on a table that receives hourly updates takes seconds, not minutes, making 5-minute refresh intervals practical.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and measure your query latency improvement against your current Iceberg setup.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The Rise of Agentic Analytics: Shifting BI from Passive Dashboards to Goal-Directed Action</title><link>https://iceberglakehouse.com/posts/rise-of-agentic-analytics/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/rise-of-agentic-analytics/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-05-rise-of-agentic-analytics/).

# ...</description><pubDate>Thu, 28 May 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-05-rise-of-agentic-analytics/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1&gt;The Rise of Agentic Analytics: Shifting BI from Passive Dashboards to Goal-Directed Action&lt;/h1&gt;
&lt;p&gt;Dashboards have a fundamental design problem: they answer the question the designer anticipated, not the question the business needs answered today. A revenue dashboard shows you revenue is down 12% this month. It doesn&apos;t tell you which product line, which region, which customer segment, which sales motion : unless someone thought to build that drill-down when they designed the dashboard six months ago.&lt;/p&gt;
&lt;p&gt;The analyst fills the gap. They open the dashboard, see the anomaly, download the CSV, write Python or SQL, iterate through hypotheses, and two hours later produce an answer. Sometimes the answer prompts another question. The cycle repeats.&lt;/p&gt;
&lt;p&gt;Agentic analytics replaces the human iteration cycle with an autonomous agent that pursues the business question until it has a defensible answer.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/agentic-vs-passive-bi.png&quot; alt=&quot;Agentic analytics vs passive BI comparison diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;What Makes Analytics &amp;quot;Agentic&amp;quot;&lt;/h2&gt;
&lt;p&gt;Traditional BI is reactive and passive. A user asks a question; the system returns a fixed result. The query runs, the chart renders, and the session ends. The system holds no state between queries. It doesn&apos;t remember what was asked before or adjust its behavior based on what it learned.&lt;/p&gt;
&lt;p&gt;An agentic analytics system is goal-directed and iterative. You give it a business objective : &amp;quot;identify the cause of the 12% revenue drop this month&amp;quot; , and it runs a reasoning loop to pursue that objective. The loop looks like this:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Decompose the objective into a sequence of hypotheses&lt;/li&gt;
&lt;li&gt;Write SQL to test the first hypothesis&lt;/li&gt;
&lt;li&gt;Run the query and examine the results&lt;/li&gt;
&lt;li&gt;Determine whether the hypothesis is confirmed, refuted, or inconclusive&lt;/li&gt;
&lt;li&gt;Adjust the next hypothesis based on what was learned&lt;/li&gt;
&lt;li&gt;Repeat until the objective is satisfied&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The agent writes the queries, runs them, reads the results, and decides what to ask next. This is qualitatively different from a chatbot that translates one question into one query. The agent maintains context across a multi-step investigation.&lt;/p&gt;
&lt;h2&gt;The Bottleneck Traditional BI Creates&lt;/h2&gt;
&lt;p&gt;Static dashboards require analysts as intermediaries between business questions and data. That intermediation creates a throughput bottleneck.&lt;/p&gt;
&lt;p&gt;A typical analytics team with 10 analysts supports hundreds of business stakeholders. Each stakeholder generates multiple requests per week. Analysts prioritize the highest-impact requests, leaving others waiting. The average request-to-answer cycle in most organizations is 3–5 business days.&lt;/p&gt;
&lt;p&gt;During those 3–5 days, business conditions continue to change. The answer delivered on day 5 is based on data from day 1. In fast-moving markets, that lag makes the answer less useful than the delay suggests.&lt;/p&gt;
&lt;p&gt;Agentic analytics removes the analyst as the bottleneck for defined categories of analytical work. Root cause analysis, anomaly investigation, metric decomposition, and cohort comparison are all structured enough that an agent can execute them reliably. Analysts shift to defining the questions and reviewing the outputs, rather than performing the investigation manually.&lt;/p&gt;
&lt;h2&gt;What Agentic Analytics Requires From the Data Foundation&lt;/h2&gt;
&lt;p&gt;An AI agent is only as accurate as the data it works with and the context it has about that data.&lt;/p&gt;
&lt;p&gt;The failure mode is well-documented: give an LLM direct access to raw data files with generic column names like &lt;code&gt;col_23&lt;/code&gt; and &lt;code&gt;amt_usd_2024_q3&lt;/code&gt;, and it generates plausible-sounding SQL that often returns wrong answers. The model doesn&apos;t know what &lt;code&gt;amt_usd_2024_q3&lt;/code&gt; means in your business context.&lt;/p&gt;
&lt;p&gt;Three things fix this:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Semantic context:&lt;/strong&gt; Your data catalog needs human-readable documentation at the table and column level. What does this table contain? What does this column measure? What business term does it map to? This context is what allows the agent to translate a business question into correct SQL. Dremio&apos;s &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/&quot;&gt;semantic layer&lt;/a&gt; : built from virtual datasets, wikis, and labels , provides exactly this context.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Consistent metric definitions:&lt;/strong&gt; &amp;quot;Active user&amp;quot; should mean the same thing everywhere. Define canonical metrics as virtual datasets in your catalog. The agent uses the virtual dataset, not the raw table, when answering questions about active users. Consistency eliminates the class of errors where different queries answer the &amp;quot;same&amp;quot; question with different logic.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Broad data access without data movement:&lt;/strong&gt; An agentic analytics system that can only see data in one warehouse answers only the questions that warehouse can answer. Dremio&apos;s query federation connects the agent to all your data sources : operational databases, cloud warehouses, data lakes, SaaS APIs , through a unified semantic layer. The agent can join Salesforce opportunity data with Snowflake revenue data with Iceberg transaction history in a single investigation.&lt;/p&gt;
&lt;h2&gt;Goal-Directed Search vs. Single-Query Response&lt;/h2&gt;
&lt;p&gt;The architectural distinction between agentic analytics and text-to-SQL is the search loop.&lt;/p&gt;
&lt;p&gt;Text-to-SQL converts one natural language question to one SQL query. It&apos;s stateless. The quality of the answer depends entirely on whether that single query captures the full analytical intent.&lt;/p&gt;
&lt;p&gt;A goal-directed agent runs an iterative search. It generates an initial query, evaluates the result, decides whether it needs to refine the query or ask a follow-up question, and continues until the goal is satisfied or the agent determines it lacks the data to satisfy it.&lt;/p&gt;
&lt;p&gt;The search loop also enables self-correction. If a query returns an error, the agent reads the error message, diagnoses the problem (wrong column name, type mismatch, missing join condition), and retries with a corrected query. Analysts do this instinctively. Agentic systems do it systematically.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/agentic-search-loop.png&quot; alt=&quot;Agentic analytics goal-directed search loop diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Automated Workflows: Beyond Investigation&lt;/h2&gt;
&lt;p&gt;Agentic analytics extends beyond on-demand investigation. The same agent architecture that investigates anomalies can run on a schedule, monitoring KPIs autonomously and triggering investigations when metrics move outside expected ranges.&lt;/p&gt;
&lt;p&gt;An agent configured to monitor daily revenue can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Detect a drop at 9 AM when the overnight batch completes&lt;/li&gt;
&lt;li&gt;Investigate root cause using the iterative query loop&lt;/li&gt;
&lt;li&gt;Identify that three specific customer accounts had processing failures&lt;/li&gt;
&lt;li&gt;Generate a summary with the relevant account IDs and estimated revenue impact&lt;/li&gt;
&lt;li&gt;Send that summary to the account management team before 10 AM standup&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;No analyst needs to notice the anomaly, prioritize it, and start an investigation. The work happens automatically in the time between the data landing and the team starting their day.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s &lt;a href=&quot;https://www.dremio.com/blog/the-future-of-bi-is-agentic-how-dremio-lets-you-talk-to-your-data-wherever-it-lives/&quot;&gt;Agentic Lakehouse platform&lt;/a&gt; supports both the on-demand and scheduled variants of agentic analytics, with the semantic layer providing the consistent context the agent needs to operate reliably.&lt;/p&gt;
&lt;h2&gt;What This Means for the Analytics Team&lt;/h2&gt;
&lt;p&gt;The rise of agentic analytics doesn&apos;t eliminate the analyst role. It changes it.&lt;/p&gt;
&lt;p&gt;Manual query writing, dashboard maintenance, and stakeholder interview cycles become a smaller part of the job. System design : defining the semantic layer, configuring the agent&apos;s investigation patterns, reviewing outputs, and catching errors the agent makes , becomes a larger part.&lt;/p&gt;
&lt;p&gt;Analysts who adapt will handle 10x the analytical throughput with the same team size. The shift requires learning to evaluate agent-generated analysis rather than generating it directly, and learning to define the context that makes agent outputs trustworthy.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and run your first agentic analytics workflow against your existing data.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The Semantic Layer as a Translation Engine: Bridging Natural Language and SQL</title><link>https://iceberglakehouse.com/posts/semantic-layer-translation-engine/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/semantic-layer-translation-engine/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-05-semantic-layer-translation-engin...</description><pubDate>Thu, 28 May 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-05-semantic-layer-translation-engine/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1&gt;The Semantic Layer as a Translation Engine: Bridging Natural Language and SQL&lt;/h1&gt;
&lt;p&gt;&amp;quot;What was our revenue last quarter?&amp;quot; is a five-word question. The SQL that correctly answers it might be 40 lines long : joining three tables, applying a canonical metric definition, filtering by the right date boundaries, excluding specific transaction types, and handling currency normalization for international transactions.&lt;/p&gt;
&lt;p&gt;An AI agent bridging from that natural language question to that SQL has a translation problem. It needs to know what &amp;quot;revenue&amp;quot; means in your specific business context, what counts as &amp;quot;last quarter,&amp;quot; which tables contain the relevant data, and how those tables relate to each other.&lt;/p&gt;
&lt;p&gt;The semantic layer is what carries that business knowledge from your data team to the AI agent. Without it, the agent guesses. With it, the agent translates.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/semantic-layer-translation-engine.png&quot; alt=&quot;Semantic layer as translation engine between natural language and SQL&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The Translation Stack&lt;/h2&gt;
&lt;p&gt;The path from a business question to a SQL result passes through several translation steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Intent recognition:&lt;/strong&gt; The LLM interprets the question and identifies the analytical intent (revenue by dimension, over a time period)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Entity resolution:&lt;/strong&gt; Business terms like &amp;quot;revenue,&amp;quot; &amp;quot;last quarter,&amp;quot; and &amp;quot;region&amp;quot; are mapped to specific SQL constructs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema mapping:&lt;/strong&gt; The resolved entities are mapped to actual tables and columns in the data catalog&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Query generation:&lt;/strong&gt; The agent writes SQL based on the resolved schema mapping&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Validation:&lt;/strong&gt; The query is validated against the schema and (optionally) executed in a test mode before full execution&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The semantic layer is the component that makes steps 2 and 3 reliable. It defines the mapping from business terms to SQL constructs, and it provides the schema documentation that the agent uses to select the right tables and columns.&lt;/p&gt;
&lt;p&gt;Without a semantic layer, the agent attempts steps 2 and 3 using only its general knowledge and the raw schema. General knowledge is unreliable for company-specific business logic. Raw schemas provide column names but not business meaning.&lt;/p&gt;
&lt;h2&gt;Virtual Datasets: Encoding Business Logic as SQL&lt;/h2&gt;
&lt;p&gt;The central artifact of the semantic layer is the virtual dataset (VDS) : a SQL view that encodes business logic as a reusable, named query.&lt;/p&gt;
&lt;p&gt;A revenue VDS might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Virtual Dataset: canonical_revenue
SELECT
  o.order_id,
  o.customer_id,
  c.region,
  c.segment,
  p.category AS product_category,
  p.subcategory AS product_subcategory,
  o.order_date,
  o.amount_usd * fx.usd_rate AS revenue_usd  -- Currency normalized
FROM orders o
  JOIN customers c ON o.customer_id = c.customer_id
  JOIN products p ON o.product_id = p.product_id
  JOIN fx_rates fx ON
    o.currency = fx.currency
    AND DATE_TRUNC(&apos;day&apos;, o.order_date) = fx.rate_date
WHERE o.status = &apos;completed&apos;  -- Exclude cancelled and refunded
  AND o.test_order = false    -- Exclude internal test transactions
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This VDS encodes four business rules: use the &lt;code&gt;completed&lt;/code&gt; status, exclude test orders, normalize currency to USD, and join the canonical customer and product dimensions. When the AI agent queries &lt;code&gt;canonical_revenue&lt;/code&gt;, it automatically applies all four rules correctly, even if it doesn&apos;t know those rules exist.&lt;/p&gt;
&lt;p&gt;The agent writes a simple query:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT region, SUM(revenue_usd)
FROM canonical_revenue
WHERE order_date BETWEEN &apos;2026-01-01&apos; AND &apos;2026-03-31&apos;
GROUP BY region
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The underlying join complexity and business rule enforcement happen inside the VDS definition. The agent&apos;s query is simple and correct.&lt;/p&gt;
&lt;h2&gt;Metric Definitions in the Semantic Layer&lt;/h2&gt;
&lt;p&gt;Beyond individual columns, the semantic layer should define composite metrics as named objects.&lt;/p&gt;
&lt;p&gt;For a tool like Dremio&apos;s semantic layer, metric definitions are SQL expressions annotated with documentation:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Metric: monthly_active_users
-- Definition: Users with at least one completed order in the calendar month
SELECT
  DATE_TRUNC(&apos;month&apos;, o.order_date) AS month,
  COUNT(DISTINCT o.customer_id) AS monthly_active_users
FROM canonical_orders o
GROUP BY 1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When the agent needs to answer &amp;quot;how many active users did we have last month,&amp;quot; it finds the &lt;code&gt;monthly_active_users&lt;/code&gt; metric definition in the catalog, uses it as the authoritative source, and generates a query against that metric rather than reinventing the definition from scratch.&lt;/p&gt;
&lt;p&gt;Metric consistency is the most practically important benefit of semantic layer documentation. Inconsistent metric definitions : different teams calculating &amp;quot;active user&amp;quot; differently , are the most common cause of business stakeholders losing confidence in a data platform. The semantic layer resolves this by making one definition authoritative.&lt;/p&gt;
&lt;h2&gt;Wikis and Labels: Context for the Agent&lt;/h2&gt;
&lt;p&gt;Not all business context can be encoded in SQL. The semantic layer also needs natural language documentation that the agent can query directly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table wikis&lt;/strong&gt; describe the table&apos;s purpose, the business process it represents, the update frequency, the authoritative source system, and any known data quality issues:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;The &lt;code&gt;canonical_revenue&lt;/code&gt; virtual dataset represents completed transaction revenue, normalized to USD. It is updated nightly at 02:00 UTC from the order management system. Revenue is defined as the amount_usd of all orders with status=&apos;completed&apos; and test_order=false. Currency conversion uses end-of-day FX rates from the fx_rates table. This VDS does not include refunds; see &lt;code&gt;net_revenue&lt;/code&gt; for refund-adjusted figures.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Column labels&lt;/strong&gt; classify columns by type: &lt;code&gt;PII: Email&lt;/code&gt;, &lt;code&gt;Financial: Revenue&lt;/code&gt;, &lt;code&gt;Operational: Timestamp&lt;/code&gt;, &lt;code&gt;Key: Customer&lt;/code&gt;. The agent uses these labels to understand the purpose of a column without reading its values.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Relationship annotations&lt;/strong&gt; describe how datasets connect: &amp;quot;&lt;code&gt;canonical_orders&lt;/code&gt; joins to &lt;code&gt;customers&lt;/code&gt; on &lt;code&gt;customer_id&lt;/code&gt;. The relationship is many-to-one. Not all customers have orders : left join required for customer-level reporting that includes customers with zero orders.&amp;quot;&lt;/p&gt;
&lt;p&gt;When the agent is uncertain, it queries these annotations directly. Dremio&apos;s MCP server exposes them as searchable metadata, so the agent can ask &amp;quot;what does the revenue_usd column in canonical_revenue represent?&amp;quot; and get the documented answer.&lt;/p&gt;
&lt;h2&gt;Natural Language Translation Accuracy Tests&lt;/h2&gt;
&lt;p&gt;Build a test suite for your semantic layer&apos;s translation accuracy. The test suite contains reference questions with known correct SQL answers:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Natural language question&lt;/th&gt;
&lt;th&gt;Expected SQL pattern&lt;/th&gt;
&lt;th&gt;Correct result range&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;quot;Revenue last quarter&amp;quot;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SUM(revenue_usd)&lt;/code&gt; from &lt;code&gt;canonical_revenue&lt;/code&gt; with Q1 2026 filter&lt;/td&gt;
&lt;td&gt;$45M-$60M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;quot;Monthly active users this year&amp;quot;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;COUNT(DISTINCT customer_id)&lt;/code&gt; monthly from &lt;code&gt;canonical_orders&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;15K-25K/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;quot;Top 5 product categories by revenue&amp;quot;&lt;/td&gt;
&lt;td&gt;Revenue by &lt;code&gt;product_category&lt;/code&gt; with LIMIT 5&lt;/td&gt;
&lt;td&gt;Specific category names&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Run the agent against each test question weekly. Track accuracy over time. When accuracy drops on a specific question type, it usually indicates a documentation gap : add more context to the relevant VDS or column wiki.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/semantic-layer-accuracy-testing.png&quot; alt=&quot;Semantic layer accuracy test suite results&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Where the Semantic Layer Lives in Dremio&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s semantic layer is part of the platform&apos;s Open Catalog, not a separate product. Every virtual dataset is a SQL view that can be queried directly, shared with BI tools, and documented with wikis. The same VDS that the AI agent uses is the same one that Tableau, Power BI, or Looker queries , ensuring that dashboards and AI answers are based on identical business logic.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/&quot;&gt;semantic layer documentation&lt;/a&gt; covers how to build a three-tier (bronze/silver/gold) semantic layer in Dremio. The gold tier is the agent&apos;s primary entry point for business questions. The silver tier supports more complex multi-step analyses. The bronze tier is restricted to pipeline agents and data engineers.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and build your first semantic layer on top of your existing data sources.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Comparing the Top 2026 Agentic Analytics Tools: ThoughtSpot, Databricks, and Tableau</title><link>https://iceberglakehouse.com/posts/top-agentic-analytics-tools-2026/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/top-agentic-analytics-tools-2026/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-05-top-agentic-analytics-tools-2026...</description><pubDate>Thu, 28 May 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-05-top-agentic-analytics-tools-2026/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1&gt;Comparing the Top 2026 Agentic Analytics Tools: ThoughtSpot, Databricks, and Tableau&lt;/h1&gt;
&lt;p&gt;The agentic analytics vendor landscape shifted significantly in 2025–2026. Every major BI and data platform added some form of natural language querying or AI agent capability. The terminology converged on &amp;quot;agentic analytics&amp;quot; while the architectures diverged considerably.&lt;/p&gt;
&lt;p&gt;Choosing between these platforms requires clarity on what &amp;quot;agentic&amp;quot; actually means in each product&apos;s implementation : and which implementation matches your data architecture, your team&apos;s expertise, and your specific use cases.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/agentic-analytics-platforms-comparison.png&quot; alt=&quot;Agentic analytics platforms comparison 2026&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Evaluation Criteria&lt;/h2&gt;
&lt;p&gt;Before comparing specific products, establish the criteria that matter. Agentic analytics platforms differ significantly on:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Semantic depth:&lt;/strong&gt; How deeply does the platform understand the business meaning of data? Can it distinguish &amp;quot;revenue&amp;quot; from &amp;quot;gross revenue&amp;quot; from &amp;quot;net revenue&amp;quot; consistently? Does it respect your existing metric definitions, or does it generate its own interpretations?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Autonomy range:&lt;/strong&gt; How complex are the analytical tasks the agent can perform autonomously? Single-query translation? Multi-step investigation? Anomaly detection and root cause analysis? Proactive monitoring?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data access breadth:&lt;/strong&gt; Can the agent reach all your data sources, or only data already ingested into the vendor&apos;s platform? Does it support federated queries across cloud environments?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Governance integration:&lt;/strong&gt; Does the agent respect your access control policies? Does it enforce column masking for PII? Does it log its queries to your audit trail?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Open standards compatibility:&lt;/strong&gt; Is the agent locked to the vendor&apos;s proprietary format, or does it work with open standards like Apache Iceberg and the Iceberg REST catalog?&lt;/p&gt;
&lt;h2&gt;ThoughtSpot&lt;/h2&gt;
&lt;p&gt;ThoughtSpot&apos;s agentic analytics story centers on Spotter, its AI agent built on top of the ThoughtSpot semantic graph. The semantic graph stores pre-defined relationships between tables, columns, and business metrics : essentially a structured representation of your business logic.&lt;/p&gt;
&lt;p&gt;Spotter uses the semantic graph as grounding for its queries. When you ask about &amp;quot;revenue by region,&amp;quot; Spotter resolves &amp;quot;revenue&amp;quot; to the canonical metric definition in the semantic graph before writing any SQL. This grounding makes Spotter relatively reliable for questions within the scope of the defined metrics.&lt;/p&gt;
&lt;p&gt;The limitation is coverage. The semantic graph must be built and maintained manually. Metrics and relationships that aren&apos;t in the graph are outside Spotter&apos;s reliable scope. For organizations with well-maintained ThoughtSpot environments, this works well. For organizations with rapidly evolving schemas or metrics that haven&apos;t been formalized, the agent&apos;s coverage gaps are a practical constraint.&lt;/p&gt;
&lt;p&gt;ThoughtSpot&apos;s data access model typically requires data to be ingested into ThoughtSpot&apos;s managed environment or connected through ThoughtSpot&apos;s supported connectors. Federated queries across arbitrary external sources are limited compared to dedicated federation platforms.&lt;/p&gt;
&lt;h2&gt;Databricks Genie&lt;/h2&gt;
&lt;p&gt;Databricks&apos; AI/BI feature set, anchored by Genie, takes a different approach. Genie operates on data stored in the Databricks Lakehouse, which uses Delta Lake (Databricks&apos; proprietary table format) or Unity Catalog-managed Iceberg tables.&lt;/p&gt;
&lt;p&gt;Genie&apos;s strength is integration with the Databricks ecosystem: it can write SQL, Python, and more complex analytical workflows within a single platform. For teams already running Databricks for ETL, ML, and analytics, Genie provides a consistent interface across all of those workloads.&lt;/p&gt;
&lt;p&gt;The semantic grounding in Genie relies on Unity Catalog&apos;s metadata and any additional context provided through Genie spaces : documented spaces where administrators define what data is available and provide natural language descriptions. The quality of Genie&apos;s responses is highly sensitive to how well those spaces are documented.&lt;/p&gt;
&lt;p&gt;Data access is strongest within the Databricks environment. Federated queries to external sources work through Unity Catalog&apos;s external connection feature, but the federation depth is more limited than dedicated federation platforms.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Evaluation question to ask:&lt;/strong&gt; What percentage of your data already lives in Databricks or Delta Lake? If it&apos;s over 80%, Genie is a reasonable choice. If you have significant data outside the Databricks ecosystem, the federation gap matters.&lt;/p&gt;
&lt;h2&gt;Tableau Pulse and Tableau AI&lt;/h2&gt;
&lt;p&gt;Tableau&apos;s agentic analytics offering evolved through 2024–2026 into Tableau Pulse and embedded Salesforce AI features. Pulse provides automated metric monitoring with natural language summaries : it&apos;s closer to automated reporting than autonomous investigation.&lt;/p&gt;
&lt;p&gt;Tableau&apos;s AI features are strongest in the visualization and narrative generation layer. The agent produces charts, summarizes trends, and suggests related metrics to explore. It&apos;s less capable at multi-step analytical investigation compared to ThoughtSpot Spotter or Databricks Genie.&lt;/p&gt;
&lt;p&gt;Tableau connects to a wide variety of data sources, but its query translation is engine-specific : it generates queries optimized for the connected data source rather than routing through a single SQL interface. For multi-source queries, you need to pre-join data before it reaches Tableau.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Best fit:&lt;/strong&gt; Teams that want AI-enhanced dashboards and metric monitoring rather than autonomous multi-step investigation.&lt;/p&gt;
&lt;h2&gt;Where Dremio Fits&lt;/h2&gt;
&lt;p&gt;Dremio occupies a different position in this landscape. Rather than layering AI on top of a BI tool or a data platform that owns your data, Dremio&apos;s AI agent sits on top of a federated query engine that reaches your data wherever it lives.&lt;/p&gt;
&lt;p&gt;The architecture difference matters for two reasons:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data access breadth:&lt;/strong&gt; Dremio&apos;s built-in AI agent can query across Iceberg tables, PostgreSQL databases, Snowflake, MongoDB, S3, and dozens of other sources through a single SQL interface. Other tools&apos; agents are limited to data within or closely connected to their own ecosystem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Semantic layer ownership:&lt;/strong&gt; Dremio&apos;s &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/&quot;&gt;semantic layer&lt;/a&gt; : virtual datasets, wikis, labels , lives in the catalog, not in the AI product. When you switch models or agents, the semantic context stays in Dremio and applies to any new agent you connect. No other tool&apos;s semantic configuration is portable in the same way.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Open standards:&lt;/strong&gt; Dremio&apos;s MCP server allows external AI clients (Claude Desktop, ChatGPT, custom Python agents) to connect to Dremio&apos;s environment and use the same semantic context and governance model. You&apos;re not locked into Dremio&apos;s specific agent implementation.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/agentic-semantic-depth-comparison.png&quot; alt=&quot;Agentic analytics semantic depth comparison across platforms&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Evaluation Questions for Your Organization&lt;/h2&gt;
&lt;p&gt;Use these questions to match your requirements to the right platform:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Where does most of your data live? In one platform, or distributed across multiple systems?&lt;/li&gt;
&lt;li&gt;How well-defined are your canonical metrics? Do you have a semantic layer already?&lt;/li&gt;
&lt;li&gt;What analytical tasks do you need the agent to perform autonomously : single queries, multi-step investigations, or proactive monitoring?&lt;/li&gt;
&lt;li&gt;Do your agents need to respect existing access control policies (row-level security, column masking)?&lt;/li&gt;
&lt;li&gt;How important is it to connect external AI tools (ChatGPT, custom agents) to the same data and context?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If your answers point to distributed data, incomplete semantic definitions, and a need for governance integration, an open platform like Dremio that federates across sources and provides an open semantic layer is more flexible than a tool that requires your data to be centralized in its ecosystem.&lt;/p&gt;
&lt;h2&gt;Getting the Most from Any Agentic Analytics Platform&lt;/h2&gt;
&lt;p&gt;Regardless of which platform you choose, agentic analytics works better when you invest in the underlying data quality and documentation. Here&apos;s what matters:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Complete the semantic layer before deploying the agent.&lt;/strong&gt; Every platform performs better when its semantic foundation is solid. Define your canonical metrics before asking the agent to generate them. Document your join paths before expecting correct multi-table queries. An agent deployed on undocumented data produces unreliable results, regardless of the underlying model quality.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Set clear scope boundaries.&lt;/strong&gt; Agents that can access everything tend to go off-script. For production deployments, limit the agent&apos;s table access to the datasets relevant to its use case. A financial reporting agent doesn&apos;t need access to HR data. Scope limitation improves accuracy and reduces governance risk.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Build a testing harness.&lt;/strong&gt; For any agentic analytics deployment, maintain a set of test questions with known correct answers. Run the agent against these questions weekly. Track accuracy over time. When a platform update changes the underlying model or schema changes affect the semantic layer, your test harness will catch regressions before users do.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Plan for misuse.&lt;/strong&gt; Users will eventually try to ask the agent questions it can&apos;t answer reliably : questions outside the defined data scope, questions requiring business context that isn&apos;t in the catalog, or questions where the data simply doesn&apos;t exist. Design the agent&apos;s failure response to be useful: &amp;quot;I don&apos;t have visibility into Q3 2022 data because it&apos;s outside the retention window&amp;quot; is more useful than an incorrect answer that looks plausible.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and compare the agent&apos;s output quality against your current tool on the same analytical questions.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Trustworthy AI in the Agentic Lakehouse: Reconciling Concurrency and Isolation Contracts</title><link>https://iceberglakehouse.com/posts/trustworthy-ai-concurrency-isolation/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/trustworthy-ai-concurrency-isolation/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-05-trustworthy-ai-concurrency-isola...</description><pubDate>Thu, 28 May 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-05-trustworthy-ai-concurrency-isolation/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1&gt;Trustworthy AI in the Agentic Lakehouse: Reconciling Concurrency and Isolation Contracts&lt;/h1&gt;
&lt;p&gt;A single AI agent querying your lakehouse is manageable. A hundred AI agents : running automated monitoring, answering stakeholder questions, generating reports, and powering agentic workflows , create concurrency and isolation problems that traditional data architectures weren&apos;t designed for.&lt;/p&gt;
&lt;p&gt;Human analysts are slow. They ask questions sequentially, pause to think, and rarely trigger more than a handful of concurrent queries against the same table. AI agents are fast and relentless. They can run dozens of queries per minute, issue transactions that interleave with other agents&apos; writes, and hit edge cases in concurrency control that human query patterns never surface.&lt;/p&gt;
&lt;p&gt;This post covers how Iceberg&apos;s optimistic concurrency control handles multi-agent write conflicts, how fine-grained access control prevents agents from accessing data outside their authorization scope, and what guardrail policies need to look like when autonomous systems have SQL access to production data.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/agentic-lakehouse-concurrency.png&quot; alt=&quot;Agentic lakehouse concurrency and isolation architecture&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Iceberg&apos;s Optimistic Concurrency Control for Multi-Agent Writes&lt;/h2&gt;
&lt;p&gt;Apache Iceberg uses optimistic concurrency control (OCC) rather than pessimistic locking. In a pessimistic model, a writer acquires an exclusive lock before starting a write. In an optimistic model, the writer proceeds without a lock and validates at commit time that no conflicting write has occurred since it started.&lt;/p&gt;
&lt;p&gt;The mechanics: every Iceberg table has a current snapshot ID. When an agent begins a write operation, it reads the current snapshot ID. When it&apos;s ready to commit, it attempts to update the table metadata to point to a new snapshot : but only if the current snapshot ID still matches what it read at the start. If another agent committed a conflicting change in the interim, the commit fails with a conflict error.&lt;/p&gt;
&lt;p&gt;The key insight: not all concurrent writes conflict. Iceberg&apos;s conflict detection is partition-aware. Two agents writing to different partitions of the same table can both succeed without conflict. Two agents writing to the same partition conflict if their changes can&apos;t be merged safely.&lt;/p&gt;
&lt;p&gt;For agentic analytics workloads where agents primarily read and occasionally write derived results (not raw source data), OCC provides practical concurrency with minimal blocking. The conflict rate is low when agents write to well-partitioned tables.&lt;/p&gt;
&lt;p&gt;For agentic data pipeline scenarios where multiple agents might update the same partition simultaneously, the conflict rate increases. Design partitioning to minimize overlap between agents&apos; write scopes.&lt;/p&gt;
&lt;h2&gt;Conflict Resolution Strategies&lt;/h2&gt;
&lt;p&gt;When OCC conflicts occur in a multi-agent system, the retry behavior matters.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Exponential backoff with jitter:&lt;/strong&gt; The failed agent waits a random interval before retrying, with the interval growing on each retry. Jitter prevents synchronized retry storms where all conflicting agents retry at exactly the same time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Idempotent writes:&lt;/strong&gt; Design agent write operations so that re-running them after a conflict produces the same result. If an agent computes a metric and writes it to a results table, a retry after a conflict should produce the same metric value, not a duplicate row.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Write-through coordination:&lt;/strong&gt; For scenarios with high write conflict rates, use a coordinator pattern: all agents that need to write to the same table submit writes to a coordinator agent, which serializes them and commits in order. This reduces concurrency but eliminates conflicts.&lt;/p&gt;
&lt;p&gt;The choice depends on your conflict rate. Measure actual conflict rates in production. If conflicts are rare (under 1%), simple exponential backoff is sufficient. If conflicts are frequent, the write pattern needs redesign.&lt;/p&gt;
&lt;h2&gt;Fine-Grained Access Control for AI Agents&lt;/h2&gt;
&lt;p&gt;AI agents should have the minimum access necessary to perform their function. An agent that generates revenue reports doesn&apos;t need access to employee compensation tables. An agent monitoring operational pipelines doesn&apos;t need to read customer PII.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s fine-grained access control (FGAC) enforces these restrictions consistently across all agent connections. An agent&apos;s service principal is bound to a role, and that role defines exactly which tables the agent can read, which columns are visible, and which rows fall within its authorized scope.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Column masking for AI agents:&lt;/strong&gt; When an agent with an analyst role queries a table containing SSNs, the SSN column returns masked values (&lt;code&gt;****-**-1234&lt;/code&gt;). The agent can&apos;t request unmasked data even if it generates SQL that directly references the SSN column : the masking is enforced by the query engine before results are returned.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; An agent responsible for North America operations sees only North American rows, even if it generates a query without a regional filter. The filter is applied automatically based on the agent&apos;s role context.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Time-limited access tokens:&lt;/strong&gt; Agent credentials should expire frequently : hourly or more often for high-privilege agents. Long-lived tokens that are compromised or leaked provide extended unauthorized access. Dremio&apos;s token-based authentication supports short-lived credentials appropriate for agent workloads.&lt;/p&gt;
&lt;h2&gt;Guardrail Policies for Autonomous SQL Access&lt;/h2&gt;
&lt;p&gt;Governance policies for AI agents need to anticipate behaviors that don&apos;t occur with human analysts:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Query cost limits:&lt;/strong&gt; AI agents can generate unexpectedly expensive queries : full table scans without filters, recursive CTEs with large intermediate result sets, or aggregations across billions of rows that a human analyst wouldn&apos;t attempt interactively. Implement query cost estimation limits that reject or queue queries above a compute budget threshold.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Write operation restrictions:&lt;/strong&gt; Most AI agents should be read-only. An agent that can write to production tables can modify data, drop snapshots, or corrupt table state if it generates incorrect write SQL. Use read-only service principals for analytics agents. Restrict write access to pipeline agents with tightly defined write patterns.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Rate limiting per agent:&lt;/strong&gt; An AI agent responding to a wave of stakeholder questions might generate hundreds of queries in a short period. Rate limiting prevents a single agent workload from consuming all available query capacity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Audit log requirements:&lt;/strong&gt; Every query an AI agent executes should be logged with the agent&apos;s identity, the query text, the tables accessed, and the execution result. This is the foundation for compliance audits and for diagnosing incorrect agent behavior after the fact.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Scope constraints in the system prompt:&lt;/strong&gt; For agent implementations that use LLM-based reasoning, the system prompt should explicitly tell the agent which tables it&apos;s authorized to use. Even if access control prevents unauthorized access at the engine level, a clear scope in the prompt prevents the agent from generating queries that will fail, wasting compute and slowing response time.&lt;/p&gt;
&lt;h2&gt;Isolation Contracts: What Each Agent Can See&lt;/h2&gt;
&lt;p&gt;When hundreds of agents run concurrently, isolation means each agent sees a consistent view of the data, unaffected by other agents&apos; concurrent writes.&lt;/p&gt;
&lt;p&gt;Iceberg&apos;s snapshot-based reads provide this automatically. When an agent starts a read, it reads from the current snapshot. Concurrent writes by other agents create new snapshots. The reading agent continues reading from its starting snapshot until its query completes : it never sees partial data from an in-progress write.&lt;/p&gt;
&lt;p&gt;This is the same isolation guarantee that makes Iceberg safe for concurrent human users. It extends naturally to AI agents, regardless of concurrency level.&lt;/p&gt;
&lt;p&gt;The edge case: an agent that starts a long-running investigation (10 minutes) and then writes a derived result to a results table may be writing based on data that&apos;s 10 minutes old. Other agents that have written in the interim have created newer snapshots. The result the long-running agent writes is consistent with its starting state but may not reflect the latest committed data.&lt;/p&gt;
&lt;p&gt;For most analytical workloads, this is acceptable : a 10-minute-old result is still a valid analytical result. For operational dashboards requiring the latest data, keep investigation tasks short enough that the staleness is within tolerance.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/multi-agent-isolation-snapshots.png&quot; alt=&quot;Multi-agent isolation snapshot model diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Testing Concurrency at Scale&lt;/h2&gt;
&lt;p&gt;Before deploying a multi-agent agentic system to production, load test at the expected concurrency level.&lt;/p&gt;
&lt;p&gt;Generate synthetic agent workloads: N agents, each running a realistic mix of investigation queries, metadata lookups, and occasional writes. Run for 30 minutes and measure:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;OCC conflict rate (should be under 5% for well-designed write patterns)&lt;/li&gt;
&lt;li&gt;Query latency percentiles under concurrent load (P50, P95, P99)&lt;/li&gt;
&lt;li&gt;Cache hit rates (Dremio&apos;s C3 cache should show high hit rates for repeated agent query patterns)&lt;/li&gt;
&lt;li&gt;Error rates for rate-limit violations and access control rejections&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Any failure modes discovered during load testing are better discovered in staging than in production. Tune rate limits, retry policies, and write patterns based on the results.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and run your multi-agent concurrency tests against a production-grade agentic lakehouse.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The 2026 Unified Data Architecture: Reconciling Multi-Cloud Data Lakehouses</title><link>https://iceberglakehouse.com/posts/unified-data-architecture-2026/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/unified-data-architecture-2026/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-05-unified-data-architecture-2026/)...</description><pubDate>Thu, 28 May 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-05-unified-data-architecture-2026/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1&gt;The 2026 Unified Data Architecture: Reconciling Multi-Cloud Data Lakehouses&lt;/h1&gt;
&lt;p&gt;Three years ago, &amp;quot;multi-cloud strategy&amp;quot; for data meant maintaining separate warehouses on AWS, Azure, and GCP, then running ETL to sync them. Teams spent more time on pipeline maintenance than on actual analysis. That approach is giving way to something simpler: a shared table format, a unified catalog, and a query engine that reaches across cloud boundaries without moving data.&lt;/p&gt;
&lt;p&gt;The 2026 unified data architecture isn&apos;t a single product. It&apos;s a stack of composable layers held together by open standards. This post explains what those layers are, how they fit together, and where the real complexity lives.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/unified-data-architecture-2026.png&quot; alt=&quot;Multi-cloud unified data architecture 2026 diagram&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Why Multi-Cloud Data Is Still Hard&lt;/h2&gt;
&lt;p&gt;Multi-cloud data deployments exist for several reasons: different cloud regions for data residency compliance, different providers for different workloads (AWS for ML, Azure for enterprise apps), acquisitions that bring in existing infrastructure, or deliberate vendor diversification.&lt;/p&gt;
&lt;p&gt;The problem is that each cloud&apos;s native analytical services are optimized for data stored in that cloud&apos;s own object storage, in that cloud&apos;s preferred format. BigQuery is fast on BigQuery storage. Redshift is fast on Redshift clusters. Neither was designed to read data stored in the other&apos;s format, and both charge egress fees when data moves between clouds.&lt;/p&gt;
&lt;p&gt;The result: organizations running on multiple clouds end up with data silos tied to each cloud, even if the data is conceptually the same dataset.&lt;/p&gt;
&lt;p&gt;Apache Iceberg changes the equation because it stores data in open Parquet files on any S3-compatible object storage. Multiple engines can read the same Iceberg table without copying it. The catalog tracks where the files are. The compute engine comes to the data.&lt;/p&gt;
&lt;h2&gt;The Four Layers of the 2026 Composable Architecture&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Layer 1: Open Storage&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Data lives in object storage : S3, Azure Data Lake Storage, GCS , in Parquet files organized by the Iceberg spec. The storage tier is the cheapest and the most portable. Moving from one object store to another is an infrastructure-level operation, not an application-level one.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 2: Open Table Format&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Apache Iceberg provides the metadata layer: manifest files, snapshot history, partition structure, schema evolution. Any engine that reads Iceberg can read these tables. The format is standardized, open-source, and governed by the Apache Software Foundation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 3: Open Catalog&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The catalog tracks which Iceberg tables exist, where their metadata files live, and what access control policies apply. Apache Polaris is the open-source reference implementation of the Iceberg REST catalog spec. Dremio&apos;s Open Catalog extends Polaris with federated source connections, bringing non-Iceberg data sources (databases, cloud warehouses, object storage) into the same governed namespace.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 4: Query Federation&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The query engine reads from the catalog, finds file locations, reads Parquet data from object storage, and returns results. Dremio&apos;s query federation connects to multiple catalogs and data sources : from different cloud environments , and executes cross-source SQL queries without moving data. Predicate pushdown reduces how much data crosses the network.&lt;/p&gt;
&lt;h2&gt;Zero-ETL Federation: What It Means in Practice&lt;/h2&gt;
&lt;p&gt;Zero-ETL federation means querying data where it lives instead of ingesting it first. Dremio connects directly to a Snowflake schema, an operational PostgreSQL database, an AWS Glue catalog, and an on-premises Iceberg table, then executes a single SQL query that joins data from all four sources.&lt;/p&gt;
&lt;p&gt;The analyst writes one query. Dremio breaks it into source-specific subqueries, pushes predicates to each source, retrieves only the filtered results, and assembles the final answer. No ETL pipeline. No data duplication. No stale data from a batch sync that ran last night.&lt;/p&gt;
&lt;p&gt;The tradeoff: federated queries that span many sources depend on network latency between the query engine and each source. For high-frequency dashboards, Dremio&apos;s Reflections feature creates pre-computed materializations of frequently-queried federated data. The materialization runs at the scheduled refresh interval; the dashboard query reads from the local materialized copy.&lt;/p&gt;
&lt;h2&gt;Multi-Cloud Routing for Cost Efficiency&lt;/h2&gt;
&lt;p&gt;When you separate compute from storage using Iceberg, you can route different workloads to different compute environments without migrating the data.&lt;/p&gt;
&lt;p&gt;Batch ELT runs on spot Spark clusters in the cloud region where the data is stored : no egress. Interactive queries for a business unit in Europe run through a Dremio cluster in an EU region, accessing an EU Iceberg table directly. Global reporting that needs to join EU and US datasets runs through a central Dremio instance with federated connections to both regions.&lt;/p&gt;
&lt;p&gt;Each layer is independently scalable. You don&apos;t pay for idle compute in regions where no active queries are running. You don&apos;t pay for data movement when you can push the query to the data instead.&lt;/p&gt;
&lt;h2&gt;Governance Across Clouds&lt;/h2&gt;
&lt;p&gt;The governance challenge in multi-cloud lakehouses is maintaining consistent access policies across environments that each have their own IAM systems.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s Open Catalog acts as a unified governance layer on top of those environment-specific IAM policies. It enforces role-based access control, fine-grained row and column policies, and audit logging through a single interface, regardless of whether the underlying data lives in AWS, Azure, or on-premises.&lt;/p&gt;
&lt;p&gt;When a compliance team needs to demonstrate that European customer data was never accessed by US-based compute, the Dremio audit log captures every query with the user identity, source table, and execution environment. That&apos;s the audit trail that regulated organizations need.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/&quot;&gt;AI semantic layer&lt;/a&gt; extends this governance model to AI agents: virtual datasets with access control ensure that AI queries against sensitive data go through the same masking and filtering rules as human queries. An AI agent can&apos;t access more data than a human analyst with the same role.&lt;/p&gt;
&lt;h2&gt;AI Agents in the Unified Architecture&lt;/h2&gt;
&lt;p&gt;The unified data architecture is also the prerequisite for reliable agentic analytics. An AI agent that needs to answer business questions across multi-cloud data sources needs three things: a single SQL interface that reaches all sources, a semantic layer that documents what the data means, and access control that limits the agent&apos;s scope to its authorized data.&lt;/p&gt;
&lt;p&gt;All three are properties of the unified architecture described above. The federated query engine provides the single SQL interface. The Open Catalog&apos;s wiki and virtual dataset documentation provides the semantic layer. The RBAC policies enforced by the catalog define the agent&apos;s access scope.&lt;/p&gt;
&lt;p&gt;Without this architecture, AI agents working on multi-cloud data face the same fragmentation problem that human analysts do : needing to switch contexts between tools, manually joining data from separate sources, or relying on ETL pipelines that are always slightly out of date.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;MCP server&lt;/a&gt; exposes the unified catalog to external AI clients, including Claude, ChatGPT, and custom LangChain agents. The agent connects once, sees all authorized sources through a single namespace, and queries across cloud boundaries using standard SQL. The architecture that makes multi-cloud data manageable for human analysts makes it reliable for AI agents.&lt;/p&gt;
&lt;h2&gt;Common Failure Modes in Multi-Cloud Lakehouse Projects&lt;/h2&gt;
&lt;p&gt;Multi-cloud lakehouse projects fail in predictable ways. Understanding them upfront helps you avoid the common traps:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Over-federating without Reflections:&lt;/strong&gt; Federation handles flexible, one-off queries well. For high-frequency dashboard queries that join several large tables, pure federation is too slow. Teams that don&apos;t plan for materialization end up with dashboards that time out or force a return to ETL. Build your Reflections strategy before you roll out dashboards.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ignoring egress costs during design:&lt;/strong&gt; Federated queries avoid data movement in normal operation, but some patterns still trigger egress : cross-region joins where both datasets are large, queries that can&apos;t push predicates effectively, or Reflections that refresh across cloud boundaries. Map your query patterns to egress scenarios before choosing your compute placement strategy.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Skipping the semantic layer:&lt;/strong&gt; The unified architecture solves the data access problem but doesn&apos;t solve the data understanding problem. If your catalog is not documented : no wikis, no column labels, no canonical virtual datasets , your users (human and AI) will query the wrong tables, use inconsistent metric definitions, and lose confidence in results. Build the semantic layer in parallel with the connectivity layer, not after.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Treating the catalog as optional:&lt;/strong&gt; Teams that deploy Dremio for query federation but skip the Open Catalog setup lose the governance and semantic benefits. The catalog is what converts a query engine into a governed data platform. It&apos;s not optional infrastructure for production deployments.&lt;/p&gt;
&lt;h2&gt;Start with One Region, Expand Deliberately&lt;/h2&gt;
&lt;p&gt;The unified multi-cloud architecture works best when you build it incrementally. Start by centralizing access to your existing data through Dremio without moving anything. Connect your data sources as federated connections. Identify the high-value queries that currently require ETL, and test whether Reflections-based materialization gives you acceptable performance.&lt;/p&gt;
&lt;p&gt;Once the pattern holds for one region and a handful of sources, extend it to additional regions and additional sources. The open catalog standard means new sources connect through the same interface regardless of where they live.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and start federating your multi-cloud data sources from day one.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Why Traditional Lakehouses Fail AI Agents: The Mathematical Case for the Agentic Lakehouse</title><link>https://iceberglakehouse.com/posts/why-lakehouses-fail-ai-agents/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/why-lakehouses-fail-ai-agents/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-05-why-lakehouses-fail-ai-agents/)....</description><pubDate>Thu, 28 May 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-05-why-lakehouses-fail-ai-agents/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1&gt;Why Traditional Lakehouses Fail AI Agents: The Mathematical Case for the Agentic Lakehouse&lt;/h1&gt;
&lt;p&gt;When organizations first try connecting an LLM to their data lakehouse, the experience follows a predictable pattern: early demos work surprisingly well, production queries fail in embarrassing ways, and teams spend months debugging why the AI produces confident, plausible, wrong answers.&lt;/p&gt;
&lt;p&gt;The failure isn&apos;t the LLM. The failure is the data architecture.&lt;/p&gt;
&lt;p&gt;A traditional lakehouse was designed for human analysts who bring implicit knowledge to every query session. Column named &lt;code&gt;amt_q3_fy24_usd&lt;/code&gt;? A human analyst asks a colleague what that means. An AI agent does not : it guesses, and its guess is based on statistical patterns from its training data, not the actual business definition at your company.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/traditional-vs-agentic-lakehouse.png&quot; alt=&quot;Traditional lakehouse vs agentic lakehouse for AI agents&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The Common-Sense Context Gap&lt;/h2&gt;
&lt;p&gt;Humans navigate ambiguous schemas through implicit knowledge and social querying. &amp;quot;What does &lt;code&gt;cust_flg&lt;/code&gt; mean?&amp;quot; gets answered in 30 seconds by asking the data owner. AI agents can&apos;t ask questions unless they have a tool specifically designed for that purpose, and most traditional lakehouse setups don&apos;t provide one.&lt;/p&gt;
&lt;p&gt;The LLM fills the gap with statistical inference. It&apos;s seen thousands of database schemas in its training data. &lt;code&gt;cust_flg&lt;/code&gt; probably means customer flag. Which flag? The model&apos;s best guess based on other columns in the context. That guess is right sometimes and wrong in ways that are impossible to detect without knowing the ground truth.&lt;/p&gt;
&lt;p&gt;The problem compounds with scale. A traditional lakehouse with 500 tables has hundreds of ambiguous column names, inconsistent naming conventions across teams, multiple definitions of the same business concept in different tables, and no machine-readable documentation of which tables are authoritative vs. deprecated.&lt;/p&gt;
&lt;p&gt;When an AI agent explores this environment, its probability of generating correct SQL decreases with each additional ambiguity it encounters. A query that requires joining three tables, each with naming inconsistencies and undocumented column semantics, has a much lower probability of being correct than a query against a single, well-documented table.&lt;/p&gt;
&lt;h2&gt;The Semantic Barrier: A Probabilistic View&lt;/h2&gt;
&lt;p&gt;Let&apos;s formalize this with a simple model. Assume each column in your schema has a probability P(correct) of being interpreted correctly by an AI agent without additional context. For a well-named, self-explanatory column like &lt;code&gt;order_date&lt;/code&gt;, P(correct) might be 0.98. For a poorly named column like &lt;code&gt;flag_3&lt;/code&gt;, P(correct) might be 0.40.&lt;/p&gt;
&lt;p&gt;A SQL query that involves N column references has a combined correctness probability of approximately:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;P(query correct) ≈ P(col_1) × P(col_2) × ... × P(col_N)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For a 10-column query with an average per-column probability of 0.80:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;P(query correct) ≈ 0.80^10 ≈ 0.107
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That&apos;s roughly a 10% chance the full query is interpreted correctly. For complex analytical queries with 20+ column references across multiple tables, the probability approaches zero.&lt;/p&gt;
&lt;p&gt;This is the mathematical case for why putting an AI agent directly on top of a traditional lakehouse produces unreliable results. The agent isn&apos;t making single mistakes : it&apos;s making compounding probabilistic errors across every ambiguous element it encounters.&lt;/p&gt;
&lt;h2&gt;The Semantic Barrier in Practice&lt;/h2&gt;
&lt;p&gt;Three specific failure modes appear repeatedly when AI agents encounter traditional lakehouses:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Metric inconsistency:&lt;/strong&gt; Your marketing team defined &amp;quot;active user&amp;quot; as anyone who logged in this month. Your analytics team defined it as anyone who placed an order this month. The lakehouse has tables from both teams. The agent picks one definition arbitrarily and uses it consistently : consistently wrong for half of your stakeholders.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Stale table selection:&lt;/strong&gt; The lakehouse has both &lt;code&gt;orders_v1&lt;/code&gt; and &lt;code&gt;orders_v2&lt;/code&gt; tables. v2 replaced v1 eight months ago. Without documentation indicating which is current, the agent sometimes uses v1 (which still has more historical data), producing counts that include duplicate records from the migration period.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Type coercion errors:&lt;/strong&gt; A column defined as VARCHAR actually contains dates in &lt;code&gt;YYYY-MM-DD&lt;/code&gt; format. The agent generates a date filter without explicit CAST, which works in some SQL dialects and silently returns all rows in others.&lt;/p&gt;
&lt;p&gt;Each of these produces a confident, wrong answer. The agent doesn&apos;t know it&apos;s wrong : it didn&apos;t hallucinate from nowhere, it made a reasonable inference from the context available. The context was insufficient.&lt;/p&gt;
&lt;h2&gt;What an Agentic Lakehouse Provides&lt;/h2&gt;
&lt;p&gt;The agentic lakehouse solves the common-sense context gap by building business context into the data layer, not into the agent&apos;s prompt.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Virtual datasets as canonical definitions:&lt;/strong&gt; Instead of exposing raw tables to the agent, an agentic lakehouse exposes SQL-defined views that encode business logic. &amp;quot;Active users&amp;quot; is a virtual dataset with the correct, agreed-upon definition. The agent queries the virtual dataset, and the underlying logic is correct by construction.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Wikis and labels as machine-readable documentation:&lt;/strong&gt; Each table and column has structured documentation: what it contains, what business concept it maps to, whether it&apos;s the authoritative source or a derivative, what PII classification it carries. The agent queries this documentation when it needs context, rather than inferring from column names.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Medallion architecture for trust levels:&lt;/strong&gt; Bronze tables are raw and untrusted. Silver tables are cleaned, typed, and business-ready. Gold tables are purpose-built aggregations. The agent knows which tier to use for which type of question. It doesn&apos;t have to guess which &lt;code&gt;orders&lt;/code&gt; table is the authoritative one.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fine-grained access control:&lt;/strong&gt; The agent can only access what its role permits. PII columns are masked. Rows outside the agent&apos;s authorized scope are filtered. The same governance model that applies to human analysts applies to AI agents automatically.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s &lt;a href=&quot;https://www.dremio.com/blog/the-ai-foundation-of-the-agentic-lakehouse/&quot;&gt;Agentic Lakehouse&lt;/a&gt; implements all four components. The semantic layer provides virtual datasets and documented metadata. The AI agent consults the semantic layer before generating queries. The query federation engine connects to all data sources through the same governed interface.&lt;/p&gt;
&lt;h2&gt;The Mathematical Improvement&lt;/h2&gt;
&lt;p&gt;Returning to the probabilistic model: a semantic layer with rich, accurate documentation changes the per-column correctness probabilities dramatically.&lt;/p&gt;
&lt;p&gt;For columns exposed through well-documented virtual datasets, P(correct) approaches 0.99 : because the column name corresponds to a canonical business definition the agent can look up.&lt;/p&gt;
&lt;p&gt;For a 10-column query where every column comes from documented virtual datasets:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;P(query correct) ≈ 0.99^10 ≈ 0.904
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That&apos;s a move from ~10% to ~90% reliability for complex queries : achieved not by improving the LLM, but by improving the data layer it operates on.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/semantic-layer-accuracy-improvement.png&quot; alt=&quot;Semantic layer probability improvement for AI agent accuracy&quot;&gt;&lt;/p&gt;
&lt;p&gt;The model is simplified : real correctness probabilities depend on many factors , but the directional effect is real. The investment in semantic layer documentation pays directly in AI agent accuracy.&lt;/p&gt;
&lt;h2&gt;Starting the Transition&lt;/h2&gt;
&lt;p&gt;The transition from a traditional lakehouse to an agentic lakehouse is incremental. You don&apos;t need to document every table before you start.&lt;/p&gt;
&lt;p&gt;Begin with the 10–20 most frequently queried datasets. Create virtual datasets with canonical business logic. Write wiki documentation for every column in those datasets. Classify PII columns. Test the agent against those datasets and measure whether its accuracy improves.&lt;/p&gt;
&lt;p&gt;Then expand to the next tier of datasets. Each documented dataset extends the reliable scope of the agent.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and start building the semantic layer that makes your AI agents reliable.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The Era of Zero-ETL Federation: Fueling AI Agents with Real-Time Cross-Enterprise Data</title><link>https://iceberglakehouse.com/posts/zero-etl-federation-ai-agents/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/zero-etl-federation-ai-agents/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-05-zero-etl-federation-ai-agents/)....</description><pubDate>Thu, 28 May 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-05-zero-etl-federation-ai-agents/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1&gt;The Era of Zero-ETL Federation: Fueling AI Agents with Real-Time Cross-Enterprise Data&lt;/h1&gt;
&lt;p&gt;ETL pipelines were the right answer in 2010. You pulled data from operational systems nightly, transformed it, and loaded it into the warehouse. Analysts got yesterday&apos;s data by 8 AM. The tradeoff was acceptable.&lt;/p&gt;
&lt;p&gt;The tradeoff is no longer acceptable for agentic analytics. An AI agent investigating a revenue anomaly needs today&apos;s Salesforce opportunity data, not last night&apos;s batch. An agent monitoring supply chain risk needs the current inventory system state, not a snapshot from six hours ago. Batch ETL introduces a latency floor that makes certain classes of analytical questions unanswerable in time to act on them.&lt;/p&gt;
&lt;p&gt;Zero-ETL federation is the answer: the query engine reaches the data where it lives, when you need it, without a separate pipeline copying it first.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/zero-etl-federation-architecture.png&quot; alt=&quot;Zero-ETL federation architecture for AI agents&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Why Batch ETL Limits Agentic Analytics&lt;/h2&gt;
&lt;p&gt;Traditional ETL follows a fixed cadence. Data lands in the analytical system hours or days after it was created in the operational system. That gap is the latency floor : the minimum time between an event happening and an agent being able to analyze it.&lt;/p&gt;
&lt;p&gt;For retrospective analysis (what happened last quarter), batch ETL is fine. The latency floor doesn&apos;t matter when the questions are about history.&lt;/p&gt;
&lt;p&gt;For agentic analytics focused on current business conditions, batch ETL creates real problems:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Stale anomaly detection:&lt;/strong&gt; An AI agent monitoring revenue might detect an anomaly in yesterday&apos;s data that actually resolved itself six hours ago. The agent&apos;s investigation and alert are based on a state that no longer exists.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Incomplete cross-system joins:&lt;/strong&gt; When a customer churn prediction agent needs to join support ticket data (from a real-time source) with purchase history (in the lakehouse), a batch ETL approach requires either waiting for the next batch or running two separate analyses that can&apos;t be easily joined.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Decision lag:&lt;/strong&gt; Agents supporting operational decisions : sales prioritization, inventory allocation, customer routing , need current data to produce actionable recommendations. A 12-hour lag in the underlying data produces recommendations that are 12 hours behind reality.&lt;/p&gt;
&lt;h2&gt;The Federation Architecture&lt;/h2&gt;
&lt;p&gt;Zero-ETL federation routes queries directly to source systems at query time. The query engine acts as the router : it receives a SQL query from an AI agent, identifies which tables come from which sources, sends source-specific subqueries to each source, and assembles the results.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s federation architecture connects to sources through source-specific connectors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Relational databases:&lt;/strong&gt; PostgreSQL, MySQL, Oracle, SQL Server : queried via JDBC with predicate pushdown&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cloud warehouses:&lt;/strong&gt; Snowflake, BigQuery, Redshift : queried through their native query APIs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Object storage:&lt;/strong&gt; S3, GCS, ADLS : Iceberg or raw Parquet, queried directly&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SaaS systems:&lt;/strong&gt; Salesforce, Zendesk, HubSpot : queried through API-based connectors&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streaming systems:&lt;/strong&gt; Kafka, Kinesis : queried through streaming-to-SQL adapters&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each source appears in Dremio&apos;s unified namespace. An agent queries the namespace without needing to know which underlying system holds which data.&lt;/p&gt;
&lt;h2&gt;Predicate Pushdown: The Key to Federation Performance&lt;/h2&gt;
&lt;p&gt;Without optimization, federated queries are slow. A naive approach reads all rows from every source, transfers them to the query engine, and performs joins in memory. For large operational tables, this is impractical.&lt;/p&gt;
&lt;p&gt;Predicate pushdown changes the pattern: the query engine analyzes the SQL, identifies filter conditions that can be applied at the source, and sends those filters as part of the source-specific subquery. The source returns only the rows that match the filter.&lt;/p&gt;
&lt;p&gt;For a query that joins Salesforce opportunities with Iceberg revenue data, filtered to opportunities created in the last 7 days in the North America region:&lt;/p&gt;
&lt;p&gt;Without pushdown:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fetch all Salesforce opportunities (millions of rows)&lt;/li&gt;
&lt;li&gt;Fetch all revenue data (terabytes)&lt;/li&gt;
&lt;li&gt;Apply filters in Dremio&apos;s memory&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With pushdown:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Send to Salesforce: &lt;code&gt;SELECT * FROM opportunities WHERE created_date &amp;gt; &apos;2026-05-21&apos; AND region = &apos;NA&apos;&lt;/code&gt; : returns thousands of rows&lt;/li&gt;
&lt;li&gt;Send to Iceberg: &lt;code&gt;SELECT * FROM revenue WHERE date &amp;gt; &apos;2026-05-21&apos; AND region = &apos;NA&apos;&lt;/code&gt; : reads only matching partitions&lt;/li&gt;
&lt;li&gt;Join the small result sets in Dremio&apos;s memory&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Predicate pushdown can reduce the data volume transferred by 99%+ for well-filtered queries. The query runs in seconds instead of minutes.&lt;/p&gt;
&lt;p&gt;Dremio pushes predicates to all source types it supports. The effectiveness depends on the source system&apos;s ability to process the predicate efficiently. Columnar sources like Parquet files and columnar databases handle pushdown very well. Row-oriented databases handle simple equality and range predicates well, but complex analytical predicates may be partially pushed.&lt;/p&gt;
&lt;h2&gt;Real-Time CRM and Lakehouse Joins: A Practical Pattern&lt;/h2&gt;
&lt;p&gt;The most common zero-ETL federation use case in agentic analytics is joining current CRM data with historical lakehouse data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; An agent is investigating why churned customers in Q1 had lower lifetime value than churned customers in previous quarters.&lt;/p&gt;
&lt;p&gt;The agent needs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Current account status and churn date from Salesforce (real-time)&lt;/li&gt;
&lt;li&gt;Historical purchase history from the Iceberg lakehouse (historical)&lt;/li&gt;
&lt;li&gt;Customer support ticket history from Zendesk (real-time)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With zero-ETL federation, the agent writes one query:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  s.account_id,
  s.churn_date,
  s.contract_value,
  SUM(r.revenue_usd) AS lifetime_revenue,
  COUNT(DISTINCT z.ticket_id) AS support_tickets_lifetime,
  AVG(z.resolution_days) AS avg_resolution_days
FROM salesforce.accounts s
  JOIN datalake.canonical_revenue r ON s.account_id = r.customer_id
  JOIN zendesk.tickets z ON s.account_id = z.account_id
WHERE s.status = &apos;churned&apos;
  AND s.churn_date BETWEEN &apos;2026-01-01&apos; AND &apos;2026-03-31&apos;
GROUP BY 1, 2, 3
ORDER BY lifetime_revenue DESC
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Dremio executes this against all three sources simultaneously, applying date and status filters at each source, and joins the results. The agent gets a complete cross-system analysis without any pipeline building.&lt;/p&gt;
&lt;p&gt;The same query run yesterday in a batch ETL model would have required: a Salesforce extraction job, a Zendesk extraction job, waiting for both to land in the warehouse, and then running the analysis on potentially 24-hour-old data.&lt;/p&gt;
&lt;h2&gt;The Tradeoffs: When Federation Works and When It Doesn&apos;t&lt;/h2&gt;
&lt;p&gt;Zero-ETL federation is not the right answer for every data integration pattern.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Works well for:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Current-state analysis where the operational system data is the freshest available&lt;/li&gt;
&lt;li&gt;Low-to-medium volume operational joins (millions of rows, not billions)&lt;/li&gt;
&lt;li&gt;Exploratory or investigative queries run by agentic systems&lt;/li&gt;
&lt;li&gt;Cases where pipeline maintenance cost exceeds the query execution overhead&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Works less well for:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;High-frequency repetitive queries against slow source systems (federation overhead adds up)&lt;/li&gt;
&lt;li&gt;Very large volume joins where pushdown can&apos;t reduce data sufficiently&lt;/li&gt;
&lt;li&gt;Source systems with strict rate limits or connection limits that batch queries would exhaust&lt;/li&gt;
&lt;li&gt;Complex transformations that need to run at ingestion time, not at query time&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dremio&apos;s Reflections bridge some of the gap between federation and performance. When a federated query pattern runs frequently, create a Reflection that pre-computes it on a refresh schedule. The agent&apos;s interactive queries hit the Reflection; the source system isn&apos;t queried on every request.&lt;/p&gt;
&lt;p&gt;Hybrid architectures are the norm: real-time federation for current operational state, pre-computed Reflections for high-frequency analytical patterns, and Iceberg tables for historical data that has been cleaned and enriched.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/may28seo/zero-etl-vs-etl-decision.png&quot; alt=&quot;Zero-ETL vs traditional ETL federation decision matrix&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Federation as the AI Agent&apos;s Data Access Layer&lt;/h2&gt;
&lt;p&gt;For agentic analytics, federation is the mechanism that makes the AI agent a real-time analytical system rather than a historical one. The agent&apos;s ability to answer &amp;quot;what is happening right now&amp;quot; depends on federation reaching current operational systems. Its ability to answer &amp;quot;what happened in the past&amp;quot; depends on the historical lakehouse.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s &lt;a href=&quot;https://www.dremio.com/blog/why-agentic-analytics-requires-federation-virtualization-and-the-lakehouse-how-dremio-delivers/&quot;&gt;query federation&lt;/a&gt; connects both layers through the same interface, with the same semantic layer providing business context for both real-time and historical data. The agent doesn&apos;t need to know which source has which data : the unified namespace handles that routing.&lt;/p&gt;
&lt;p&gt;The era of zero-ETL federation means AI agents can answer business questions about current conditions and historical context in a single investigation, without waiting for a batch pipeline to complete.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and run your first zero-ETL federated query against your operational systems.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Use Hermes Agent for Free With DeepSeek V4 and Slack</title><link>https://iceberglakehouse.com/posts/2026-05-hermes-agent-free-deepseek-setup/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-hermes-agent-free-deepseek-setup/</guid><description>
Most AI agent frameworks lock you into a paid model. Claude Code needs an Anthropic subscription. Codex needs an OpenAI plan. Cursor costs $20 a mont...</description><pubDate>Mon, 25 May 2026 12:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Most AI agent frameworks lock you into a paid model. Claude Code needs an Anthropic subscription. Codex needs an OpenAI plan. Cursor costs $20 a month. Hermes Agent from Nous Research works differently: it is a fully open-source agent framework that lets you plug in any inference provider you want.&lt;/p&gt;
&lt;p&gt;That means you can run a capable AI coding agent for exactly zero dollars by pointing it at DeepSeek V4 through the Nous Portal. And if you add Slack integration, you can talk to that agent from your phone, your browser, or wherever your team already chats.&lt;/p&gt;
&lt;p&gt;Here is how to do both.&lt;/p&gt;
&lt;h2&gt;What Hermes Agent Is&lt;/h2&gt;
&lt;p&gt;Hermes Agent is an open-source framework in the same category as Claude Code and OpenAI Codex. It runs in your terminal, answers questions, executes shell commands, edits files, searches the web, and delegates subtasks to child agents. What makes it different from the paid alternatives:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Skills.&lt;/strong&gt; When Hermes solves a complex problem or learns a workflow, it can save that knowledge as a reusable skill. The next time you ask it to do something similar, it loads the skill and picks up where it left off. Over time, it gets better at your specific work without you teaching it the same thing twice.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Memory.&lt;/strong&gt; It remembers who you are, your preferences, and your environment across sessions. You do not have to reintroduce your project structure or tooling every time you start a new conversation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Multi-platform gateway.&lt;/strong&gt; The same agent that runs in your terminal can also run on Slack, Telegram, Discord, WhatsApp, and a dozen other platforms. You use the same tools and the same session history from every interface.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Provider-agnostic.&lt;/strong&gt; Hermes works with 20+ inference providers. You can switch from DeepSeek to Claude to a local model mid-workflow. One config change, no architecture change.&lt;/p&gt;
&lt;p&gt;Every one of these features is free because the agent software itself is open source. You only pay for the model tokens, and the setup below shows you how to get those for free too.&lt;/p&gt;
&lt;h2&gt;Two Free Paths to DeepSeek V4&lt;/h2&gt;
&lt;p&gt;DeepSeek V4 Flash is a competitive reasoning and coding model. Through the right provider, you can access it without spending money.&lt;/p&gt;
&lt;h3&gt;Path 1: Nous Portal (recommended)&lt;/h3&gt;
&lt;p&gt;Nous Research, the same team that builds Hermes, runs the Nous Portal. It is a unified inference gateway that proxies models from across the ecosystem. One OAuth login gives you access to DeepSeek, Claude, GPT, Gemini, Qwen, and 300 other models, all billed against a single subscription.&lt;/p&gt;
&lt;p&gt;The free path uses DeepSeek V4 Flash through the Nous inference API. You do not need a paid subscription for this specific model. Set it up in two ways:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;One-command setup:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;hermes setup --portal
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That runs the Portal OAuth flow, sets Nous as your inference provider in &lt;code&gt;config.yaml&lt;/code&gt;, and configures the gateway. You are ready to chat immediately after.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Manual config:&lt;/strong&gt; If you already have credentials, set these values in &lt;code&gt;~/.hermes/.env&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;NOUS_API_KEY=your_key_here
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And in &lt;code&gt;~/.hermes/config.yaml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;model:
  default: deepseek/deepseek-v4-flash:free
  provider: nous
  base_url: https://inference-api.nousresearch.com/v1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run &lt;code&gt;hermes chat&lt;/code&gt; and you are talking to DeepSeek V4 through a free inference endpoint.&lt;/p&gt;
&lt;h3&gt;Path 2: OpenCode Zen&lt;/h3&gt;
&lt;p&gt;OpenCode Zen is a curated model marketplace that provides access to tested frontier models including GPT, Claude, Gemini, and others. It is pay-as-you-go priced but has free tier access for evaluation.&lt;/p&gt;
&lt;p&gt;To use it with Hermes, add to &lt;code&gt;~/.hermes/.env&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;OPENCODE_ZEN_API_KEY=your_key_here
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And in config.yaml:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;model:
  default: gpt-4o
  provider: opencode-zen
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;OpenCode Zen is a solid alternative if you want access to OpenAI or Anthropic models without managing separate API keys. For purely free inference, the Nous Portal path is simpler and more direct.&lt;/p&gt;
&lt;h2&gt;How to Configure the Slack Gateway&lt;/h2&gt;
&lt;p&gt;Once your Hermes agent is running in the terminal, adding Slack integration takes about 10 minutes. The agent uses Socket Mode, which means it connects through a WebSocket instead of a public HTTP endpoint. That is important because it works behind firewalls, on your laptop, or on a private server without opening ports.&lt;/p&gt;
&lt;h3&gt;Step 1: Create a Slack App&lt;/h3&gt;
&lt;p&gt;Go to &lt;code&gt;api.slack.com/apps&lt;/code&gt; and click &lt;strong&gt;Create New App&lt;/strong&gt;. Choose &lt;strong&gt;From Scratch&lt;/strong&gt;, give it a name, and select your workspace.&lt;/p&gt;
&lt;h3&gt;Step 2: Enable Socket Mode&lt;/h3&gt;
&lt;p&gt;In the app settings, navigate to &lt;strong&gt;Socket Mode&lt;/strong&gt; and toggle it on. You will be prompted to create an App-Level Token. Do that and copy the token that starts with &lt;code&gt;xapp-&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Step 3: Add Bot Token Scopes&lt;/h3&gt;
&lt;p&gt;Go to &lt;strong&gt;OAuth &amp;amp; Permissions&lt;/strong&gt; and add these Bot Token Scopes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;channels:history&lt;/code&gt; - read channel messages&lt;/li&gt;
&lt;li&gt;&lt;code&gt;channels:read&lt;/code&gt; - see channel metadata&lt;/li&gt;
&lt;li&gt;&lt;code&gt;chat:write&lt;/code&gt; - send messages&lt;/li&gt;
&lt;li&gt;&lt;code&gt;app_mentions:read&lt;/code&gt; - detect when the bot is @mentioned&lt;/li&gt;
&lt;li&gt;&lt;code&gt;users:read&lt;/code&gt; - look up user info&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 4: Subscribe to Events&lt;/h3&gt;
&lt;p&gt;Under &lt;strong&gt;Event Subscriptions&lt;/strong&gt;, enable events. Then add these bot events:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;message.channels&lt;/code&gt; - required for the bot to see messages in public channels&lt;/li&gt;
&lt;li&gt;&lt;code&gt;app_mention&lt;/code&gt; - respond to direct @mentions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Without &lt;code&gt;message.channels&lt;/code&gt;, the bot will only see messages in DMs.&lt;/p&gt;
&lt;h3&gt;Step 5: Install the App&lt;/h3&gt;
&lt;p&gt;Click &lt;strong&gt;Install to Workspace&lt;/strong&gt; under OAuth &amp;amp; Permissions. Copy the Bot Token that starts with &lt;code&gt;xoxb-&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Step 6: Set Env Vars&lt;/h3&gt;
&lt;p&gt;Add these to &lt;code&gt;~/.hermes/.env&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SLACK_BOT_TOKEN=xoxb-your-bot-token
SLACK_APP_TOKEN=xapp-your-app-token
SLACK_ALLOWED_USERS=U0XXXXXX
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;allowed_users&lt;/code&gt; field is a comma-separated list of Slack user IDs. Only users in this list can interact with the bot.&lt;/p&gt;
&lt;h3&gt;Step 7: Run the Gateway&lt;/h3&gt;
&lt;p&gt;The fast way to test:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;hermes gateway run
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For a permanent setup that survives reboots:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;hermes gateway install
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That installs it as a systemd service. The gateway starts automatically and reconnects if the WebSocket drops.&lt;/p&gt;
&lt;h2&gt;What You Get With Slack Integration&lt;/h2&gt;
&lt;p&gt;Once the gateway is running, your Slack workspace has a permanent AI agent with full tool access. From Slack you can:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use Hermes from anywhere.&lt;/strong&gt; Your phone, your browser, your desktop - any device with the Slack app. No terminal required.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Collaborate in teams.&lt;/strong&gt; Share the bot with your team. Everyone in the allowed users list can assign it tasks, ask questions, or request code reviews from the same agent.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Full tool access.&lt;/strong&gt; The Slack interface is not a pared-down chatbot. It has the same toolset as the terminal version: file editing, terminal commands, web research, cron job scheduling, and subagent delegation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Persistent sessions.&lt;/strong&gt; Walk away from a conversation, come back on another device, and pick up where you left off. The session state is preserved in the gateway.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Zero infrastructure.&lt;/strong&gt; Because it uses Socket Mode, you do not need a public URL, a load balancer, or any cloud infrastructure. A laptop or a $5 VPS is sufficient.&lt;/p&gt;
&lt;h2&gt;Tradeoffs and Limitations&lt;/h2&gt;
&lt;p&gt;The free DeepSeek V4 Flash endpoint is a single model with rate limits. If you hit the ceiling, the agent returns an error instead of a response. You can work around this by adding a fallback provider in config.yaml - the agent will retry with a different model automatically.&lt;/p&gt;
&lt;p&gt;The Slack bot only responds when the gateway process is running. If your laptop goes to sleep or your server goes down, the bot goes quiet until it comes back. For 24/7 availability, deploy the gateway on a cheap always-on machine (a Raspberry Pi, an old laptop, or a $5 DigitalOcean droplet).&lt;/p&gt;
&lt;p&gt;Setting up the Slack App requires navigating Slack&apos;s API console, which has a reputation for confusing UX. The steps above cover the critical ones. If you miss &lt;code&gt;message.channels&lt;/code&gt;, the bot will appear to be online but will never see messages in public channels.&lt;/p&gt;
&lt;h2&gt;Recommended Approach&lt;/h2&gt;
&lt;p&gt;Start with the Nous Portal path. Run &lt;code&gt;hermes setup --portal&lt;/code&gt;, pick DeepSeek V4 Flash, and verify it works with &lt;code&gt;hermes chat&lt;/code&gt;. Use &lt;code&gt;hermes doctor&lt;/code&gt; to check that everything is healthy.&lt;/p&gt;
&lt;p&gt;Once the terminal workflow is solid, add the Slack gateway. Create the Slack App, set the env vars, and run &lt;code&gt;hermes gateway run&lt;/code&gt; to confirm the WebSocket connects. Then install it as a service with &lt;code&gt;hermes gateway install&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The total setup time is under 30 minutes, and the ongoing cost is zero. You get an AI agent with persistent memory, reusable skills, multi-platform access, and full system-level tooling, all running on free inference.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Automating Table Maintenance Before Small Files Accumulate</title><link>https://iceberglakehouse.com/posts/2026-05-24-automating-table-maintenance/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-24-automating-table-maintenance/</guid><description>
Table maintenance is one of those problems that feels manageable until it isn&apos;t. You run compaction manually when query performance degrades, schedul...</description><pubDate>Sun, 24 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Table maintenance is one of those problems that feels manageable until it isn&apos;t. You run compaction manually when query performance degrades, schedule a VACUUM job after reports of slow planning times, and generally treat maintenance as reactive work. Then streaming pipelines arrive, partition counts multiply, and the files-per-partition metric climbs past the threshold where ad-hoc fixes stop working.&lt;/p&gt;
&lt;p&gt;The industry has moved on from treating compaction as an optional afterthought. Databricks made Predictive Optimization the default behavior for Unity Catalog managed tables in 2025. AWS S3 Tables provides continuous, automatic compaction for table bucket-stored Iceberg tables, and reduced processing fees for those operations by up to 90% in July 2025. As a result, manual maintenance scheduling is becoming the exception, not the norm.&lt;/p&gt;
&lt;p&gt;This post covers what table maintenance actually does, why the small-file problem is specifically painful for Iceberg, and how the current generation of automated and policy-driven approaches changes the operational model.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;What the Small File Problem Actually Costs&lt;/h2&gt;
&lt;p&gt;Every write to an Apache Iceberg table creates new data files. A batch ETL job that appends 1 GB of data might create 8 x 128 MB Parquet files, reasonable. A Flink streaming job with a 5-minute checkpoint interval writing to 20 partitions creates at least 20 files every 5 minutes. Over 24 hours, that&apos;s 5,760 files, none of which are large enough to be efficient for columnar analytics.&lt;/p&gt;
&lt;p&gt;The cost is not primarily storage. S3 pricing at scale makes small files a storage non-issue. The cost is query planning and scan performance.&lt;/p&gt;
&lt;p&gt;Iceberg query planners read manifest files to determine which data files are relevant to a query. Each manifest entry is a file reference with column-level statistics (min/max values, null counts). When a planner needs to determine which files might contain rows matching a predicate, it scans manifest entries. With 100 files per partition, this is fast. With 10,000 files per partition, the metadata scan itself becomes the bottleneck; often adding seconds to planning time even before a single data byte is read.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/automating-table-maintenance/small-files-query-performance-impact.png&quot; alt=&quot;Bar chart showing query planning overhead increasing from 1x at 10 files per partition to 9.5x at 10,000 files, with performance threshold at 2x indicating when compaction is needed&quot;&gt;&lt;/p&gt;
&lt;p&gt;The secondary cost is snapshot accumulation. Every commit to an Iceberg table creates a new snapshot, which references the current manifest list. Streaming pipelines create hundreds of snapshots per day. Without regular snapshot expiration, the metadata tree grows indefinitely, slowing time-travel queries and increasing catalog scan times.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Four Operations of Iceberg Maintenance&lt;/h2&gt;
&lt;p&gt;Iceberg table maintenance is four distinct operations, each addressing a different part of the file and metadata accumulation problem.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/automating-table-maintenance/iceberg-table-maintenance-lifecycle.png&quot; alt=&quot;Circular lifecycle diagram showing Iceberg table maintenance stages: Data Ingestion, File Accumulation, Compaction (RewriteDataFiles), and Snapshot Expiration (VACUUM) cycling through Iceberg Table Health at center&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;RewriteDataFiles (Compaction).&lt;/strong&gt; This is the core operation. It reads multiple small Parquet files from a table partition and rewrites them as fewer, larger files. The operation is transparent to readers: Iceberg commits the new files and the removal of the originals as an atomic snapshot update. The table remains readable throughout. The standard target file size is 128 MB to 256 MB, which balances read efficiency against write amplification.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;RewriteManifests.&lt;/strong&gt; Over time, manifest files accumulate entries for both live and expired files. This operation rewrites the manifest list to clean up stale entries and rebalance entry distribution. It&apos;s cheaper than data compaction but often overlooked. Manifest rewriting reduces planning overhead independent of data file sizes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ExpireSnapshots.&lt;/strong&gt; This removes snapshot metadata that is no longer accessible for time-travel queries within your retention window. Critically, it doesn&apos;t actually delete the underlying data files, that happens in the next step. Snapshot expiration removes the pointer to old file sets, not the files themselves.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DeleteOrphanFiles.&lt;/strong&gt; This removes data files that are no longer referenced by any snapshot. Orphaned files accumulate from failed or partial writes. Running this operation periodically ensures storage doesn&apos;t silently grow from write failures.&lt;/p&gt;
&lt;p&gt;The recommended maintenance sequence is: compact first, expire snapshots second, delete orphans last. Running in this order ensures compaction completes before you remove the snapshot history that the compacted files are referenced by.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Databricks Predictive Optimization: Autonomous Maintenance&lt;/h2&gt;
&lt;p&gt;Databricks&apos; Predictive Optimization changes the maintenance model from &amp;quot;schedule jobs and hope&amp;quot; to &amp;quot;define policy and let the platform decide.&amp;quot;&lt;/p&gt;
&lt;p&gt;Enabled by default for Unity Catalog managed tables in 2025, Predictive Optimization monitors query patterns and table statistics using Databricks&apos; platform intelligence layer. Rather than running compaction on a fixed schedule regardless of whether a table actually needs it, the system analyzes each table&apos;s write volume, query frequency, and file count metrics to decide when maintenance provides sufficient performance benefit to justify the compute cost.&lt;/p&gt;
&lt;p&gt;The operations it manages automatically are OPTIMIZE (compaction), VACUUM (snapshot expiration and orphan cleanup), and ANALYZE (statistics refresh for query planning). All three run asynchronously using serverless compute, independent of any job the team is actively running.&lt;/p&gt;
&lt;p&gt;For Unity Catalog managed tables, enabling Predictive Optimization requires no manual configuration after the feature is enabled at the metastore level. For tables where you want explicit control, you can exclude specific tables from automatic optimization:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Enable predictive optimization for a table explicitly
ALTER TABLE my_catalog.my_schema.events
SET TBLPROPERTIES (&apos;delta.enableAutoOptimize&apos; = &apos;true&apos;);

-- Or disable for tables where you manage maintenance manually
ALTER TABLE my_catalog.my_schema.manual_table
SET TBLPROPERTIES (&apos;delta.enableAutoOptimize&apos; = &apos;false&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Predictive Optimization is distinct from Auto Compaction. Auto Compaction runs on the same cluster performing writes, merging small files immediately after they are created. Predictive Optimization is a background service that analyzes and acts asynchronously. Both can run simultaneously. The combination addresses both the immediate small-file production problem (Auto Compaction) and the longer-term layout and statistics management problem (Predictive Optimization).&lt;/p&gt;
&lt;p&gt;The tradeoff: because Databricks is making the maintenance decision rather than the platform team, you have less visibility into when compaction runs and which tables are being prioritized. The system provides audit logs, but teams accustomed to explicit maintenance job monitoring may find the autonomous model less transparent than they&apos;d like.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;AWS S3 Tables: Managed Maintenance for Table Buckets&lt;/h2&gt;
&lt;p&gt;Amazon S3 Tables provides fully managed Apache Iceberg table storage where compaction and cleanup are built into the storage service rather than delegated to the compute engine. When you store Iceberg tables in S3 Table Buckets (not standard S3 buckets), AWS runs compaction continuously using three strategies:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Binpack compaction:&lt;/strong&gt; The default. Merges files targeting a configurable size, typically 128 MB to 512 MB. This is the standard compaction approach equivalent to Iceberg&apos;s &lt;code&gt;RewriteDataFiles&lt;/code&gt; with the binpack strategy.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sort compaction:&lt;/strong&gt; Applied automatically when a sort order is defined in the table metadata. Organizes data within files by the sort column, improving predicate pushdown performance for range queries on sorted columns.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Z-order compaction:&lt;/strong&gt; Enabled through the &lt;code&gt;put-table-maintenance-configuration&lt;/code&gt; API for workloads requiring efficient pruning across multiple columns simultaneously. Z-order reorders data spatially so that records with similar values across multiple columns are physically co-located in the same files.&lt;/p&gt;
&lt;p&gt;In July 2025, AWS reduced S3 Tables compaction processing fees by up to 90% for binpack operations and 80% for sort and z-order compaction. For teams that were previously avoiding S3 Tables due to per-operation pricing, this makes the managed maintenance model substantially more cost-competitive with self-managed Iceberg on standard S3.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Configure target file size for S3 Tables compaction
aws s3tables put-table-maintenance-configuration \
    --table-bucket-arn arn:aws:s3tables:us-east-1:123456789:bucket/my-table-bucket \
    --namespace analytics \
    --name events \
    --type icebergCompaction \
    --value &apos;{&amp;quot;status&amp;quot;: &amp;quot;enabled&amp;quot;, &amp;quot;settings&amp;quot;: {&amp;quot;icebergCompaction&amp;quot;: {&amp;quot;targetFileSizeMB&amp;quot;: 256}}}&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The S3 Tables maintenance model has one significant constraint: it only applies to tables stored in S3 Table Buckets. Iceberg tables stored in standard S3 general-purpose buckets don&apos;t receive automatic maintenance. Those tables require either self-managed Spark or Flink maintenance jobs, or Glue Data Catalog–based compaction configuration.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Self-Managed Maintenance with Spark and Iceberg APIs&lt;/h2&gt;
&lt;p&gt;For teams that can&apos;t use managed maintenance services (whether due to cloud provider, cost structure, or operational preference), the Iceberg Java API provides direct maintenance actions that can be wrapped in Spark or Flink jobs.&lt;/p&gt;
&lt;p&gt;The standard pattern for compaction using the Iceberg Spark Actions API:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from pyspark.sql import SparkSession
from org.apache.iceberg.spark.actions import SparkActions
from org.apache.iceberg.expressions import Expressions

spark = SparkSession.builder.getOrCreate()
table = spark.catalog.loadTable(&amp;quot;catalog.analytics.events&amp;quot;)

# Run compaction on historical partitions (not the hot partition)
rewrite_result = SparkActions.get() \
    .rewriteDataFiles(table) \
    .option(&amp;quot;target-file-size-bytes&amp;quot;, str(128 * 1024 * 1024)) \
    .option(&amp;quot;partial-progress.enabled&amp;quot;, &amp;quot;true&amp;quot;) \
    .filter(
        Expressions.lessThan(&amp;quot;event_date&amp;quot;, &amp;quot;2025-05-23&amp;quot;)
    ) \
    .execute()

print(f&amp;quot;Compacted {rewrite_result.rewrittenDataFilesCount()} files into &amp;quot;
      f&amp;quot;{rewrite_result.addedDataFilesCount()} new files&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;filter&lt;/code&gt; parameter is critical for streaming tables. Always exclude the partition currently receiving writes, the &amp;quot;hot&amp;quot; partition. If compaction attempts to rewrite files in a partition where a streaming job is actively writing, the commit can conflict, failing the compaction job and potentially causing the streaming job to retry or fail.&lt;/p&gt;
&lt;p&gt;For snapshot management:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from org.apache.iceberg.spark.actions import SparkActions
import datetime

# Expire snapshots older than 7 days, retain at least 5 snapshots
expire_result = SparkActions.get() \
    .expireSnapshots(table) \
    .expireOlderThan(
        (datetime.datetime.now() - datetime.timedelta(days=7)).timestamp() * 1000
    ) \
    .retainLast(5) \
    .execute()

print(f&amp;quot;Deleted {expire_result.deletedDataFilesCount()} orphaned data files&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Schedule these maintenance jobs with awareness of the write schedule. Running compaction during peak write periods competes for I/O and compute resources. The standard recommendation is off-peak maintenance windows (late night or early morning), with compaction running hourly or every few hours for active streaming tables.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Comparing the Approaches&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/automating-table-maintenance/manual-vs-automated-maintenance-comparison.png&quot; alt=&quot;Comparison table showing manual scheduling versus automated (Predictive/S3 Tables) maintenance across six dimensions: trigger, scope, cost awareness, configuration, hot partition risk, and streaming table support&quot;&gt;&lt;/p&gt;
&lt;p&gt;The right choice depends on your operating model and platform. Teams on Databricks Unity Catalog should adopt Predictive Optimization for all managed tables, the default behavior requires no configuration and the cost-aware scheduling avoids running maintenance on tables that don&apos;t need it. Teams on AWS building new Iceberg infrastructure should evaluate S3 Tables for workloads where the 90% cost reduction on compaction processing makes the managed model economically competitive.&lt;/p&gt;
&lt;p&gt;Self-managed maintenance remains the appropriate choice for teams with multi-cloud platforms, strict control requirements, or existing operational processes built around Airflow DAGs and Spark jobs. The Iceberg Actions API is production-grade and well-documented. The operational cost is the scheduling complexity and the risk of hot partition conflicts if exclusion logic isn&apos;t implemented carefully.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Table maintenance is no longer optional when you&apos;re running streaming pipelines into Iceberg tables. The question is whether you implement it reactively, proactively on a fixed schedule, or through an adaptive platform layer that decides when and what to compact based on actual usage patterns.&lt;/p&gt;
&lt;p&gt;If you&apos;re starting fresh on Databricks or AWS, the automated options have matured to the point where self-managed maintenance is harder to justify on engineering time alone. If you&apos;re self-managing, set a file-count monitoring alert at 500 files per partition and treat that as the trigger for a compaction run. Don&apos;t wait for query performance to tell you there&apos;s a problem.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Monitoring Compaction Health with Key Metrics&lt;/h2&gt;
&lt;p&gt;Effective compaction management requires monitoring the right metrics. Compaction is a reactive operation; you want to trigger it based on leading indicators rather than waiting for query degradation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Files per partition.&lt;/strong&gt; This is the primary leading indicator. Iceberg metadata exposes this through the &lt;code&gt;files&lt;/code&gt; system table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Track file count per partition in an Iceberg table
SELECT
    partition,
    COUNT(*) as file_count,
    SUM(file_size_in_bytes) / (1024 * 1024) as total_size_mb,
    AVG(file_size_in_bytes) / (1024 * 1024) as avg_file_size_mb
FROM my_catalog.analytics.events.files
GROUP BY partition
ORDER BY file_count DESC
LIMIT 20;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Snapshot count.&lt;/strong&gt; Tables receiving frequent streaming writes accumulate snapshots quickly. Track snapshot count to determine when expiration is needed:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Check snapshot count and age distribution
SELECT
    COUNT(*) as total_snapshots,
    MIN(committed_at) as oldest_snapshot,
    MAX(committed_at) as newest_snapshot,
    DATEDIFF(day, MIN(committed_at), MAX(committed_at)) as age_days
FROM my_catalog.analytics.events.history;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Compaction effectiveness ratio.&lt;/strong&gt; Compare files-before vs files-after for completed compaction runs. The ideal ratio is 50+ input files per 1 output file. Low ratios (10:1) indicate compaction is running too frequently on tables that don&apos;t have sufficient small-file accumulation.&lt;/p&gt;
&lt;p&gt;Building a simple dashboard from these three metrics (files per partition, snapshot count, and compaction ratio), gives maintenance teams visibility into table health without requiring deep inspection of individual files.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Z-Order Optimization for Multi-Dimensional Queries&lt;/h2&gt;
&lt;p&gt;Standard binpack compaction merges small files without changing their sort order. Z-order compaction applies a space-filling curve to reorder data within merged files, co-locating rows with similar values across multiple dimensions.&lt;/p&gt;
&lt;p&gt;For analytics tables where queries frequently filter on two or more columns, Z-order can dramatically improve predicate pushdown effectiveness. A table storing web events that is queried by both &lt;code&gt;user_id&lt;/code&gt; and &lt;code&gt;event_date&lt;/code&gt; benefits from Z-order on both columns:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Apply Z-order compaction on Databricks Delta (equivalent pattern)
OPTIMIZE events
ZORDER BY (user_id, event_date);

-- On Iceberg with sort order, set at table creation or with alter
ALTER TABLE my_catalog.analytics.events
WRITE ORDERED BY (user_id, event_date);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The trade-off with Z-order compaction is higher write cost than binpack. Z-order requires a full sort pass before writing output files, which uses more memory and compute per file rewritten. For high-churn tables receiving continuous streaming writes, running Z-order on all new data is impractical. The recommended pattern is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Hot, recent data:&lt;/strong&gt; Binpack compaction to merge small files quickly&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cold, historical data:&lt;/strong&gt; Z-order compaction to optimize for read performance&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This tiered approach applies Z-order only where the read performance benefit justifies the higher compaction cost; typically data older than a few days that is well-established in the table and unlikely to receive further updates.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Partition Evolution: Planning for Growth&lt;/h2&gt;
&lt;p&gt;One of the most expensive compaction scenarios is a poorly designed partition strategy that requires a full table rewrite to fix. Iceberg&apos;s partition evolution feature allows changing a table&apos;s partitioning scheme without rewriting existing files, new files use the new scheme while old files retain their original partition structure.&lt;/p&gt;
&lt;p&gt;Updating a table from daily partitioning to hourly partitioning as data volume grows:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Add a new partition field without rewriting existing data
ALTER TABLE my_catalog.analytics.events
ADD PARTITION FIELD hour(event_time);

-- Remove the old daily partition field from new writes
-- (existing files still use the old partition, new files use the new one)
ALTER TABLE my_catalog.analytics.events
DROP PARTITION FIELD days(event_date);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After partition evolution, old data files retain the daily partition structure while new files use the hourly partition. Iceberg&apos;s hidden partitioning ensures queries remain transparent, the planner handles both partition schemes simultaneously. Over time, natural turnover (through retention policies or explicit rewrites) eliminates the old partition format from the table.&lt;/p&gt;
&lt;p&gt;Planning partition strategy before a table reaches scale avoids the costly alternative: exporting all data, dropping the table, recreating with the correct partition scheme, and re-ingesting everything. Iceberg&apos;s partition evolution is one of the format&apos;s most operationally valuable features for growing data platforms.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Learn More About Lakehouse Operations&lt;/h3&gt;
&lt;p&gt;To build deeper expertise in lakehouse architecture, compaction strategies, and open table format operations, pick up &lt;a href=&quot;https://www.amazon.com/dp/B0GQNY21TD&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI: A Hands-On Practitioner&apos;s Guide to Modern Data Architecture, Open Table Formats, and Agentic AI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Browse Alex&apos;s other data engineering and analytics books at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Dremio automatically handles reflection refreshes and query acceleration on top of your Iceberg tables without requiring manual materialization management. Try it free at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Choosing the Right Iceberg Control Plane: Polaris vs. Unity Catalog vs. Cloud REST</title><link>https://iceberglakehouse.com/posts/2026-05-24-choosing-iceberg-control-plane/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-24-choosing-iceberg-control-plane/</guid><description>
Modern data architecture is undergoing a quiet but fundamental shift. For years, teams focused on choosing the right open table format, debating the ...</description><pubDate>Sun, 24 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Modern data architecture is undergoing a quiet but fundamental shift. For years, teams focused on choosing the right open table format, debating the file-level mechanics of Delta Lake versus Apache Iceberg. Today, that format debate is largely settled by metadata interoperability.&lt;/p&gt;
&lt;p&gt;Databricks added native Iceberg support in June 2025, Google Cloud enabled BigQuery read/write interoperability with Iceberg-compatible engines in April 2026, and Snowflake continues to establish open catalogs as standard enterprise infrastructure. The real battleground has moved up the stack from file formats to the metadata control plane.&lt;/p&gt;
&lt;p&gt;When multiple compute engines, such as Apache Spark for batch ETL, Apache Flink for real-time streaming, Trino for ad-hoc SQL, and Snowflake or BigQuery for enterprise BI, need to read and write to the same shared files simultaneously, they cannot rely on local file structures. They need a centralized authority to coordinate table updates, track snapshots, and enforce security policies.&lt;/p&gt;
&lt;p&gt;This centralized coordinator is the Apache Iceberg catalog. Selecting the right catalog control plane is now the most critical design decision in lakehouse engineering, dictating your platform&apos;s security boundaries, cloud costs, and multi-engine interoperability.&lt;/p&gt;
&lt;p&gt;To understand why this layer has become so strategic, you must look at how the data lakehouse has evolved. The first wave of lakehouse design separated compute from storage. You stored your data as Parquet files in an open S3 bucket and spun up compute engines dynamically to run queries.&lt;/p&gt;
&lt;p&gt;However, this model created a metadata vacuum. Because S3 has no native understanding of schemas, transaction records, or table partitions, every engine had to maintain its own list of what files made up a table. The second wave of the lakehouse is about separating metadata from compute and storage. The catalog control plane is that decoupled metadata layer.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/choosing-iceberg-control-plane/control-plane-architecture.png&quot; alt=&quot;Architecture diagram showing multiple engines like Spark, Flink, Trino, and Snowflake communicating with a unified Iceberg REST Catalog control plane which delegates to S3 storage&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Decoupled Interoperability: The Iceberg REST Catalog Standard&lt;/h2&gt;
&lt;p&gt;In early data lake designs, engines interacted directly with physical storage catalogs. The query engine read metadata files directly from S3 or polled a Hive Metastore database to discover which Parquet files belonged to a table.&lt;/p&gt;
&lt;p&gt;This tightly coupled design introduced several structural problems. Every engine had to implement its own lock management and metadata parsing logic, which frequently led to commit collisions, read-write drift, and vendor lock-in.&lt;/p&gt;
&lt;p&gt;The Apache Iceberg REST Catalog specification decouples the compute engine from metadata management. It defines a standardized, language-agnostic OpenAPI specification for catalog operations.&lt;/p&gt;
&lt;p&gt;Under this model, the query engine never inspects raw storage directories to determine table states. Instead, it sends standard HTTP requests to a REST Catalog endpoint:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Query Engine (Spark/Trino) ──► GET /v1/namespaces/db/tables/events ──► REST Catalog Server
Query Engine (Spark/Trino) ◄── [JSON Table Metadata &amp;amp; Storage Token] ◄── REST Catalog Server
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The REST Catalog server handles the request, interacts with the underlying metadata database, and returns a JSON payload containing the table&apos;s schema, partition specifications, current snapshot details, and temporary security credentials to access the data files.&lt;/p&gt;
&lt;p&gt;This API-first design provides several platform benefits. It standardizes catalog operations across diverse compute engines, allowing a Spark write commit and a Trino read request to use the exact same catalog interface.&lt;/p&gt;
&lt;p&gt;It centralizes transaction management, allowing the REST server to handle commit conflicts and enforce Optimistic Concurrency Control (OCC) without relying on engine-specific file locking.&lt;/p&gt;
&lt;p&gt;Finally, it secures storage access through credential vending. The catalog server issues temporary, scoped access tokens (such as AWS STS tokens) to the compute engine for a specific table path, avoiding the security risk of sharing root-level storage credentials with query clients.&lt;/p&gt;
&lt;p&gt;Additionally, standardizing the JSON response schema for table metadata ensures that secondary metadata details (such as column statistics, sort orders, and partition specs) are interpreted identically by all reading engines.&lt;/p&gt;
&lt;p&gt;Without this standard interface, differences in how engines (like Trino versus Spark) parsed physical metadata files often led to query plan mismatches. The REST catalog eliminates this discrepancy by acting as the single, authoritative interpreter of table state.&lt;/p&gt;
&lt;p&gt;A standard Iceberg REST Catalog response for a table request contains structured sections detailing the table state:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;metadata-location&amp;quot;: &amp;quot;s3://my-lakehouse/metadata/00001-abc.metadata.json&amp;quot;,
  &amp;quot;metadata&amp;quot;: {
    &amp;quot;format-version&amp;quot;: 2,
    &amp;quot;table-uuid&amp;quot;: &amp;quot;a1b2c3d4-e5f6-7a8b-9c0d-1e2f3a4b5c6d&amp;quot;,
    &amp;quot;location&amp;quot;: &amp;quot;s3://my-lakehouse/data&amp;quot;,
    &amp;quot;last-updated-ms&amp;quot;: 1779629242000,
    &amp;quot;last-column-id&amp;quot;: 3,
    &amp;quot;schemas&amp;quot;: [
      {
        &amp;quot;type&amp;quot;: &amp;quot;struct&amp;quot;,
        &amp;quot;fields&amp;quot;: [
          { &amp;quot;id&amp;quot;: 1, &amp;quot;name&amp;quot;: &amp;quot;id&amp;quot;, &amp;quot;required&amp;quot;: true, &amp;quot;type&amp;quot;: &amp;quot;int&amp;quot; },
          { &amp;quot;id&amp;quot;: 2, &amp;quot;name&amp;quot;: &amp;quot;event_date&amp;quot;, &amp;quot;required&amp;quot;: true, &amp;quot;type&amp;quot;: &amp;quot;date&amp;quot; },
          {
            &amp;quot;id&amp;quot;: 3,
            &amp;quot;name&amp;quot;: &amp;quot;metric_value&amp;quot;,
            &amp;quot;required&amp;quot;: false,
            &amp;quot;type&amp;quot;: &amp;quot;double&amp;quot;
          }
        ]
      }
    ],
    &amp;quot;current-schema-id&amp;quot;: 0,
    &amp;quot;partition-specs&amp;quot;: [
      {
        &amp;quot;spec-id&amp;quot;: 0,
        &amp;quot;fields&amp;quot;: [
          {
            &amp;quot;source-id&amp;quot;: 2,
            &amp;quot;field-id&amp;quot;: 1000,
            &amp;quot;name&amp;quot;: &amp;quot;event_date&amp;quot;,
            &amp;quot;transform&amp;quot;: &amp;quot;identity&amp;quot;
          }
        ]
      }
    ],
    &amp;quot;default-spec-id&amp;quot;: 0,
    &amp;quot;last-partition-id&amp;quot;: 1000,
    &amp;quot;snapshots&amp;quot;: [
      {
        &amp;quot;snapshot-id&amp;quot;: 987654321,
        &amp;quot;timestamp-ms&amp;quot;: 1779629242000,
        &amp;quot;summary&amp;quot;: {
          &amp;quot;operation&amp;quot;: &amp;quot;append&amp;quot;,
          &amp;quot;added-data-files&amp;quot;: &amp;quot;4&amp;quot;
        },
        &amp;quot;manifest-list&amp;quot;: &amp;quot;s3://my-lakehouse/metadata/snap-987654321.manifest.list.avro&amp;quot;
      }
    ],
    &amp;quot;current-snapshot-id&amp;quot;: 987654321
  },
  &amp;quot;config&amp;quot;: {
    &amp;quot;client.factory&amp;quot;: &amp;quot;org.apache.iceberg.rest.auth.OAuth2Client&amp;quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This clean structure isolates the reading client from having to query physical directories or parse raw metadata files. The catalog server performs the lookups and serves the exact logical plan foundations in a unified JSON response.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Independent and Open: Apache Polaris Architecture&lt;/h2&gt;
&lt;p&gt;Apache Polaris is an open-source, vendor-neutral metadata control plane designed for Apache Iceberg tables. Polaris graduated to a Top-Level Project (TLP) at the Apache Software Foundation (ASF) on February 18, 2026. This independent status ensures that Polaris operates under community-driven governance, free from single-vendor lock-in.&lt;/p&gt;
&lt;p&gt;Architecturally, Polaris acts as a stateless REST Catalog server that communicates with a backend metadata database (such as PostgreSQL, MySQL, or CockroachDB). Polaris provides a unified namespace where you can manage Iceberg tables and register external catalog sources.&lt;/p&gt;
&lt;p&gt;Polaris implements a zero-trust security model centered on temporary credential vending. When a client engine queries a table, Polaris does not share raw IAM keys. Instead, the Polaris server negotiates temporary tokens (such as AWS STS scoped sessions or Google Cloud Service Account impersonations) that allow the engine to access only the specific storage path linked to that table.&lt;/p&gt;
&lt;p&gt;The v1.4 release of Apache Polaris (April 2026) introduced several updates designed for production deployments:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;AWS STS Session Tag Customization:&lt;/strong&gt; Platform administrators can now map specific catalog parameters (such as the Polaris realm, catalog name, or database name) directly to AWS STS session tags. When an engine reads storage, these tags propagate to AWS CloudTrail, providing audit logs that tie S3 file operations back to specific catalog tables and users.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage-Scoped Key Management:&lt;/strong&gt; Polaris enables storage-scoped credential vending down to the individual table prefix. This means separate tables in the same storage bucket can be encrypted with distinct KMS keys, allowing administrators to restrict access at the bucket level while delegating keys dynamically based on catalog RBAC roles.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metrics Persistence:&lt;/strong&gt; Polaris now supports persisting query execution metrics and commit statistics directly to its catalog database. This enables teams to monitor read-write patterns, track catalog performance, and identify slow commits across multiple engines.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CockroachDB Backend Integration:&lt;/strong&gt; The database storage layer has been optimized to support CockroachDB, providing horizontally-scalable metadata storage for high-concurrency enterprise catalogs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gateway API Support:&lt;/strong&gt; Helm charts have been updated to support the Kubernetes Gateway API, simplifying ingress routing and certificate management in containerized environments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;UV Packaging:&lt;/strong&gt; The Python packaging and dependency infrastructure switched from Poetry to UV, significantly reducing the build and deployment times of custom Polaris clients and CLI tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To configure an Apache Spark session to connect to a Polaris server using the standard Iceberg REST Catalog API, you define the catalog properties in your configuration file or code:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from pyspark.sql import SparkSession

# Initialize Spark Session with Apache Polaris REST Catalog configuration
spark = SparkSession.builder \
    .appName(&amp;quot;PolarisCatalogConnection&amp;quot;) \
    .config(&amp;quot;spark.sql.extensions&amp;quot;, &amp;quot;org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.polaris&amp;quot;, &amp;quot;org.apache.iceberg.spark.SparkCatalog&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.polaris.type&amp;quot;, &amp;quot;rest&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.polaris.uri&amp;quot;, &amp;quot;http://polaris-server.data-platform.local:8181/api/catalog&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.polaris.credential&amp;quot;, &amp;quot;client_id_123:client_secret_xyz&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.polaris.warehouse&amp;quot;, &amp;quot;my_s3_warehouse&amp;quot;) \
    .getOrCreate()

# Query an Iceberg table managed by Polaris
df = spark.read.table(&amp;quot;polaris.db.sales_records&amp;quot;)
df.show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This open architecture makes Polaris the preferred catalog control plane for organizations building multi-engine, multi-cloud platforms using standard open-source technologies.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Serverless Simplicity: Snowflake Open Catalog &amp;amp; Horizon&lt;/h2&gt;
&lt;p&gt;For teams that want the interoperability of Apache Polaris but do not want to manage the operational overhead of running a self-hosted metadata server, Snowflake provides a fully managed implementation: &lt;strong&gt;Snowflake Open Catalog&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Open Catalog is a serverless, managed version of the Polaris core engine. It retains 100% API compatibility with open-source Polaris, ensuring that you can migrate between self-hosted Polaris and Snowflake-managed instances without changing client code or rewriting metadata schemas. Snowflake charges for this catalog on a pay-per-request billing model scheduled for rollout in the first half of 2026.&lt;/p&gt;
&lt;p&gt;Within the Snowflake ecosystem, Open Catalog serves as the bridge to &lt;strong&gt;Snowflake Horizon&lt;/strong&gt;. Horizon is Snowflake&apos;s broader compliance, security, and data governance platform.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;External Engine (Trino/Spark) ──► Open Catalog (Polaris API)
                                            │
                                            ▼
                              Snowflake Horizon Governance
                                            │
                                            ├─► Row-Level Security
                                            └─► Column Masking
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Horizon integrates with Open Catalog to enforce data access policies across heterogeneous compute environments. You can define access policies, such as row-level filters and column-level masking rules, inside Snowflake using standard SQL.&lt;/p&gt;
&lt;p&gt;When an external query engine (like Apache Spark or Trino) calls the Open Catalog REST API to plan a query, Horizon intercepts the request, evaluates the user&apos;s role and database permissions, and down-scopes the returned Iceberg metadata.&lt;/p&gt;
&lt;p&gt;The external engine receives only the specific data files and columns the user is authorized to view. This pattern enforces consistent, unified governance across all query tools without requiring you to duplicate policy definitions in every engine.&lt;/p&gt;
&lt;p&gt;For example, if you define a masking policy on a column named &lt;code&gt;social_security_number&lt;/code&gt; to only show the last four digits, the policy is evaluated at the metadata level during query planning.&lt;/p&gt;
&lt;p&gt;When Trino requests the list of manifest files, Snowflake Horizon intercepts the call and modifies the returned schema metadata. The external engine does not receive the raw columns or physical data addresses for the masked data, preventing unauthorized reading of the physical Parquet storage blocks.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Multi-Format Security: Open-Source Unity Catalog&lt;/h2&gt;
&lt;p&gt;While Polaris and Snowflake Open Catalog focus exclusively on the Apache Iceberg format, Databricks has taken a multi-format approach by open-sourcing &lt;strong&gt;Unity Catalog&lt;/strong&gt;. Licensed under the Apache 2.0 license and managed as a sandbox project under the LF AI &amp;amp; Data Foundation, open-source Unity Catalog acts as a unified catalog for Delta Lake, Apache Iceberg, Apache Hudi, and unstructured files.&lt;/p&gt;
&lt;p&gt;Unity Catalog provides a metadata governance layer that extends beyond basic table files. It tracks data lineage, registers machine learning models, manages volumes (unstructured files), and handles access control policies.&lt;/p&gt;
&lt;p&gt;Key technical milestones for open-source Unity Catalog in 2026 include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Catalog Commits (GA 2026):&lt;/strong&gt; Coordinates write transactions directly at the catalog layer. Instead of engines writing transaction files directly to S3 and risking conflicts, the catalog commits changes atomically. This eliminates race conditions when multiple engines (such as Spark and Flink) write to the same tables concurrently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Business Semantics Open Sourcing (April 2026):&lt;/strong&gt; Databricks open-sourced the core implementation of its Business Semantics layer. This framework allows developers to define governed metrics, dimensions, and logic inside the catalog using an open format. These definitions integrate directly with Apache Spark, translating natural-language queries from BI tools and AI agents into deterministic SQL queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta Sharing integration:&lt;/strong&gt; Includes built-in support for Delta Sharing, providing secure, real-time sharing of tables and ML models across clouds and platforms without copying physical data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To access tables managed by an open-source Unity Catalog server using a Python client (such as DuckDB), you can register the catalog connection directly:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import duckdb

# Connect to a local Unity Catalog server using the REST API
con = duckdb.connect()
con.execute(&amp;quot;INSTALL uc_catalog;&amp;quot;)
con.execute(&amp;quot;LOAD uc_catalog;&amp;quot;)

# Register the Unity Catalog server endpoint
con.execute(&amp;quot;&amp;quot;&amp;quot;
    CREATE SECRET (
        TYPE UC,
        TOKEN &apos;uc_token_abc123&apos;,
        ENDPOINT &apos;http://unity-server.data-platform.local:8080&apos;
    );
&amp;quot;&amp;quot;&amp;quot;)

# Query a Delta or Iceberg table managed by Unity Catalog
df = con.execute(&amp;quot;SELECT * FROM unity_catalog.main.db.inventory&amp;quot;).fetchdf()
print(df.head())
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Unity Catalog&apos;s multi-format support makes it a strong option for teams managing mixed data environments containing both Delta Lake and Apache Iceberg tables.&lt;/p&gt;
&lt;p&gt;The catalog commits mechanism is particularly significant for multi-engine architectures. In traditional setups, if Spark and Flink attempt to write to the same table concurrently, they rely on basic file-level locking, which can result in write collisions and orphaned data files.&lt;/p&gt;
&lt;p&gt;Unity Catalog&apos;s commit service coordinates these operations, checking table schemas and transaction sequences before applying the changes. This ensures ACID compliance and prevents write failures across different processing frameworks.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/choosing-iceberg-control-plane/catalog-governance-comparison.png&quot; alt=&quot;Stack diagram showing the security and catalog commit layers of modern Iceberg control planes including RBAC, credential vending, and multi-engine commits&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Cloud Native Scoping: AWS Glue &amp;amp; Google Cloud Managed REST&lt;/h2&gt;
&lt;p&gt;In addition to open-source and serverless options, the major cloud providers offer native managed REST endpoints for Iceberg catalogs.&lt;/p&gt;
&lt;h3&gt;AWS Glue Iceberg REST Catalog&lt;/h3&gt;
&lt;p&gt;AWS Glue provides a managed, serverless REST Catalog endpoint for Iceberg tables. This catalog integrates directly with the AWS security and metadata stack, including AWS IAM for authentication, AWS Lake Formation for column-level access control, and Amazon S3.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Implementation:&lt;/em&gt; Computes (such as AWS Athena, Amazon EMR, and AWS Glue Jobs) connect to the REST endpoint using IAM role authentication. Glue handles metadata commits, coordinates transactional writes, and manages metadata file scaling.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Workload Fit:&lt;/em&gt; This option is ideal for AWS-centric data platforms. If your storage, computing, and BI tools are hosted entirely inside AWS, Glue provides a zero-maintenance, serverless control plane. However, accessing this catalog from external clouds (such as Google Cloud or Azure) requires setting up complex IAM federation and cross-account credentials.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Security Details:&lt;/em&gt; By linking AWS Glue to AWS Lake Formation, administrators can set security policies using tag-based access control (TBAC). This allows you to tag tables as &amp;quot;Confidential&amp;quot; or &amp;quot;Public&amp;quot; and grant access to IAM users or roles based on these tags, which the Glue REST Catalog enforces dynamically for all connected engines.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Google Cloud BigQuery Managed REST Catalog&lt;/h3&gt;
&lt;p&gt;In April 2026, Google Cloud announced the preview of a managed Iceberg-compatible REST Catalog interface for BigQuery. This interface allows external query engines to read and write Iceberg metadata managed directly by Google Cloud&apos;s catalog service.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Implementation:&lt;/em&gt; The BigQuery catalog acts as the single source of truth for table schemas and snapshots. External engines (such as a Spark job running on Dataproc or an external Trino cluster) query the BigQuery REST endpoint to resolve table states. BigQuery handles the metadata updates and ensures that changes are reflected in BigQuery storage.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Workload Fit:&lt;/em&gt; This managed catalog provides read/write interoperability for hybrid platforms using Google Cloud storage. It allows teams to use BigQuery&apos;s storage speed while running specialized analytical workloads on external open-source engines.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;Feature Matrix: Comparing Access Control and Credential Vending&lt;/h2&gt;
&lt;p&gt;Selecting a catalog requires comparing their support for key platform features:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Apache Polaris (v1.4)&lt;/th&gt;
&lt;th&gt;Open-Source Unity Catalog&lt;/th&gt;
&lt;th&gt;Snowflake Open Catalog&lt;/th&gt;
&lt;th&gt;AWS Glue REST Catalog&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Governance Body&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache Software Foundation&lt;/td&gt;
&lt;td&gt;Linux Foundation (LF AI &amp;amp; Data)&lt;/td&gt;
&lt;td&gt;Snowflake (Managed Polaris)&lt;/td&gt;
&lt;td&gt;Proprietary (AWS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Formats&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache Iceberg&lt;/td&gt;
&lt;td&gt;Delta Lake, Iceberg, Hudi&lt;/td&gt;
&lt;td&gt;Apache Iceberg&lt;/td&gt;
&lt;td&gt;Apache Iceberg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Credential Vending&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native (AWS STS, GCP, Azure)&lt;/td&gt;
&lt;td&gt;Delta Sharing Protocol&lt;/td&gt;
&lt;td&gt;Native Serverless Vending&lt;/td&gt;
&lt;td&gt;Cloud IAM Integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Access Control&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Role-Based Access (RBAC)&lt;/td&gt;
&lt;td&gt;Lineage, Metric Semantics&lt;/td&gt;
&lt;td&gt;Snowflake Horizon (RBAC/FGAC)&lt;/td&gt;
&lt;td&gt;AWS Lake Formation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metadata Commits&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;REST API Commit Protocol&lt;/td&gt;
&lt;td&gt;Catalog Commits Service&lt;/td&gt;
&lt;td&gt;REST API Commit Protocol&lt;/td&gt;
&lt;td&gt;AWS Glue Commit API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Billing Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free (Open Source)&lt;/td&gt;
&lt;td&gt;Free (Open Source)&lt;/td&gt;
&lt;td&gt;Managed (Pay-per-request)&lt;/td&gt;
&lt;td&gt;Serverless (Glue request pricing)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2&gt;Decision Framework: Choosing the Right Catalog Control Plane&lt;/h2&gt;
&lt;p&gt;No single catalog control plane fits every data architecture. Your choice depends on your existing technology stack, cloud provider alignment, and format requirements.&lt;/p&gt;
&lt;p&gt;Use this engineering decision framework to guide your selection:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Are you a Databricks-centric shop running Delta Lake tables?&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Recommendation:&lt;/em&gt; Use &lt;strong&gt;Unity Catalog&lt;/strong&gt;. It provides native delta-commit logic, delta sharing, and lineage tracking. The open-source version allows you to integrate non-Databricks engines into your catalog namespace.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Are you building an open-source, multi-engine lakehouse utilizing Apache Iceberg?&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Recommendation:&lt;/em&gt; Use &lt;strong&gt;Apache Polaris&lt;/strong&gt;. Its vendor-neutral ASF governance, advanced AWS/GCP credential vending, and v1.4 updates (such as STS session tag auditing and metrics persistence) make it the optimal standard for open data platforms.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Do you use Snowflake as your primary query engine but want to query tables with external engines (like Spark or Flink) without vendor lock-in?&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Recommendation:&lt;/em&gt; Use &lt;strong&gt;Snowflake Open Catalog&lt;/strong&gt;. It provides a managed, serverless implementation of Polaris that integrates with Snowflake Horizon, allowing you to enforce row-level security and column masking across all query engines.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Are you hosted entirely inside a single cloud provider (AWS or Google Cloud) and want a managed, serverless catalog?&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Recommendation:&lt;/em&gt; Use &lt;strong&gt;AWS Glue REST Catalog&lt;/strong&gt; or &lt;strong&gt;BigQuery Managed REST Catalog&lt;/strong&gt;. These options integrate with cloud-native security and IAM policies, reducing operational setup overhead.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For example, if your pipeline is built primarily on Spark and Trino running on AWS EMR, using self-hosted Apache Polaris or AWS Glue REST Catalog is the logical choice. However, if your business units query data using both Databricks and Snowflake, deploying a managed Snowflake Open Catalog or using Delta Sharing via Unity Catalog provides the necessary bridge to avoid data duplication.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/choosing-iceberg-control-plane/catalog-selection-flowchart.png&quot; alt=&quot;Flowchart decision tree helping engineers evaluate catalog requirements and select the correct Iceberg control plane&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The metadata control plane is the strategic layer of the modern data lakehouse. By decoupling query execution from metadata management, the Apache Iceberg REST Catalog specification enables true multi-engine interoperability and secure data access.&lt;/p&gt;
&lt;p&gt;Whether you deploy the open-source Apache Polaris server, open-source Unity Catalog, Snowflake Open Catalog, or cloud-native managed REST endpoints, establishing a centralized catalog is the critical step to scale your data platform and secure your metadata.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Master the Modern Lakehouse&lt;/h3&gt;
&lt;p&gt;To build your expertise in modern data architectures, open table formats, and semantic metadata design, consider the following next steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Read the Lakehouse Guide:&lt;/strong&gt; Order a copy of &lt;a href=&quot;https://www.amazon.com/dp/B0GQNY21TD&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI: A Hands-On Practitioner&apos;s Guide to Modern Data Architecture, Open Table Formats, and Agentic AI&lt;/a&gt; for a detailed, hands-on exploration of building open data platforms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Explore Other Technical Books:&lt;/strong&gt; Find listings of Alex Merced&apos;s other data engineering and analytics books at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test Polaris Locally:&lt;/strong&gt; Run a local Apache Polaris instance using Docker and connect a local DuckDB or PySpark session to experiment with REST catalog commits and credential vending.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Try Dremio Cloud:&lt;/strong&gt; To query your Iceberg tables with sub-second performance, automated reflection tuning, and unified catalog integration, try Dremio Cloud free for 30 days at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Clean Rooms for Privacy-Preserving Analytics</title><link>https://iceberglakehouse.com/posts/2026-05-24-clean-rooms-privacy/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-24-clean-rooms-privacy/</guid><description>
Every organization that wants to collaborate on data faces the same tension. The analysis is valuable, matching your customer purchase history agains...</description><pubDate>Sun, 24 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Every organization that wants to collaborate on data faces the same tension. The analysis is valuable, matching your customer purchase history against a partner&apos;s ad impression data reveals attribution patterns that neither party could see alone. The data is sensitive, sharing raw customer records with an external party creates PII exposure risk, regulatory compliance problems, and the permanent problem of data copies that live outside your control.&lt;/p&gt;
&lt;p&gt;The historical solutions to this tension have been inadequate. You can share nothing, and lose the analytical value. You can share everything, and accept the compliance and security risks. You can negotiate a complex data contract that creates a one-time data copy under strict terms, and hope neither party violates them.&lt;/p&gt;
&lt;p&gt;Data clean rooms offer a fourth path. They create an isolated computational environment where both parties contribute data, queries run against the combined dataset inside the environment, and only aggregated, policy-filtered results leave. No raw row-level data from either party is ever accessible to the other.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Core Guarantee&lt;/h2&gt;
&lt;p&gt;The fundamental promise of a clean room is that neither party sees the other&apos;s individual records. This is enforced at the technical level, not just by contractual agreement.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/clean-rooms-privacy/data-clean-room-architecture.png&quot; alt=&quot;Clean room architecture diagram showing Party A and Party B data contributing via Delta Sharing to an isolated clean room environment with approved query templates, privacy budget, and policy engine, with differential privacy applied before aggregated output leaves&quot;&gt;&lt;/p&gt;
&lt;p&gt;The mechanics vary by platform, but the core model is consistent:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Both parties contribute datasets to the clean room environment through a secure data sharing mechanism (Delta Sharing, secure access links, or similar).&lt;/li&gt;
&lt;li&gt;The clean room enforces approved query templates, pre-defined SQL queries that analysts can parameterize but cannot modify in ways that would expose individual records.&lt;/li&gt;
&lt;li&gt;A privacy budget (often implemented via differential privacy) limits the total amount of information that can be extracted through repeated queries, preventing statistical re-identification attacks.&lt;/li&gt;
&lt;li&gt;Only aggregated, noise-added results leave the environment.&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h2&gt;Databricks Clean Rooms&lt;/h2&gt;
&lt;p&gt;Databricks Clean Rooms uses Delta Sharing as the underlying data access protocol. Each collaborating party shares specific tables into the clean room workspace using Delta Sharing&apos;s signed URL mechanism, the data remains in the contributor&apos;s storage, with access delegated to the clean room compute.&lt;/p&gt;
&lt;p&gt;The clean room administrator defines approved SQL queries as templates. A partner can parameterize these templates (filter by date range, product category, etc.) but cannot run arbitrary SQL that might expose individual rows. All query execution happens in the isolated clean room Databricks workspace, and only the query results leave.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Example: Creating a Databricks Clean Room collaboration
from databricks.sdk import WorkspaceClient

client = WorkspaceClient()

# Create the clean room
clean_room = client.clean_rooms.create(
    name=&amp;quot;partner_attribution_analysis&amp;quot;,
    remote_detailed_info={
        &amp;quot;collaborators&amp;quot;: [
            {&amp;quot;global_metastore_id&amp;quot;: &amp;quot;partner_metastore_id&amp;quot;,
             &amp;quot;invite_recipient_email&amp;quot;: &amp;quot;admin@partner.com&amp;quot;}
        ]
    }
)

# Define an approved output schema (only aggregations allowed)
# Partners can run: SELECT region, COUNT(*), SUM(revenue)
# grouped by their dimension attributes
# They cannot: SELECT customer_id, email, transaction_amount
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The separation is architectural. The partner&apos;s Databricks workspace never has credentials to read your underlying Delta Lake tables. Delta Sharing issues time-limited, scoped access tokens for the specific tables and operations the clean room requires.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;AWS Clean Rooms&lt;/h2&gt;
&lt;p&gt;AWS Clean Rooms provides a managed service that supports analysis across multiple parties&apos; data stored in S3, with optional Differential Privacy controls. Teams configure a collaboration in the AWS console, specify which tables from each party participate, and define analysis rules.&lt;/p&gt;
&lt;p&gt;Analysis rules in AWS Clean Rooms can be configured in three modes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Aggregation-only:&lt;/strong&gt; Queries must include &lt;code&gt;GROUP BY&lt;/code&gt; clauses and aggregation functions. No individual rows can be returned.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;List:&lt;/strong&gt; Allows returning a limited set of columns with required attributes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom:&lt;/strong&gt; Allows defining complex SQL with specific allowed functions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Differential Privacy feature in AWS Clean Rooms adds mathematically bounded noise to query results, providing formal privacy guarantees at the expense of some accuracy:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- AWS Clean Rooms query with differential privacy enabled
-- Results will have noise added based on configured epsilon value
SELECT
    campaign_id,
    COUNT(DISTINCT customer_id) AS attributed_customers,
    SUM(purchase_amount) AS total_attributed_revenue
FROM collaboration.matched_customers
GROUP BY campaign_id
HAVING COUNT(DISTINCT customer_id) &amp;gt;= 100;  -- Minimum count threshold enforced
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The minimum count threshold (&lt;code&gt;HAVING COUNT &amp;gt;= 100&lt;/code&gt;) prevents queries that isolate small groups from extracting information about individuals within those groups, even with noise addition.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;BigQuery Differential Privacy&lt;/h2&gt;
&lt;p&gt;BigQuery implements differential privacy natively in SQL through the &lt;code&gt;DIFFERENTIAL_PRIVACY&lt;/code&gt; clause, available in queries run against BigQuery datasets. This allows organizations to expose analytical views of sensitive datasets with formal privacy guarantees, without requiring a separate clean room environment.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- BigQuery differential privacy query
SELECT
    region,
    WITH DIFFERENTIAL_PRIVACY
        OPTIONS (epsilon = 1.0, delta = 1e-6, max_groups_contributed = 5)
        COUNT(DISTINCT user_id, contribution_bounds =&amp;gt; (0, 1)) AS unique_users,
        AVG(purchase_amount, contribution_bounds =&amp;gt; (0, 10000)) AS avg_purchase
FROM my_dataset.transactions
GROUP BY region;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;epsilon&lt;/code&gt; parameter (ε) controls the privacy-accuracy tradeoff. Smaller epsilon values add more noise, providing stronger privacy guarantees at the cost of result accuracy. The &lt;code&gt;delta&lt;/code&gt; parameter bounds the probability that the privacy guarantee fails. &lt;code&gt;max_groups_contributed&lt;/code&gt; limits how much any individual can affect the results by appearing in many groups.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Privacy Budget: The Finite Resource&lt;/h2&gt;
&lt;p&gt;Every query against a differentially private dataset consumes a portion of the privacy budget. The budget is a finite resource, once depleted, further queries expose more information about individuals than the privacy guarantee allows.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/clean-rooms-privacy/privacy-budget-depletion.png&quot; alt=&quot;Privacy budget depletion chart showing remaining budget (ε) decreasing with each query from 2.0 down to 0, with warning threshold at 50% and stop threshold where queries are blocked&quot;&gt;&lt;/p&gt;
&lt;p&gt;Practical privacy budget management requires tracking consumption across all queries run against a protected dataset, alerting when the budget reaches warning thresholds, and either blocking further queries or refreshing the dataset (which resets the budget) when the budget is depleted.&lt;/p&gt;
&lt;p&gt;In production clean room environments, this means instrumenting query execution to track epsilon consumption and building budget management tooling that enforces limits before queries run.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Clean Rooms vs Direct Data Sharing&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/clean-rooms-privacy/clean-room-vs-direct-sharing.png&quot; alt=&quot;Side-by-side comparison showing direct data sharing risks (PII exposure, no audit trail, proliferating copies, compliance gap) versus data clean room benefits (no raw data exchanged, query audit log, privacy budget enforced, GDPR/CCPA compliant)&quot;&gt;&lt;/p&gt;
&lt;p&gt;The comparison isn&apos;t purely about privacy. Direct data sharing creates data governance problems that compound over time: copies multiply, access controls drift, and audit trails are incomplete. Clean rooms create a single, policy-enforced access point that maintains an audit log of every query run.&lt;/p&gt;
&lt;p&gt;For GDPR and CCPA compliance specifically, clean rooms provide a more defensible data processing arrangement than bilateral data transfers. The legal basis for processing partner data within a clean room (where the data never leaves the contributor&apos;s control and cannot be accessed by the collaborator), is cleaner than the legal basis for a data copy transferred to a partner&apos;s environment.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Data clean rooms have moved from an enterprise niche (primarily advertising attribution measurement) to a general-purpose platform capability available in Databricks, AWS, and natively in BigQuery SQL. The technology is mature enough that most organizations with sensitive cross-party analysis needs can implement clean room collaboration without custom infrastructure.&lt;/p&gt;
&lt;p&gt;The governance discipline required is not primarily technical. It&apos;s about defining the right approved query templates, maintaining privacy budget controls, and treating clean room access as a governed capability with review processes for adding new queries to the approved template library.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Real-World Use Cases Beyond Ad Attribution&lt;/h2&gt;
&lt;p&gt;The media and advertising industry pioneered clean room adoption for campaign measurement. But the architecture is general-purpose, and 2025 saw adoption across several additional domains:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Healthcare collaboration.&lt;/strong&gt; Hospital networks combining patient outcomes data to improve treatment protocols, without sharing individual patient records across institutions. The clean room provides a HIPAA-compatible framework for multi-institution research that would otherwise require de-identification and data transfer agreements.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Financial services fraud detection.&lt;/strong&gt; Banks collaborating to identify cross-institution fraud patterns without sharing individual transaction records. A fraudster who moves money through multiple banks leaves a pattern visible only if the pattern can be detected in the combined dataset, which a clean room enables without raw data sharing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Retail and CPG supplier analysis.&lt;/strong&gt; Retailers analyzing category performance by combining their sales data with CPG manufacturers&apos; supply chain data. Neither party shares raw transaction records; the clean room environment computes joint metrics like out-of-stock correlation with competitor activity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Government statistics.&lt;/strong&gt; National statistics agencies combining census microdata with administrative records (tax, health, employment) to produce richer statistical outputs, with differential privacy applied to prevent re-identification of individuals in published statistics.&lt;/p&gt;
&lt;p&gt;In each case, the value of the combined dataset analysis exceeds the value of what either party can analyze independently, and the privacy-preserving architecture makes the collaboration legally and ethically feasible.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Legal Framework: Why Clean Rooms Simplify Compliance&lt;/h2&gt;
&lt;p&gt;The legal basis for cross-party data sharing under GDPR and CCPA depends significantly on how data flows between parties. A direct transfer of raw personal data from Party A to Party B typically requires:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A legal basis for the transfer (consent, legitimate interests, contract)&lt;/li&gt;
&lt;li&gt;A Data Processing Agreement (DPA) specifying how Party B handles the data&lt;/li&gt;
&lt;li&gt;Retention and deletion obligations for Party B&lt;/li&gt;
&lt;li&gt;Data subject access request obligations for Party B&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Clean rooms change this legal picture. When Party B never receives raw personal data (they only submit approved queries to an isolated environment and receive aggregated results), many of these obligations don&apos;t apply. Party B is effectively not a data controller or processor in the traditional sense; they&apos;re receiving statistical outputs, not personal data.&lt;/p&gt;
&lt;p&gt;This simplified legal basis makes the data sharing arrangement easier to approve through legal review and easier to audit for compliance. The clean room audit log provides documentary evidence that no individual records were transferred and that only approved query templates were executed.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Beyond Differential Privacy: Other Privacy-Preserving Techniques&lt;/h2&gt;
&lt;p&gt;Differential privacy is the most mathematically rigorous privacy technique and the one most commonly implemented in commercial clean room platforms. But it&apos;s not the only technique in the privacy engineering toolkit.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Secure Multi-Party Computation (MPC):&lt;/strong&gt; Multiple parties jointly compute a function over their combined data without revealing their individual inputs to each other. MPC provides exact results (no noise addition) but has higher computational overhead than differential privacy. It&apos;s most practical for specific operations (intersection size calculation, machine learning on joint data) rather than general analytics.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Federated Learning:&lt;/strong&gt; Model training that keeps data local to each party while only sharing model gradients. Each party trains on their local data, gradients are aggregated (with noise addition to protect individual contributions), and the updated model is distributed back without raw data movement. This is the approach used in Google&apos;s FL framework and Apple&apos;s on-device ML.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Synthetic Data Generation:&lt;/strong&gt; Creating statistically realistic synthetic datasets that preserve aggregate properties without containing actual individual records. Synthetic data can be shared freely because it doesn&apos;t represent real individuals. The limitation is that synthetic data quality degrades for rare subgroup analysis, the tail of the distribution is often poorly represented.&lt;/p&gt;
&lt;p&gt;For most enterprise cross-party analytics, differential privacy in a clean room environment provides the best balance of analytical utility and privacy guarantee. The other techniques are valuable for specific workloads where the operational overhead is justified.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Clean Room Adoption: Industry Use Cases&lt;/h2&gt;
&lt;p&gt;Clean room technology has seen practical adoption across several industries where the need for cross-party data analysis is high but data sharing is restricted by regulation, competitive concerns, or both.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Advertising and media measurement.&lt;/strong&gt; The deprecation of third-party cookies has accelerated adoption of clean rooms for identity resolution and campaign measurement. An advertiser brings their first-party customer data; a publisher brings their audience data. The clean room computes match rates, reach and frequency metrics, and conversion attribution without either party seeing the other&apos;s raw user records. Google&apos;s Ads Data Hub, Amazon Marketing Cloud, and Meta&apos;s Advanced Analytics are all clean room products built for this use case.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Healthcare and life sciences.&lt;/strong&gt; Pharmaceutical companies and health systems share data to conduct post-market safety studies, generate real-world evidence for drug approvals, and identify patient cohorts for clinical trial recruitment. HIPAA&apos;s Safe Harbor and Expert Determination standards establish the baseline for de-identification, but clean rooms with differential privacy provide mathematically provable guarantees beyond de-identification alone.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Financial services.&lt;/strong&gt; Banks and financial institutions collaborate on fraud detection, money laundering detection, and credit risk modeling without sharing customer account data. The UK&apos;s Open Banking framework and the EU&apos;s PSD2 directive create a legal pathway for this kind of collaboration, and clean rooms provide the technical infrastructure.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Retail and supply chain.&lt;/strong&gt; Retailers and consumer goods companies analyze category performance, promotional effectiveness, and inventory optimization using combined point-of-sale and supply chain data. The retailer&apos;s transaction data combined with the manufacturer&apos;s production and logistics data provides insights neither party can generate alone.&lt;/p&gt;
&lt;p&gt;Across all these use cases, the pattern is the same: two or more parties with valuable, sensitive datasets need to compute aggregate statistics that require combining their data, without exposing the underlying records. Clean rooms make this tractable where it was previously either legally or technically impossible.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Business Case for Privacy-Preserving Infrastructure&lt;/h2&gt;
&lt;p&gt;Privacy-preserving infrastructure represents a genuine competitive advantage for organizations that build it correctly. The ability to collaborate on data analysis without data exposure enables a class of business intelligence that competitors without clean room infrastructure can&apos;t access.&lt;/p&gt;
&lt;p&gt;For organizations that receive data from partners, the ability to offer clean room access as a product (rather than requiring partners to share raw data), reduces friction in data partnerships. Partners are more willing to share data under privacy-preserving guarantees because their risk exposure is lower. More partnership data means better models, better attribution, and better business decisions.&lt;/p&gt;
&lt;p&gt;The investment in differential privacy primitives and clean room infrastructure also serves the organization&apos;s internal governance. The privacy accounting techniques used in clean rooms (tracking how much information is revealed by each query), are directly applicable to internal privacy governance for customer data. Organizations that build clean room expertise develop internal capabilities that improve their handling of first-party customer data.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Build Privacy-First Data Platforms&lt;/h3&gt;
&lt;p&gt;For comprehensive guidance on data governance, privacy-preserving architecture, and lakehouse design, pick up &lt;a href=&quot;https://www.amazon.com/dp/B0GQNY21TD&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI: A Hands-On Practitioner&apos;s Guide to Modern Data Architecture, Open Table Formats, and Agentic AI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Browse Alex&apos;s other data engineering and analytics books at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For governed multi-engine access to your Iceberg lakehouse with fine-grained column and row policies, try Dremio Cloud free at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Building Composable Query Engines with Rust Runtimes</title><link>https://iceberglakehouse.com/posts/2026-05-24-composable-query-engines/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-24-composable-query-engines/</guid><description>
For most of data engineering history, a query engine was a monolithic system. You picked a database or warehouse, and it owned everything from the SQ...</description><pubDate>Sun, 24 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;For most of data engineering history, a query engine was a monolithic system. You picked a database or warehouse, and it owned everything from the SQL parser through the disk I/O layer. The engine choice was also your storage choice, your catalog choice, and often your governance choice. Composability (the ability to mix and match components from different systems), was minimal.&lt;/p&gt;
&lt;p&gt;That design is being dismantled. Apache DataFusion provides an embeddable, modular query execution engine written in Rust. Meta&apos;s Velox provides a high-performance C++ execution kernel that plugs into Presto, Spark, and other systems. Substrait provides a cross-language plan representation format that lets query plans flow between different engines without recompilation or reparse. Apache Arrow provides the in-memory columnar format that eliminates serialization overhead when data moves between components.&lt;/p&gt;
&lt;p&gt;Together, these four projects define a stack where you can build a query engine the way you build a web application, assembling purpose-fit components rather than accepting a single vendor&apos;s implementation decisions at every layer.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Problem with Monolithic Engines&lt;/h2&gt;
&lt;p&gt;A monolithic query engine owns too much. Its SQL parser is tightly coupled to its catalog protocol. Its physical execution layer assumes specific memory management patterns. Its storage I/O uses proprietary file access abstractions. To add a new data source, you often need to implement a connector interface that is specific to that engine&apos;s internal API.&lt;/p&gt;
&lt;p&gt;This creates two expensive problems. First, every query engine team must solve the same problems: vectorized execution, predicate pushdown, partition pruning, join ordering. The implementations differ in detail but duplicate enormous amounts of engineering. Second, interoperability between engines requires serializing data to an intermediate format (usually Parquet files or Avro on S3), rather than sharing computation directly.&lt;/p&gt;
&lt;p&gt;The composable stack addresses both by separating concerns into standardized layers.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Composable Stack: Four Layers&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/composable-query-engines/composable-query-engine-stack.png&quot; alt=&quot;Layered architecture diagram showing composable query engine stack with application interface at top, parser and optimizer in second layer, Substrait plan exchange in the middle, DataFusion Rust and Velox C++ executors in fourth layer, Apache Arrow IPC below, and object store/Parquet/Iceberg at the bottom&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 1: The Query Interface.&lt;/strong&gt; The application presents queries as SQL strings or DataFrame API calls. This layer handles user-facing concerns: parse, validate column references, resolve types. It produces a logical plan, a tree of relational operators describing what to compute, not how.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 2: Optimization.&lt;/strong&gt; The optimizer transforms the logical plan into a physical plan. This is where join ordering, partition pruning, predicate pushdown, and scan selection happen. The optimizer is where most engine-specific intelligence lives. DataFusion implements a pluggable optimizer pipeline where custom rules can be inserted at each optimization pass.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 3: Plan Exchange via Substrait.&lt;/strong&gt; Substrait is a protobuf-based specification for relational algebra. A physical plan can be serialized to Substrait format and deserialized by a different engine. This enables query federation: part of a query can be executed by DataFusion (Rust), and another part can be offloaded to Velox (C++) or DuckDB, with the plan boundary expressed in Substrait.&lt;/p&gt;
&lt;p&gt;DataFusion supports Substrait as both a producer (it can serialize its physical plans to Substrait) and a consumer (it can accept Substrait plans from other systems and execute them). Velox supports Substrait as a consumer, meaning it can receive plans from DataFusion, Spark (via the Gluten plugin), or other producers and execute them using its C++ kernel.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 4: Execution.&lt;/strong&gt; The execution layer reads data, applies operators, and produces results. Both DataFusion and Velox use vectorized, columnar execution: data flows through the operator pipeline as batches of Arrow-format columns rather than row-by-row. This is the architecture that enables cache-friendly SIMD operations and high throughput on modern hardware.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The memory layer: Apache Arrow IPC.&lt;/strong&gt; Arrow&apos;s Inter-Process Communication format allows data to pass between processes (or components in the same process) as raw memory pointers to columnar buffers. No serialization, no copying. When a DataFusion component passes a batch to a Velox component in the same process, the data doesn&apos;t move at all, only the pointer does.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Apache DataFusion: The Rust-Native Embedded Engine&lt;/h2&gt;
&lt;p&gt;DataFusion is the component you choose when you&apos;re building a new data system in Rust and need a query engine that you can customize at every level. It&apos;s not a database you deploy, it&apos;s a library you embed.&lt;/p&gt;
&lt;p&gt;The key design properties:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pluggable table providers.&lt;/strong&gt; DataFusion&apos;s &lt;code&gt;TableProvider&lt;/code&gt; trait defines the interface for registering a data source. Any implementation of &lt;code&gt;TableProvider&lt;/code&gt; can be registered as a SQL table. This is how Iceberg support, Delta Lake support, and custom blob store readers plug into DataFusion-based systems.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pluggable execution operators.&lt;/strong&gt; The &lt;code&gt;ExecutionPlan&lt;/code&gt; trait defines the interface for a physical operator. Custom operators (specialized aggregation functions, ML inference operators, custom join algorithms), can be inserted into the execution pipeline.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Optimizer rule extensibility.&lt;/strong&gt; The optimizer runs a sequence of rule passes. Custom optimizer rules can be added to the pipeline to implement engine-specific optimizations that the default DataFusion implementation doesn&apos;t include.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-rust&quot;&gt;// Minimal DataFusion example: registering an Iceberg table and running SQL
use datafusion::prelude::*;
use iceberg_datafusion::IcebergTableProvider;

#[tokio::main]
async fn main() -&amp;gt; datafusion::error::Result&amp;lt;()&amp;gt; {
    let ctx = SessionContext::new();

    // Register an Iceberg table as a DataFusion source
    let iceberg_provider = IcebergTableProvider::try_new(
        &amp;quot;s3://my-bucket/iceberg/events/&amp;quot;
    ).await?;

    ctx.register_table(&amp;quot;events&amp;quot;, Arc::new(iceberg_provider))?;

    // Query using standard SQL
    let df = ctx.sql(
        &amp;quot;SELECT region, COUNT(*) as cnt FROM events WHERE event_date = &apos;2025-05-24&apos; GROUP BY region&amp;quot;
    ).await?;

    df.show().await?;
    Ok(())
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;DataFusion has real production users. dbt Fusion uses DataFusion for SQL compilation and plan analysis. InfluxDB IOx uses it as the query engine for InfluxDB&apos;s column-store backend. The Ballista distributed query engine uses DataFusion as its single-node execution layer.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Meta Velox: The C++ Execution Kernel&lt;/h2&gt;
&lt;p&gt;Velox is Meta&apos;s contribution to the composable stack. Where DataFusion targets teams building new systems from scratch in Rust, Velox targets teams with existing C++ or JVM-based systems who want a high-performance execution kernel without rewriting everything.&lt;/p&gt;
&lt;p&gt;Velox integrates as a native execution plugin for Presto at Meta and is available to the open-source community as a Spark accelerator through the Gluten project. When Gluten is used, Spark&apos;s logical plan is compiled to Velox&apos;s internal plan representation, and the actual computation executes in C++ rather than JVM bytecode. Benchmarks from Gluten-enabled Spark clusters show substantial throughput improvements for CPU-bound aggregation and join workloads.&lt;/p&gt;
&lt;p&gt;Velox also accepts Substrait plans, which means it can interoperate with DataFusion-produced plans for cross-system execution.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;DataFusion vs Velox: Choosing Your Foundation&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/composable-query-engines/datafusion-vs-velox-comparison.png&quot; alt=&quot;Side-by-side comparison of Apache DataFusion and Meta Velox across language, API, catalog, Substrait support, known users, and best-fit scenarios&quot;&gt;&lt;/p&gt;
&lt;p&gt;The choice between DataFusion and Velox is primarily a language and integration question, not a performance question. Both execute on Arrow-format batches with vectorized operations. Both are competitive on analytical workloads.&lt;/p&gt;
&lt;p&gt;Choose DataFusion if your team is building a new data system in Rust, you want memory safety guarantees at the execution layer, you need to embed a query engine in a library (not a service), or you want first-class Substrait producer support.&lt;/p&gt;
&lt;p&gt;Choose Velox if you&apos;re adding an acceleration layer to an existing JVM-based system (specifically Spark via Gluten), your team&apos;s core expertise is in C++, or you&apos;re operating within Meta&apos;s Presto ecosystem.&lt;/p&gt;
&lt;p&gt;For most new data platform projects in 2026, DataFusion is the default choice. The Rust ecosystem&apos;s library-first design, combined with DataFusion&apos;s extensive trait-based extensibility, makes it easier to build a new system on top of DataFusion than to integrate Velox into a stack that wasn&apos;t designed around it from the start.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;What This Means for Platform Engineers&lt;/h2&gt;
&lt;p&gt;The composable runtime stack doesn&apos;t require you to write a query engine to be useful. The practical implications are:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Understand what your query tools are built on.&lt;/strong&gt; When you&apos;re evaluating DuckDB (built on its own C++ engine), dbt Fusion (built on DataFusion), or a custom Rust data tool, knowing whether it uses DataFusion or Velox tells you about extensibility, memory characteristics, and interoperability potential.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Substrait enables federation without ETL.&lt;/strong&gt; If you need to route part of a query to one engine and part to another (for example, reading from an Iceberg table via DataFusion and passing results to a GPU acceleration layer) Substrait is the format that makes this possible without intermediate file writes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Arrow eliminates serialization overhead.&lt;/strong&gt; If two components in your pipeline both support Apache Arrow IPC, you can pass data between them without serialization. This is especially relevant for ML pipelines where query results feed directly into model inference.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The composable query engine stack (DataFusion for Rust-native execution, Velox for C++ execution, Substrait for plan portability, Arrow for zero-copy memory), represents a genuine architectural shift in how data systems are built. Monolithic engines are not disappearing, but the ability to assemble a custom engine from well-defined, independently-maintained components is now practical rather than theoretical.&lt;/p&gt;
&lt;p&gt;For teams building new data infrastructure, DataFusion is the most productive starting point. It&apos;s a mature library, actively maintained under the Apache Software Foundation, with production use cases that prove its execution model at scale.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Building an Embedded Analytics Engine with DataFusion&lt;/h2&gt;
&lt;p&gt;One of the most compelling use cases for DataFusion is embedding it directly in applications that need analytical query capability without the overhead of a separate service. This pattern is increasingly common for multi-tenant SaaS applications where each tenant needs SQL analytics against their own dataset.&lt;/p&gt;
&lt;p&gt;A minimal embedded analytics API built on DataFusion:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-rust&quot;&gt;use actix_web::{web, App, HttpServer, HttpResponse};
use datafusion::prelude::*;
use serde::{Deserialize, Serialize};
use std::sync::Arc;
use tokio::sync::RwLock;

// Shared DataFusion SessionContext (one per tenant in production)
type SharedContext = Arc&amp;lt;RwLock&amp;lt;SessionContext&amp;gt;&amp;gt;;

#[derive(Deserialize)]
struct QueryRequest {
    sql: String,
    tenant_id: String,
}

#[derive(Serialize)]
struct QueryResponse {
    rows: Vec&amp;lt;serde_json::Value&amp;gt;,
    row_count: usize,
    execution_time_ms: u64,
}

async fn execute_query(
    ctx: web::Data&amp;lt;SharedContext&amp;gt;,
    query: web::Json&amp;lt;QueryRequest&amp;gt;,
) -&amp;gt; HttpResponse {
    let start = std::time::Instant::now();

    // Get or create tenant context
    let session = ctx.read().await;

    // Execute SQL query
    match session.sql(&amp;amp;query.sql).await {
        Ok(df) =&amp;gt; {
            let batches = df.collect().await.unwrap_or_default();
            let rows = arrow_to_json(&amp;amp;batches);
            let elapsed = start.elapsed().as_millis() as u64;

            HttpResponse::Ok().json(QueryResponse {
                row_count: rows.len(),
                rows,
                execution_time_ms: elapsed,
            })
        }
        Err(e) =&amp;gt; HttpResponse::BadRequest().body(e.to_string()),
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This embedded pattern means the query engine starts and stops with the application process, scales horizontally with the application, and has zero network overhead for query execution, data stays in the process address space.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Ballista: DataFusion for Distributed Workloads&lt;/h2&gt;
&lt;p&gt;When single-node DataFusion reaches its limits (roughly when the dataset doesn&apos;t fit on one machine), Ballista extends DataFusion to a distributed execution model. Ballista uses the same physical plan representation as single-node DataFusion, but distributes plan fragments across worker nodes.&lt;/p&gt;
&lt;p&gt;The development workflow is the same: write DataFusion queries locally, test on small data, deploy to a Ballista cluster for large-scale execution. The API difference is creating a &lt;code&gt;BallistaContext&lt;/code&gt; instead of a &lt;code&gt;SessionContext&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-rust&quot;&gt;use ballista::prelude::*;

#[tokio::main]
async fn main() -&amp;gt; Result&amp;lt;()&amp;gt; {
    // Connect to Ballista scheduler
    let ctx = BallistaContext::remote(&amp;quot;localhost&amp;quot;, 50050, &amp;amp;BallistaConfig::new()).await?;

    // Register data sources - same API as local DataFusion
    ctx.register_parquet(&amp;quot;events&amp;quot;, &amp;quot;s3://my-bucket/events/**/*.parquet&amp;quot;, ParquetReadOptions::default()).await?;

    // Query executes distributed
    let df = ctx.sql(&amp;quot;SELECT date, SUM(amount) FROM events GROUP BY date ORDER BY date&amp;quot;).await?;
    df.show().await?;

    Ok(())
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Ballista is less mature than DataFusion itself and is still catching up to the DataFusion API surface. But for teams that are already building on DataFusion and need to scale out, Ballista provides a path that doesn&apos;t require adopting Spark.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Practical Impact on Query Performance&lt;/h2&gt;
&lt;p&gt;The composable stack&apos;s performance advantages are most visible in workloads that previously required data movement between systems.&lt;/p&gt;
&lt;p&gt;In a traditional architecture, a query that joins a Postgres table with an Iceberg table with a Redis lookup might require:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Export Postgres data to S3 as Parquet&lt;/li&gt;
&lt;li&gt;Load S3 Parquet into Spark&lt;/li&gt;
&lt;li&gt;Load Redis data into Spark&lt;/li&gt;
&lt;li&gt;Run the join in Spark&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;With DataFusion&apos;s pluggable table providers, all three sources can be registered as tables in the same SessionContext, and the join executes in a single Arrow-native pass with no intermediate files:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- All three sources queried in one statement, no data movement
SELECT
    u.user_id,
    u.email,
    e.purchase_count,
    r.loyalty_tier
FROM postgres_users u
JOIN iceberg_events e ON u.user_id = e.user_id
JOIN redis_loyalty r ON u.user_id = r.user_id
WHERE e.event_date = &apos;2025-05-24&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query processes data from Postgres, Iceberg, and Redis without materializing any intermediate dataset to disk. The Arrow IPC format enables zero-copy data passing between the table providers and the join executor. For queries that frequently need cross-source joins, the performance improvement over traditional ETL-then-join workflows is substantial.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Composable Data Ecosystem in Practice&lt;/h2&gt;
&lt;p&gt;The composable data engineering paradigm is not just about query engines. It extends across the entire data stack through a set of shared standards that enable independent components to interoperate.&lt;/p&gt;
&lt;p&gt;Apache Iceberg and Delta Lake provide the open table format layer, a standard way to represent tables that any engine can read. Apache Arrow provides the in-memory columnar format that engines use to exchange data without serialization overhead. Substrait provides a standard representation of query plans that different engines can exchange. These three specifications together make the &amp;quot;composable&amp;quot; vision concrete: separate storage, compute, and catalog components that can be mixed and matched without vendor lock-in.&lt;/p&gt;
&lt;p&gt;The practical result is that a team can build an architecture where Kafka ingests data to Iceberg via Flink, Spark performs complex transformations, DuckDB runs ad-hoc analyst queries, and Dremio serves BI tool SQL, all against the same underlying Iceberg tables, with no data movement between components.&lt;/p&gt;
&lt;p&gt;This composability also means that adopting a new tool doesn&apos;t require rebuilding the data stack. When LanceDB&apos;s multimodal capabilities became valuable for an ML team&apos;s embedding workload, they added it alongside the existing Iceberg infrastructure rather than replacing it. When DataFusion&apos;s embedded engine use case emerged for a lightweight API service, it could read the same Iceberg tables as the rest of the stack. Each new tool plugs in to the existing data layer through the open format.&lt;/p&gt;
&lt;p&gt;The organizational implication is significant. Composable architectures allow different teams to choose the query engine that fits their workload and skill set without creating data silos. The ML team uses Python and Polars. The analytics team uses DuckDB and SQL. The data platform team uses Spark for heavy transformation. All three teams share the same Iceberg tables. Coordination happens through data agreements and schema contracts, not through shared infrastructure choices.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Choosing a Query Engine: Decision Framework&lt;/h2&gt;
&lt;p&gt;Teams evaluating composable query engine options benefit from a structured decision framework rather than a feature comparison matrix. The right engine depends on the workload pattern, team skills, deployment environment, and operational constraints, not on which engine wins the TPC-DS benchmark on a specific hardware configuration.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Start with the access pattern.&lt;/strong&gt; Single-process SQL on datasets under 100 GB that fit on disk: DuckDB. Python-centric DataFrame transformations with ML integration: Polars. Complex stateful streaming with event-time semantics: Flink. Large-scale batch transformations with wide ecosystem support: Spark. Embedded query execution in an application or API: DataFusion. No single engine covers all of these patterns optimally.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Consider the operational context.&lt;/strong&gt; A team running Kubernetes already knows how to operate distributed JVM services; Spark or Flink is a natural fit. A team building serverless Python functions won&apos;t want to manage a Spark cluster. A startup with two data engineers can&apos;t afford the operational overhead of Milvus plus Kafka plus Spark, simpler tools that cover the same ground with less infrastructure are more appropriate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Account for ecosystem integrations.&lt;/strong&gt; Spark has the deepest catalog and connector integrations of any query engine in the open-source ecosystem. DuckDB has the fastest growing integration surface for ad-hoc analytics. DataFusion&apos;s Rust-native execution makes it the best choice when query execution must be embedded in a non-JVM service. Choose based on what already exists in your stack.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Build Deeper Expertise&lt;/h3&gt;
&lt;p&gt;For a comprehensive treatment of modern data architecture patterns, open table formats, and composable systems, pick up &lt;a href=&quot;https://www.amazon.com/dp/B0GQNY21TD&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI: A Hands-On Practitioner&apos;s Guide to Modern Data Architecture, Open Table Formats, and Agentic AI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Browse Alex&apos;s other data engineering and analytics books at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s query engine uses Apache Arrow as its core data format and is designed for multi-engine lakehouse access. Try it free at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Data Mesh After the Hype: What Actually Works</title><link>https://iceberglakehouse.com/posts/2026-05-24-data-mesh-after-hype/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-24-data-mesh-after-hype/</guid><description>
When Zhamak Dehghani published the original data mesh papers at Thoughtworks in 2019 and 2020, the response split sharply between organizations that ...</description><pubDate>Sun, 24 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;When Zhamak Dehghani published the original data mesh papers at Thoughtworks in 2019 and 2020, the response split sharply between organizations that saw it as a fundamental rethinking of data platform architecture and skeptics who viewed it as a repackaging of existing domain-driven design concepts applied to data teams.&lt;/p&gt;
&lt;p&gt;Both groups were partially right. The conceptual insight in data mesh (that the bottleneck in enterprise data platforms is organizational, not technical, and that treating data as a product published by domain teams addresses scaling problems that no amount of centralized engineering can solve), was valuable and largely correct. The implementation turned out to be significantly harder and more context-dependent than the original framing suggested.&lt;/p&gt;
&lt;p&gt;Three years of production data mesh implementations across organizations of various sizes have produced a clearer picture of what works, what doesn&apos;t, and where &amp;quot;data product thinking&amp;quot; delivers value without requiring a full organizational reorganization.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Four Principles, Revisited&lt;/h2&gt;
&lt;p&gt;Data mesh&apos;s four core principles are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Domain ownership:&lt;/strong&gt; Teams own their data, end-to-end, from ingestion through publication&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data as a product:&lt;/strong&gt; Domains publish data products with explicit quality contracts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Self-serve platform:&lt;/strong&gt; Infrastructure for building, publishing, and consuming data products is shared&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Federated governance:&lt;/strong&gt; Policy enforcement is distributed but consistent&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In practice, most organizations that have successfully adopted data mesh patterns have implemented principles 2 and 3 first, without requiring full domain ownership (principle 1) or complex federated governance mechanisms (principle 4). This partial adoption has delivered real value without the organizational disruption of a full mesh topology.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;What Domain Ownership Actually Requires&lt;/h2&gt;
&lt;p&gt;Full domain ownership (where a domain team handles their own ingestion, transformation, quality monitoring, and publication), requires those teams to have (or develop) data engineering competency. For organizations where data engineering is scarce, asking a sales team or a product team to also manage their Spark jobs and Iceberg table maintenance is unrealistic.&lt;/p&gt;
&lt;p&gt;The organizations that have made domain ownership work share two characteristics: they have a strong self-serve data platform (principle 3 genuinely delivers on its promise, so domain teams aren&apos;t managing infrastructure directly), and they have embedded or dedicated data engineers within domain teams who handle the technical implementation.&lt;/p&gt;
&lt;p&gt;For organizations without these conditions, &amp;quot;domain ownership&amp;quot; typically degrades to &amp;quot;domain teams declare what data they want published&amp;quot; while a central data engineering team does the actual implementation. This is a useful organizational pattern, but it&apos;s not what the original mesh architecture describes.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Data Products: The Most Adoptable Principle&lt;/h2&gt;
&lt;p&gt;The most universally valuable data mesh concept is treating datasets as products. A data product has:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Defined ownership (a team, with named contacts)&lt;/li&gt;
&lt;li&gt;Documented schema and semantics (what does each column mean?)&lt;/li&gt;
&lt;li&gt;SLA commitments (freshness, availability, quality thresholds)&lt;/li&gt;
&lt;li&gt;Access controls (who can read what)&lt;/li&gt;
&lt;li&gt;Discovery metadata (searchable in a catalog)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This discipline (treating published datasets as products rather than pipeline outputs), improves data quality and consumer trust regardless of whether the organization adopts the full mesh topology.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/data-mesh-after-hype/data-mesh-domain-ownership-topology.png&quot; alt=&quot;Data product thinking domain ownership model showing federated governance at top providing policies and standards to four domain teams (Sales, Marketing, Finance, Product), each publishing data products with SLA badges, consuming the self-serve data platform for capabilities, and sharing data products cross-domain&quot;&gt;&lt;/p&gt;
&lt;p&gt;The key governance artifact for data products is a quality contract:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;# data_product_contract.yaml
name: customer_360
owner: &amp;quot;data-engineering-platform@company.com&amp;quot;
domain: &amp;quot;customer&amp;quot;
version: &amp;quot;2.3.0&amp;quot;

schema:
  - field: customer_id
    type: STRING
    description: &amp;quot;Unique customer identifier (UUID format)&amp;quot;
    nullable: false
  - field: lifetime_value
    type: DECIMAL(18,2)
    description: &amp;quot;Cumulative purchase value in USD since account creation&amp;quot;
    nullable: true

quality_sla:
  freshness_minutes: 60 # Updated at most 60 minutes ago
  completeness_threshold: 0.99 # 99% of expected records present
  null_rate_threshold: # Column-level null rate limits
    customer_id: 0.0
    lifetime_value: 0.05

access_control:
  default_access: &amp;quot;INTERNAL&amp;quot;
  readers:
    - role: &amp;quot;analyst&amp;quot;
      filter: &amp;quot;region = current_user_attribute(&apos;region&apos;)&amp;quot;
    - role: &amp;quot;data_scientist&amp;quot;
      filter: null # Full access

discovery:
  tags: [&amp;quot;customer&amp;quot;, &amp;quot;crm&amp;quot;, &amp;quot;pii-contains&amp;quot;]
  lineage: &amp;quot;sourced from salesforce_sync + product_events&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When a data product has a contract like this, consumers know what they&apos;re getting. Engineers publishing the product have measurable SLAs to maintain. Monitoring tools can alert when the product violates its contract. This is a significant improvement over the typical enterprise data catalog experience where datasets exist but quality commitments are informal at best.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Bottleneck Data Mesh Actually Solves&lt;/h2&gt;
&lt;p&gt;The organizational problem data mesh addresses is the centralized data team bottleneck. When all data engineering work (ingestion, transformation, quality monitoring, publication), flows through a central team, that team becomes a bottleneck. Domain teams wait weeks for data pipeline requests. The central team lacks domain context and builds suboptimal transformations. Priority conflicts between domains are constant.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/data-mesh-after-hype/data-mesh-product-thinking-maturity.png&quot; alt=&quot;Data mesh maturity progression from centralized ETL teams at the bottom through domain ownership awareness, data products as output, federated governance, to pragmatic data mesh at the top, with Thoughtworks noting Steps 3-4 can be implemented without the full sequence&quot;&gt;&lt;/p&gt;
&lt;p&gt;Domain ownership addresses this by moving pipeline ownership to the teams with the domain context. A finance team that owns their own data pipelines can prioritize, design, and maintain those pipelines without waiting for a central team&apos;s ticket queue. The tradeoff is distributed responsibility, the central platform team maintains the self-serve infrastructure, while domain teams maintain their pipelines.&lt;/p&gt;
&lt;p&gt;For organizations with 50+ data pipelines spanning 10+ domains and a central data engineering team perpetually backlogged, the mesh topology is worth the organizational investment. For organizations with 10 pipelines and a data engineering team of five, a well-run centralized team with good domain partnership is more practical.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Self-Serve Platform: The Hard Part&lt;/h2&gt;
&lt;p&gt;Data mesh&apos;s least-examined principle (and the one that most implementations underinvest in), is the self-serve platform. For domain ownership to work without requiring each domain to have full-stack data engineering expertise, the platform must make it easy to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ingest data from common sources (Salesforce, Postgres, Kafka) without writing Spark jobs&lt;/li&gt;
&lt;li&gt;Transform data using SQL with managed orchestration (dbt on Airflow)&lt;/li&gt;
&lt;li&gt;Publish data products with automatic quality monitoring&lt;/li&gt;
&lt;li&gt;Register products in a searchable catalog with governance metadata&lt;/li&gt;
&lt;li&gt;Monitor pipeline health without deep infrastructure knowledge&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Building this platform is significant engineering work. Organizations that adopt domain ownership without investing in the self-serve platform discover that domain teams default to doing things the way they&apos;ve always done them (ad-hoc pipelines, no quality contracts, no catalog registration), because the &amp;quot;easy path&amp;quot; doesn&apos;t exist.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Pragmatic Data Product Thinking&lt;/h2&gt;
&lt;p&gt;The practical takeaway from three years of data mesh implementation experience is that the product mindset is the most valuable element, and it can be adopted incrementally:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Start with ownership:&lt;/strong&gt; Every dataset has a named owner who is accountable for its quality and freshness.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Add contracts:&lt;/strong&gt; Each published dataset has a written quality contract with freshness and completeness SLAs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Build discoverability:&lt;/strong&gt; Datasets are registered in a catalog with enough metadata for consumers to find and evaluate them.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enforce governance:&lt;/strong&gt; Access controls, audit logs, and lineage tracking are automatic, not manual.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This progression delivers data product value without requiring the full organizational topology change of a complete mesh. Teams that have never experienced domain ownership can start with ownership accountability and contracts, build the discipline, and expand from there.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Data mesh as a philosophy (treat data as a product, distribute ownership to domain teams, invest in self-serve infrastructure, federate governance), has proven valuable in organizations with the right scale and organizational conditions. As a rigid implementation requirement, it has proven impractical for smaller organizations and those without self-serve infrastructure investment.&lt;/p&gt;
&lt;p&gt;The durable insight is data product thinking: explicit ownership, quality contracts, discoverability, and governance. These disciplines improve data platform reliability and consumer trust regardless of whether the organization adopts a full mesh topology.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Common Failure Modes in Data Mesh Implementations&lt;/h2&gt;
&lt;p&gt;Organizations that attempted full data mesh adoption and struggled share common patterns. Understanding these failure modes helps set realistic expectations and avoid the most costly mistakes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Failure Mode 1: Adopting the topology without the platform.&lt;/strong&gt; Domain ownership requires domain teams to have the tools to build, test, and publish data pipelines without central team support. When organizations announce domain ownership without first building the self-serve platform, domain teams either revert to requesting help from the central team (recreating the bottleneck) or build ad-hoc pipelines without quality controls (creating new technical debt).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Failure Mode 2: Data product contracts without enforcement.&lt;/strong&gt; Writing a quality contract is easy. Monitoring compliance with the contract and alerting when products violate their SLAs requires tooling investment. Organizations that create contract documentation but don&apos;t build automated monitoring discover that contracts become stale and untrustworthy, undermining consumer confidence in the entire data product catalog.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Failure Mode 3: Federated governance without standards.&lt;/strong&gt; Federated governance means each domain sets their own policies within organizational bounds. Without clear organizational standards (what tags are required, what sensitivity classifications exist, what the audit log format is), domain policies diverge. Cross-domain data product consumption becomes complicated by inconsistent access patterns and incompatible metadata schemas.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Failure Mode 4: Mesh topology without domain data engineering capacity.&lt;/strong&gt; The hardest organizational constraint is finding engineers who have both data engineering skills and deep domain knowledge. Most organizations don&apos;t have enough of these people, and training existing engineers takes time. Rushing domain ownership before teams have the technical capacity produces low-quality pipelines with no quality monitoring.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Data Catalog as Connective Tissue&lt;/h2&gt;
&lt;p&gt;A searchable, accurate data catalog is the organizational glue that makes federated data products usable. Without it, data products exist in isolated silos, the finance team knows about the finance data products, the marketing team knows about their products, but cross-domain discovery requires personal relationships rather than tooling.&lt;/p&gt;
&lt;p&gt;Modern data catalogs like Datahub, Alation, Atlan, and Apache Atlas provide:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Semantic search:&lt;/strong&gt; Search for data products by business concept, not technical names. &amp;quot;Find data products containing customer lifetime value&amp;quot; should return relevant datasets without requiring knowledge of how the finance team named their &lt;code&gt;clv_90d&lt;/code&gt; column.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Lineage visualization:&lt;/strong&gt; Click on a data product in the catalog and see its upstream sources (which pipelines write to it) and downstream consumers (which dashboards, models, and pipelines read from it). This cross-domain lineage view is only possible if OpenLineage events from each domain&apos;s pipelines flow to the shared catalog.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ownership and contact information:&lt;/strong&gt; Every catalog entry shows the owning team and a contact mechanism. When a consumer discovers that a data product&apos;s quality has degraded, they know immediately who to contact.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Quality signals:&lt;/strong&gt; The catalog integrates data quality monitoring alerts, showing consumers whether a data product currently meets its SLA. A red quality indicator on a catalog entry tells consumers not to rely on the product until the owner resolves the underlying issue.&lt;/p&gt;
&lt;p&gt;Investing in the catalog before scaling domain ownership is one of the highest-leverage decisions an organization can make. It creates the shared vocabulary and discoverability infrastructure that makes cross-domain data product consumption possible.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Measuring Data Mesh Success&lt;/h2&gt;
&lt;p&gt;The organizational investment in data mesh should be measured against concrete outcomes. Useful metrics for evaluating data mesh maturity and effectiveness:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data product SLA compliance rate:&lt;/strong&gt; What percentage of data products meet their freshness and completeness SLAs? A well-functioning mesh should maintain &amp;gt;95% compliance across products. Declining compliance rates indicate either inadequate monitoring, insufficient domain team capacity, or over-committed SLAs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Time to data product:&lt;/strong&gt; How long does it take from &amp;quot;this domain needs a new data product&amp;quot; to &amp;quot;the product is published and available to consumers&amp;quot;? In a centralized team model, this is often measured in weeks (waiting for engineering capacity). In a functioning mesh, it should be days for typical products.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cross-domain consumption rate:&lt;/strong&gt; How many data products are consumed by domains other than their producer? Low cross-domain consumption suggests the catalog isn&apos;t surfacing relevant products, or that consumers don&apos;t trust products from other domains. High cross-domain consumption is evidence that the mesh is creating organizational value beyond siloed domain analytics.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Unattributed pipeline ownership:&lt;/strong&gt; The percentage of active pipelines without a named owner. As organizations scale, unmaintained pipelines accumulate. A mesh governance discipline should keep this near zero; every pipeline has an owner, and pipeline removal is a deliberate process.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Build a Modern Data Platform&lt;/h3&gt;
&lt;p&gt;For comprehensive guidance on data governance, lakehouse architecture, and agentic AI integration, pick up &lt;a href=&quot;https://www.amazon.com/dp/B0GQNY21TD&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI: A Hands-On Practitioner&apos;s Guide to Modern Data Architecture, Open Table Formats, and Agentic AI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Browse Alex&apos;s other data engineering and analytics books at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Dremio provides unified query access to your lakehouse data products with governance and performance. Try it free at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How dbt Fusion Reshapes Analytics Engineering</title><link>https://iceberglakehouse.com/posts/2026-05-24-dbt-fusion-analytics-engineering/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-24-dbt-fusion-analytics-engineering/</guid><description>
The dbt Core engine that analytics engineering teams have relied on since 2017 was built in Python at a time when the job of the tool was to template...</description><pubDate>Sun, 24 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The dbt Core engine that analytics engineering teams have relied on since 2017 was built in Python at a time when the job of the tool was to template SQL and run it against a warehouse. It worked well for that job. It also inherited the constraints of a text-template system: SQL was a string to be rendered, not code to be analyzed. The engine had no understanding of column references, type compatibility, or cross-model dependencies beyond the explicit &lt;code&gt;ref()&lt;/code&gt; calls that connected models in the DAG.&lt;/p&gt;
&lt;p&gt;dbt Fusion, launched as a public beta on May 28, 2025, is a ground-up rewrite of the dbt execution engine in Rust. It isn&apos;t a version update or a performance patch, it&apos;s a different execution model. SQL is now treated as an abstract syntax tree (AST) that the engine understands statically, before any query reaches the warehouse. The downstream effects of this architectural change touch everything from local development experience to CI pipeline cost.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Python Era: SQL as Text&lt;/h2&gt;
&lt;p&gt;In dbt Core, a model like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- models/fct_revenue.sql
select
    o.order_id,
    o.customer_id,
    c.region,
    o.amount as revenue
from {{ ref(&apos;stg_orders&apos;) }} o
join {{ ref(&apos;stg_customers&apos;) }} c on o.customer_id = c.id
where o.status = &apos;completed&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;is processed by a Jinja2 templating engine that substitutes &lt;code&gt;{{ ref(&apos;stg_orders&apos;) }}&lt;/code&gt; with the correct table name for the current target environment. The resulting SQL string is sent to the warehouse for execution. The Python process that manages this rendering has no understanding of SQL syntax; it can&apos;t tell you whether &lt;code&gt;o.customer_id&lt;/code&gt; and &lt;code&gt;c.id&lt;/code&gt; have compatible types, or whether &lt;code&gt;amount&lt;/code&gt; exists as a column in &lt;code&gt;stg_orders&lt;/code&gt;, without actually running the query.&lt;/p&gt;
&lt;p&gt;This means errors surface at runtime, after paying for warehouse execution. For a large project with hundreds of models, discovering that a renamed column broke three downstream models requires running the full pipeline, paying for compute, waiting for results, and only then seeing which models failed.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;dbt Fusion: SQL as First-Class Code&lt;/h2&gt;
&lt;p&gt;Fusion replaces the Jinja2-over-text approach with a genuine SQL compiler. The engine parses SQL into an AST, resolves column references across model dependencies, performs type checking, and reports errors locally, before any query reaches the warehouse.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/dbt-fusion-analytics-engineering/dbt-fusion-rust-lifecycle.png&quot; alt=&quot;dbt Core vs dbt Fusion lifecycle comparison showing Python text templates versus Rust AST compilation, with 30x faster project parsing and error detection shifting from runtime to author time&quot;&gt;&lt;/p&gt;
&lt;p&gt;What this enables:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Real-time error detection in VS Code.&lt;/strong&gt; The Fusion engine powers a Language Server Protocol (LSP) implementation. The official dbt VS Code extension uses this to underline type errors, unresolved column references, and dialect incompatibilities as you type, the same experience TypeScript developers have had for years. Analytics engineers no longer need to submit a job to the warehouse to find out if a column rename broke downstream models.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Column-aware autocomplete.&lt;/strong&gt; Because Fusion understands the schema of each model in the project, it can suggest valid column names in joins and &lt;code&gt;WHERE&lt;/code&gt; clauses. This eliminates a class of typo-induced bugs that previously required runtime discovery.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;30x faster project parsing.&lt;/strong&gt; dbt Labs reported up to 30x faster project parsing compared to dbt Core. For large projects with hundreds of models, this transforms the iteration loop. Test a single model change in seconds rather than waiting for a full project scan.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Zero Python dependency.&lt;/strong&gt; Fusion ships as a standalone Rust binary with no Python runtime requirement. This simplifies CI/CD pipeline setup (no virtual environment management), containerization (smaller images), and deployment to environments where managing Python versions is an operational burden.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;State-Aware Orchestration: The Cost Story&lt;/h2&gt;
&lt;p&gt;The most operationally significant Fusion feature for production environments is state-aware orchestration.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/dbt-fusion-analytics-engineering/dbt-fusion-state-aware-orchestration.png&quot; alt=&quot;dbt Fusion state-aware orchestration DAG comparison showing full rebuild executing all 6 models versus state-aware rebuild executing only 2 affected models with lower warehouse costs&quot;&gt;&lt;/p&gt;
&lt;p&gt;In dbt Core, running &lt;code&gt;dbt build&lt;/code&gt; triggers execution of every model in the project, or every model in a selected subset. If only &lt;code&gt;fct_orders.sql&lt;/code&gt; changed, the run still typically executes all downstream models to ensure consistency: &lt;code&gt;fct_orders&lt;/code&gt;, &lt;code&gt;dim_customers&lt;/code&gt;, &lt;code&gt;mart_revenue&lt;/code&gt;, &lt;code&gt;mart_churn&lt;/code&gt;. This costs warehouse compute for models whose logic didn&apos;t change.&lt;/p&gt;
&lt;p&gt;State-aware orchestration means Fusion tracks which models have actually changed (by diffing the compiled SQL AST, not the source file) and which upstream datasets have new data. It executes only the models that are affected by the change, not the entire downstream graph. In a project with hundreds of models, this can reduce CI run time and warehouse compute cost by an order of magnitude for common change patterns like updating a single staging model.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Run only models affected by changes since the last successful run
dbt build --select state:modified+
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h2&gt;What Doesn&apos;t Change&lt;/h2&gt;
&lt;p&gt;Fusion maintains the dbt authoring layer that analytics engineers already know. SQL files, YAML schema definitions, &lt;code&gt;ref()&lt;/code&gt; and &lt;code&gt;source()&lt;/code&gt; functions, Jinja macros; these all work the same way. Teams migrating from dbt Core don&apos;t rewrite their models. They install the Fusion binary and change the runtime.&lt;/p&gt;
&lt;p&gt;Adapter macro compatibility is the primary migration concern. Fusion&apos;s Rust core handles SQL parsing and compilation, but database-specific adapter macros (the code that translates generic dbt operations into warehouse-specific SQL) still use Python. Teams with heavily customized macros may encounter compatibility issues during migration that require testing before moving production environments to Fusion.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Development Workflow in Practice&lt;/h2&gt;
&lt;p&gt;The practical change for an analytics engineer&apos;s daily workflow looks like this:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Before Fusion:&lt;/strong&gt; Write SQL, run &lt;code&gt;dbt compile&lt;/code&gt; to check for Jinja errors, run &lt;code&gt;dbt run --select my_model&lt;/code&gt; against dev warehouse, check output, iterate. Each iteration requires a warehouse round-trip.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;With Fusion:&lt;/strong&gt; Write SQL, get real-time syntax and column error highlighting in VS Code without leaving the editor, run &lt;code&gt;dbt run --select my_model&lt;/code&gt; to validate end-to-end results. The first warehouse round-trip happens later in the loop; after local validation has already caught most errors.&lt;/p&gt;
&lt;p&gt;For teams running CI on every pull request, the state-aware rebuild eliminates full-project rebuild costs for targeted changes. A PR that updates one staging model no longer triggers a full project rebuild; it triggers only the affected downstream models.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;dbt Fusion is the biggest change to the dbt ecosystem since the introduction of the Semantic Layer. It resolves a design tension that has been present since dbt&apos;s origins: SQL is a typed, structured language being processed by a system that treated it as unstructured text.&lt;/p&gt;
&lt;p&gt;The Rust rewrite and static AST analysis make the feedback loop tighter, CI pipelines cheaper, and error discovery earlier. Teams still need to test Fusion compatibility with their specific adapter macros and warehouse configurations. But for the majority of dbt projects using standard patterns, Fusion represents a meaningful improvement to the analytics engineering experience.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The dbt Semantic Layer and MetricFlow&lt;/h2&gt;
&lt;p&gt;Alongside Fusion&apos;s execution changes, the dbt Semantic Layer has matured into a production-ready component for teams that want a governed metric layer above their warehouse models.&lt;/p&gt;
&lt;p&gt;MetricFlow (the SQL generation engine behind the dbt Semantic Layer), defines metrics as composable objects with defined dimensions, filters, and measures. A metric defined once in MetricFlow can be queried consistently across any downstream tool (Tableau, Looker, Mode, custom applications) without each tool reimplementing the aggregation logic.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;# models/metrics/fct_revenue.yml
metrics:
  - name: total_revenue
    label: Total Revenue
    description: Gross revenue from completed orders
    type: simple
    type_params:
      measure: revenue_amount
    filter: |
      {{ Dimension(&apos;status&apos;) }} = &apos;completed&apos;
    dimensions:
      - name: region
        type: categorical
      - name: order_date
        type: time
        type_params:
          time_granularity: day
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once defined, this metric is queryable through the dbt Semantic Layer API, with MetricFlow automatically generating the appropriate SQL for the target warehouse:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Query the semantic layer from a Python application
from dbt_semantic_interfaces.query_interface import SemanticLayerClient

client = SemanticLayerClient(
    environment_id=&amp;quot;your-env-id&amp;quot;,
    auth_token=&amp;quot;your-token&amp;quot;,
    host=&amp;quot;semantic-layer.cloud.getdbt.com&amp;quot;
)

# MetricFlow generates correct SQL automatically
results = client.query(
    metrics=[&amp;quot;total_revenue&amp;quot;],
    group_by=[&amp;quot;region&amp;quot;, &amp;quot;order_date&amp;quot;],
    where=&amp;quot;order_date &amp;gt;= &apos;2025-01-01&apos;&amp;quot;
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the governed alternative to every BI tool writing its own revenue calculation SQL, MetricFlow ensures that &amp;quot;total revenue&amp;quot; means the same thing regardless of which tool is asking the question.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;dbt Fusion with Apache Iceberg&lt;/h2&gt;
&lt;p&gt;The combination of dbt Fusion and Apache Iceberg Iceberg tables as dbt model targets is a configuration that several data teams have adopted for lakehouse analytics engineering.&lt;/p&gt;
&lt;p&gt;When dbt models write to Iceberg tables through adapters that support Iceberg (dbt-spark, dbt-trino, dbt-glue, and the newer dbt-iceberg experimental adapter), the benefits of Iceberg&apos;s table format (ACID transactions, schema evolution, time travel), apply to dbt model outputs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Incremental models with Iceberg:&lt;/strong&gt; Iceberg&apos;s merge-on-read and copy-on-write strategies map naturally to dbt&apos;s incremental materialization strategies. A dbt incremental model that appends new rows uses Iceberg&apos;s ACID append. A model that upserts uses Iceberg&apos;s MERGE statement support.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- dbt incremental model targeting an Iceberg table
{{ config(
    materialized=&apos;incremental&apos;,
    unique_key=&apos;order_id&apos;,
    on_schema_change=&apos;merge&apos;,
    file_format=&apos;iceberg&apos;,
    incremental_strategy=&apos;merge&apos;
) }}

SELECT
    order_id,
    customer_id,
    amount,
    status,
    updated_at
FROM {{ ref(&apos;stg_orders&apos;) }}
{% if is_incremental() %}
WHERE updated_at &amp;gt; (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Schema evolution without rebuilds:&lt;/strong&gt; Iceberg&apos;s schema evolution means adding a column to a dbt model doesn&apos;t require dropping and recreating the table. The new column is added to the Iceberg schema metadata, existing data files remain untouched, and the new column shows as NULL for historical rows until backfilled.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Testing Strategies for dbt Projects&lt;/h2&gt;
&lt;p&gt;dbt&apos;s native testing framework has expanded in 2025 to include more sophisticated data quality checks alongside the standard singular and generic tests.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Generic tests&lt;/strong&gt; check universal properties: &lt;code&gt;not_null&lt;/code&gt;, &lt;code&gt;unique&lt;/code&gt;, &lt;code&gt;accepted_values&lt;/code&gt;, &lt;code&gt;relationships&lt;/code&gt;. These should cover every model&apos;s primary key, every foreign key relationship, and every column with a fixed set of valid values.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;# schema.yml: comprehensive testing for a fact table
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref(&apos;dim_customers&apos;)
              field: customer_id
      - name: status
        tests:
          - accepted_values:
              values:
                [&amp;quot;pending&amp;quot;, &amp;quot;processing&amp;quot;, &amp;quot;completed&amp;quot;, &amp;quot;cancelled&amp;quot;, &amp;quot;refunded&amp;quot;]
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              inclusive: true
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Singular tests&lt;/strong&gt; express custom business logic that generic tests can&apos;t capture:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- tests/assert_revenue_positive.sql
-- Passes if result set is empty (no failing rows)
SELECT
    order_id,
    amount,
    &apos;Expected positive revenue for completed orders&apos; AS failure_reason
FROM {{ ref(&apos;fct_orders&apos;) }}
WHERE status = &apos;completed&apos;
  AND amount &amp;lt;= 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running the full test suite as part of CI with Fusion&apos;s state-aware execution means only tests for affected models run on each PR, dramatically reducing CI time for targeted changes.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Analytics Engineering Role in 2026&lt;/h2&gt;
&lt;p&gt;The tooling improvements in Fusion and the maturation of the dbt Semantic Layer have changed what it means to be an analytics engineer. Early dbt practitioners spent significant time debugging Jinja macro behavior, writing workarounds for SQL-as-string limitations, and waiting for CI pipelines to complete. The technical friction was constant.&lt;/p&gt;
&lt;p&gt;With Fusion, the development experience more closely resembles software engineering. Real-time error feedback in the IDE, fast local compilation, and state-aware CI runs change the feedback loop. The time between &amp;quot;I made a change&amp;quot; and &amp;quot;I know whether the change is correct&amp;quot; shrinks from minutes to seconds for most common changes.&lt;/p&gt;
&lt;p&gt;This shift frees analytics engineering time for higher-value work: designing better data models, defining metrics with precision in MetricFlow, improving test coverage, and documenting datasets so that downstream consumers (including AI assistants querying the semantic layer), can use them correctly.&lt;/p&gt;
&lt;p&gt;The semantic layer&apos;s role in this shift is particularly significant for AI use cases. A well-designed MetricFlow metric definition is not just useful for Tableau dashboards, it&apos;s the definition that an AI agent queries when it answers &amp;quot;what was total revenue this quarter?&amp;quot; If the metric is defined correctly in MetricFlow, the AI answer is grounded in the same calculation logic that powers every other downstream tool. If revenue logic is scattered across BI tool calculations and SQL transforms, AI answers will be inconsistent with the numbers analysts see in dashboards.&lt;/p&gt;
&lt;p&gt;Analytics engineering discipline (defining metrics in one place, testing every model, documenting every column), has always been valuable. In the AI-assisted analytics environment of 2026, it&apos;s load-bearing infrastructure.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;dbt Deployment Best Practices: Environments and Promotion&lt;/h2&gt;
&lt;p&gt;A production-grade dbt deployment requires at least three environments: development, staging, and production. Each environment has its own target database or schema, and models are promoted from development through staging to production after passing validation gates.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Development environment:&lt;/strong&gt; Each data engineer works in their own schema namespace. Fusion&apos;s state-aware CI only builds models affected by the current branch&apos;s changes, so developers get fast feedback without building the entire project. The development environment uses a limited dataset (either sample data or a subset of production), to keep build times fast.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Staging environment:&lt;/strong&gt; This is a full-scale environment that mirrors production data. CI runs against staging after every pull request merge to the main branch. Staging is where integration tests run, verifying that models produce expected row counts, that relationships between models hold, and that no source freshness violations exist.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Production environment:&lt;/strong&gt; Production runs on a schedule (typically every few hours for batch analytical models) and receives models only after they pass the full staging validation suite. Production dbt runs should emit lineage events (to OpenLineage or the catalog) and alert on failures through PagerDuty or Slack.&lt;/p&gt;
&lt;p&gt;The Fusion toolchain&apos;s partial parsing capability makes multi-environment deployments faster. When a model&apos;s upstream dependencies haven&apos;t changed, Fusion skips re-parsing those models during the compile step. For large dbt projects with hundreds of models, this reduces CI compile times from minutes to seconds for typical branch changes.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Go Further with Data Engineering&lt;/h3&gt;
&lt;p&gt;For comprehensive guidance on building reliable, governed data platforms, pick up &lt;a href=&quot;https://www.amazon.com/dp/B0GQNY21TD&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI: A Hands-On Practitioner&apos;s Guide to Modern Data Architecture, Open Table Formats, and Agentic AI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Browse Alex&apos;s other data engineering and analytics books at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For federated analytics with query acceleration across your dbt-modeled data, try Dremio Cloud free at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Using DuckDB and Polars to Query Iceberg Tables</title><link>https://iceberglakehouse.com/posts/2026-05-24-duckdb-polars-iceberg/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-24-duckdb-polars-iceberg/</guid><description>
Two years ago, DuckDB and Polars were single-process analytical tools with limited lakehouse integration. You could read Parquet files from S3 using ...</description><pubDate>Sun, 24 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Two years ago, DuckDB and Polars were single-process analytical tools with limited lakehouse integration. You could read Parquet files from S3 using either, but writing to a catalog-managed Iceberg table required Spark or Flink. That constraint has been removed.&lt;/p&gt;
&lt;p&gt;DuckDB 1.4 LTS, released in September 2025, shipped with Iceberg write support. Polars extended its streaming engine&apos;s sink capabilities to include Iceberg tables in 2026. Both tools now offer a complete read-write path to Iceberg tables managed by REST Catalogs like Apache Polaris, Nessie, and Amazon S3 Tables. DuckDB went further: by December 2025, the DuckDB-Wasm build included the Iceberg extension, enabling browser-based read and write access to Iceberg REST Catalogs with no backend server.&lt;/p&gt;
&lt;p&gt;This post covers what&apos;s actually different about these two tools and how to integrate both into a lakehouse workflow.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;DuckDB and Iceberg: What Changed in 1.4&lt;/h2&gt;
&lt;p&gt;DuckDB has supported reading Iceberg tables since earlier releases through its &lt;code&gt;iceberg&lt;/code&gt; extension. The 1.4 LTS release added write capability: INSERT operations create new Parquet files and commit new snapshots to the Iceberg catalog. The 1.4.2 patch extended this to DELETE and UPDATE operations, implemented using positional deletes (merge-on-read semantics).&lt;/p&gt;
&lt;p&gt;To connect to an Iceberg table through a REST Catalog:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Install and load the Iceberg extension
INSTALL iceberg;
LOAD iceberg;

-- Configure a REST catalog connection
CREATE SECRET iceberg_catalog (
    TYPE iceberg_rest,
    ENDPOINT &apos;https://my-polaris-catalog.example.com/api/catalog&apos;,
    CREDENTIAL &apos;Bearer my-oauth-token&apos;
);

-- Attach the catalog
ATTACH &apos;my_namespace&apos; AS my_lake (TYPE iceberg_rest, SECRET &apos;iceberg_catalog&apos;);

-- Query a table
SELECT * FROM my_lake.events WHERE event_date = &apos;2025-05-24&apos;;

-- Write to a table
INSERT INTO my_lake.events
SELECT * FROM read_parquet(&apos;s3://staging/events-2025-05-24/*.parquet&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The write constraint to be aware of: DuckDB implements updates and deletes using positional deletes rather than full row rewrites (copy-on-write). For tables receiving heavy mutation loads, this means delete files accumulate between compaction runs, the same issue described earlier for Iceberg V2 CDC pipelines. For append-heavy analytical tables where DuckDB&apos;s primary use case lies, this is a non-issue.&lt;/p&gt;
&lt;p&gt;DuckDB-Wasm&apos;s Iceberg integration is more architecturally novel. The browser build uses JavaScript&apos;s Fetch API to handle HTTP requests, meaning DuckDB-Wasm can communicate with Iceberg REST Catalog endpoints directly from a browser tab. This enables analytics dashboards and data exploration tools that run entirely client-side, with the browser reading Iceberg table metadata and Parquet data from S3 directly, without any server-side query layer.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Polars and Iceberg: The Streaming Sink&lt;/h2&gt;
&lt;p&gt;Polars approaches Iceberg differently. Rather than offering a full SQL-level catalog integration, Polars&apos; Iceberg support is centered on its LazyFrame and streaming engine.&lt;/p&gt;
&lt;p&gt;For reading:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import polars as pl

# Read an Iceberg table as a lazy frame
lf = pl.scan_iceberg(&amp;quot;s3://my-bucket/iceberg/events/&amp;quot;)

# Apply transformations lazily
result = lf.filter(
    pl.col(&amp;quot;event_date&amp;quot;) == &amp;quot;2025-05-24&amp;quot;
).select(
    [&amp;quot;user_id&amp;quot;, &amp;quot;event_type&amp;quot;, &amp;quot;amount&amp;quot;]
).sort(&amp;quot;amount&amp;quot;, descending=True)

# Collect (execute) locally
df = result.collect()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For writing through the streaming engine, Polars uses sink operations that allow it to process larger-than-memory datasets:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import polars as pl

# Stream-process a large dataset and sink to Iceberg
(
    pl.scan_parquet(&amp;quot;s3://staging/raw-events/**/*.parquet&amp;quot;)
    .filter(pl.col(&amp;quot;event_type&amp;quot;).is_in([&amp;quot;purchase&amp;quot;, &amp;quot;signup&amp;quot;]))
    .with_columns([
        pl.col(&amp;quot;ts&amp;quot;).dt.date().alias(&amp;quot;event_date&amp;quot;)
    ])
    .sink_iceberg(
        &amp;quot;s3://my-bucket/iceberg/events/&amp;quot;,
        mode=&amp;quot;append&amp;quot;
    )
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The streaming sink processes input in chunks, writing Parquet files incrementally rather than accumulating everything in memory before writing. This makes Polars a practical ETL engine for medium-scale data movement workloads where data exceeds available RAM but doesn&apos;t require the cluster-level parallelism of Spark or Flink.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Polars Cloud: From Local to Distributed&lt;/h2&gt;
&lt;p&gt;The major Polars development in 2025 is Polars Cloud, which extends local Polars execution to managed cloud infrastructure without requiring code changes.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/duckdb-polars-iceberg/duckdb-polars-local-to-remote-workflow.png&quot; alt=&quot;DuckDB and Polars local-to-remote execution workflow showing local development with both tools, decision gate on data size and multi-user needs, and remote execution via Polars Cloud or MotherDuck&quot;&gt;&lt;/p&gt;
&lt;p&gt;The pattern is a &lt;code&gt;ComputeContext&lt;/code&gt; that describes the cloud resources, combined with &lt;code&gt;.remote(ctx)&lt;/code&gt; chained onto a &lt;code&gt;LazyFrame&lt;/code&gt;. The same Polars code that runs locally against a sample dataset runs on cloud infrastructure against the full dataset by swapping the execution context:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import polars as pl
from polars_cloud import ComputeContext

# Define a cloud compute context
ctx = ComputeContext(
    provider=&amp;quot;aws&amp;quot;,
    cpu=64,
    memory_gb=256,
    region=&amp;quot;us-east-1&amp;quot;
)

# Same LazyFrame code as local development
result = (
    pl.scan_iceberg(&amp;quot;s3://my-data-lake/iceberg/events/&amp;quot;)
    .filter(pl.col(&amp;quot;revenue&amp;quot;) &amp;gt; 1000)
    .group_by(&amp;quot;region&amp;quot;)
    .agg(pl.sum(&amp;quot;revenue&amp;quot;))
    .remote(ctx)  # Execute remotely
    .collect()
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Polars Cloud also supports a distributed engine in open beta, which enables horizontal scaling across multiple machines for queries that don&apos;t fit a single node&apos;s memory even with streaming. The distributed engine automatically partitions work across worker nodes for aggregations and joins.&lt;/p&gt;
&lt;p&gt;MotherDuck provides an analogous capability for DuckDB: cloud-executed DuckDB with a hybrid execution model that can run part of a query locally and part remotely, optimizing network data movement for analytical queries against remote Iceberg tables.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Feature Comparison&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/duckdb-polars-iceberg/duckdb-vs-polars-feature-comparison.png&quot; alt=&quot;Feature comparison table showing DuckDB and Polars across query interface, language, Iceberg support, scale-up strategy, distributed option, best use case, and WASM browser support&quot;&gt;&lt;/p&gt;
&lt;p&gt;The tools serve complementary rather than competing use cases. DuckDB is the right choice when your team speaks SQL, when you need embedded analytics in an application, or when you want to explore Iceberg data from a notebook or browser without managing server infrastructure.&lt;/p&gt;
&lt;p&gt;Polars is the right choice when your primary artifacts are Python pipeline code, when you&apos;re building ML preprocessing pipelines that need to chain DataFrame operations with scikit-learn or PyTorch, or when you want a Rust-native execution engine with guaranteed memory safety properties.&lt;/p&gt;
&lt;p&gt;Both now support Iceberg as a first-class data store, which means you can build a lakehouse workflow where data lands in Iceberg via Flink or Spark ingestion, is queried and explored via DuckDB for ad-hoc analysis, and processed through Polars for feature engineering and ML training set generation, all using the same Iceberg table as the shared source of truth.&lt;/p&gt;
&lt;h2&gt;The Development Workflow: Local to Lakehouse&lt;/h2&gt;
&lt;p&gt;One of the most underappreciated aspects of both DuckDB and Polars is their role in the development workflow itself. Before a pipeline runs in production on Spark or Flink, engineers need to develop and test it against real data at a manageable scale. Both tools excel here.&lt;/p&gt;
&lt;p&gt;A common pattern is a staged development approach:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Local exploration (DuckDB):&lt;/strong&gt; Use DuckDB to explore the raw data, understand schemas, identify data quality issues, and prototype the transformations needed.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pipeline development (Polars):&lt;/strong&gt; Implement the transformation logic in Polars. Test it against a sample of the production data on a local machine.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Scale verification (Polars Cloud or MotherDuck):&lt;/strong&gt; Run the same code against the full production dataset on cloud infrastructure, without rewriting the pipeline for Spark.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Production deployment:&lt;/strong&gt; If the dataset grows to Spark/Flink scale, the Polars LazyFrame API provides clear semantics that map reasonably well to PySpark DataFrame operations, making migration manageable.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This workflow eliminates a significant development cost: the local development loop for Spark pipelines requires either running a local Spark cluster (expensive to set up and maintain) or submitting jobs to a remote cluster (slow iteration cycles). With DuckDB and Polars, the local development loop runs in seconds rather than minutes.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;DuckDB for Embedded Analytics and Browser Applications&lt;/h2&gt;
&lt;p&gt;DuckDB&apos;s embedding capabilities go well beyond notebook analytics. As a library that can be embedded in applications, DuckDB enables analytics patterns that were previously impractical.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Application-embedded analytics:&lt;/strong&gt; A Python web application can embed DuckDB and run complex aggregation queries against user-specific datasets without external service dependencies. This pattern is particularly useful for multi-tenant SaaS applications where each tenant&apos;s data is small enough to query locally but complex enough to require proper SQL analytics.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Browser-based analytics with DuckDB-Wasm:&lt;/strong&gt; The DuckDB-Wasm build, now with Iceberg extension support, enables analytics dashboards that run entirely in the browser. User data is loaded from S3 directly, and DuckDB executes analytical queries client-side. This eliminates the server-side query infrastructure for many dashboard use cases.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-javascript&quot;&gt;// Browser-based DuckDB-Wasm with Iceberg support
import * as duckdb from &amp;quot;@duckdb/duckdb-wasm&amp;quot;;

const db = await duckdb.createDuckDB({
  query: { castTimestampToDate: true },
});
const conn = await db.connect();

// Load the Iceberg extension
await conn.query(&amp;quot;INSTALL iceberg; LOAD iceberg;&amp;quot;);

// Configure catalog access
await conn.query(`
    CREATE SECRET iceberg_catalog (
        TYPE iceberg_rest,
        ENDPOINT &apos;https://catalog.example.com/api/catalog&apos;,
        CREDENTIAL &apos;Bearer ${userToken}&apos;
    );
`);

// Query directly from browser : no server required
const result = await conn.query(`
    SELECT region, SUM(amount) as total_revenue
    FROM iceberg_catalog.main.orders
    WHERE event_date &amp;gt;= &apos;2025-01-01&apos;
    GROUP BY region
    ORDER BY total_revenue DESC
`);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This client-side analytics pattern has real performance advantages for interactive dashboards. Users get sub-second query responses for exploratory analytics without waiting for a centralized query service to process their request.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Practical Patterns: DuckDB for Data Quality Profiling&lt;/h2&gt;
&lt;p&gt;One area where DuckDB shines specifically is data quality profiling during ingestion validation. Before writing to an Iceberg table, you can run statistical profiling queries in DuckDB to validate the incoming data meets quality thresholds:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Profile incoming data before writing to Iceberg
WITH stats AS (
    SELECT
        COUNT(*) AS total_rows,
        COUNT(*) FILTER (WHERE user_id IS NULL) AS null_user_ids,
        COUNT(*) FILTER (WHERE amount &amp;lt; 0) AS negative_amounts,
        MIN(event_date) AS earliest_date,
        MAX(event_date) AS latest_date,
        COUNT(DISTINCT user_id) AS unique_users
    FROM read_parquet(&apos;s3://staging/incoming/*.parquet&apos;)
)
SELECT
    total_rows,
    (null_user_ids::FLOAT / total_rows) AS null_rate,
    negative_amounts,
    earliest_date,
    latest_date,
    unique_users,
    CASE
        WHEN null_user_ids::FLOAT / total_rows &amp;gt; 0.01 THEN &apos;FAIL: null rate &amp;gt; 1%&apos;
        WHEN negative_amounts &amp;gt; 0 THEN &apos;FAIL: negative amounts found&apos;
        WHEN latest_date &amp;gt; CURRENT_DATE THEN &apos;FAIL: future dates found&apos;
        ELSE &apos;PASS&apos;
    END AS quality_check
FROM stats;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This lightweight profiling step, running in seconds on DuckDB before an Iceberg write, catches data quality issues that would otherwise corrupt the production table and require an expensive rollback and re-ingest.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Polars for ML Feature Preprocessing&lt;/h2&gt;
&lt;p&gt;Polars&apos; expression API is particularly well-suited for the feature engineering that precedes model training. The lazy evaluation model means you can define a complex feature pipeline and execute it efficiently in a single pass over the data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import polars as pl

# Define a feature engineering pipeline for a churn model
feature_pipeline = (
    pl.scan_iceberg(&amp;quot;s3://data-lake/iceberg/user_events/&amp;quot;)
    .filter(pl.col(&amp;quot;event_date&amp;quot;) &amp;gt;= pl.lit(&amp;quot;2024-01-01&amp;quot;))
    .with_columns([
        # Recency: days since last purchase
        (pl.lit(&amp;quot;2025-05-24&amp;quot;).str.to_date() - pl.col(&amp;quot;last_purchase_date&amp;quot;))
        .dt.total_days()
        .alias(&amp;quot;days_since_purchase&amp;quot;),

        # Frequency: purchases in last 30 days
        pl.col(&amp;quot;purchase_count_30d&amp;quot;).alias(&amp;quot;frequency&amp;quot;),

        # Monetary: average purchase value
        (pl.col(&amp;quot;total_spend_90d&amp;quot;) / pl.col(&amp;quot;purchase_count_90d&amp;quot;))
        .fill_nan(0.0)
        .alias(&amp;quot;avg_purchase_value&amp;quot;),

        # Engagement: session count last 7 days
        pl.col(&amp;quot;session_count_7d&amp;quot;).alias(&amp;quot;engagement&amp;quot;),
    ])
    .select([&amp;quot;user_id&amp;quot;, &amp;quot;days_since_purchase&amp;quot;, &amp;quot;frequency&amp;quot;, &amp;quot;avg_purchase_value&amp;quot;, &amp;quot;engagement&amp;quot;, &amp;quot;is_churned&amp;quot;])
    # Write features to training dataset
    .sink_iceberg(&amp;quot;s3://data-lake/iceberg/churn_features/&amp;quot;, mode=&amp;quot;overwrite&amp;quot;)
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The same pipeline runs locally against a sample for development and at full scale via Polars Cloud for production. No Spark job code, no cluster management, just Python and Polars.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;DuckDB-Wasm: Browser-Native Analytics Without a Backend&lt;/h2&gt;
&lt;p&gt;One of the more surprising directions in the DuckDB ecosystem is its WebAssembly (Wasm) build, a version of DuckDB that runs entirely in the browser without any server component.&lt;/p&gt;
&lt;p&gt;DuckDB-Wasm allows a web application to execute SQL queries against Parquet files or Iceberg tables stored in object storage directly from the user&apos;s browser. The query engine runs in a Web Worker (keeping the UI thread responsive), and results render in the browser without any data passing through a backend API. For analytics dashboards, internal reporting tools, and embedded BI use cases, this architecture eliminates the per-query compute cost and reduces infrastructure to just an object storage bucket.&lt;/p&gt;
&lt;p&gt;The practical limitation is that DuckDB-Wasm operates within browser memory constraints; typically 1-4 GB depending on the browser and device. For datasets that fit in memory, it&apos;s fast. For datasets that don&apos;t, the query must be restructured to use streaming or partitioned reads. DuckDB&apos;s Iceberg support in the Wasm build is still developing as of mid-2025, but the trajectory is toward full parity with the native binary&apos;s Iceberg capabilities.&lt;/p&gt;
&lt;p&gt;Several open-source observability and BI tools are already built on DuckDB-Wasm: Evidence, Observable Framework, and Rill all use DuckDB as their embedded query engine. The pattern of &amp;quot;ship the query engine with the application, not the data&amp;quot; is becoming a standard architecture for lightweight analytics tools.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;When to Scale Up: Recognizing the Limits of Single-Engine Processing&lt;/h2&gt;
&lt;p&gt;DuckDB and Polars are remarkable tools, but knowing when they&apos;ve reached their limits is as important as knowing how to use them.&lt;/p&gt;
&lt;p&gt;The practical signals that a workload has outgrown single-process analytics:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Query runtime exceeds operator patience.&lt;/strong&gt; If a DuckDB query takes more than 10-15 minutes, analysts stop waiting for results and start working around the tool. The threshold varies by use case, but slow iteration cycles destroy the value proposition of local analytics tools.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Memory exhaustion.&lt;/strong&gt; DuckDB spills to disk for out-of-memory conditions, but disk-backed operations are dramatically slower than in-memory ones. If a query consistently requires disk spill, it&apos;s consuming more I/O than a distributed system would use compute.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data size exceeds what fits in reasonable cloud storage in a single query path.&lt;/strong&gt; When the input data for a transformation is multiple terabytes, DuckDB and Polars&apos; sequential scan (even with parallel execution) can&apos;t match the parallelism of a distributed Spark or Dremio query that reads hundreds of partitions simultaneously from dozens of executors.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Team size creates contention.&lt;/strong&gt; DuckDB and Polars run in-process. When a team of 20 analysts all need to run queries simultaneously, a shared distributed warehouse (Redshift, Snowflake, Dremio), provides resource isolation and fair scheduling that single-process tools can&apos;t.&lt;/p&gt;
&lt;p&gt;The transition from local analytics to distributed infrastructure is not a failure of the local tools. It&apos;s a success signal, the platform has grown to the scale where distributed compute investment pays off. DuckDB and Polars remain valuable at that scale too, in their appropriate roles: DuckDB for developer-local exploration and testing, Polars for Python-based feature engineering pipelines that run as Kubernetes jobs, and both as components in larger orchestrated workflows.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The ecosystem has converged on Iceberg as the shared table format that connects different processing tools. DuckDB&apos;s 1.4 LTS and DuckDB-Wasm Iceberg support, combined with Polars&apos; streaming sink and Polars Cloud, complete the path from data exploration to cloud-scale execution using the same open table format.&lt;/p&gt;
&lt;p&gt;The practical guidance: use DuckDB for SQL-centric exploration, ad-hoc analytics, data quality profiling, embedded analytics, and browser applications. Use Polars for Python pipeline code that transforms and moves data, particularly in data science and ML feature engineering workflows. Neither requires you to spin up a Spark cluster for tasks at the scale where a well-tuned single process or small cloud cluster handles the job.&lt;/p&gt;
&lt;p&gt;Both tools share a commitment to Apache Arrow as their in-memory columnar format. This means data can be passed between DuckDB and Polars without serialization overhead, a DuckDB query result becomes a Polars DataFrame directly through Arrow&apos;s zero-copy interface. Combined with shared Iceberg table access, the two tools form a coherent local analytics toolkit that scales gracefully to cloud infrastructure when workload demands grow.&lt;/p&gt;
&lt;h3&gt;Explore Further&lt;/h3&gt;
&lt;p&gt;For a comprehensive guide to lakehouse architecture and the Iceberg ecosystem, pick up &lt;a href=&quot;https://www.amazon.com/dp/B0GQNY21TD&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI: A Hands-On Practitioner&apos;s Guide to Modern Data Architecture, Open Table Formats, and Agentic AI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Browse Alex&apos;s other data engineering and analytics books at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For multi-engine lakehouse access with query acceleration across your Iceberg tables, try Dremio Cloud free at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>FinOps for Data Warehouses with Open Billing Data</title><link>https://iceberglakehouse.com/posts/2026-05-24-finops-warehouse-cost/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-24-finops-warehouse-cost/</guid><description>
Warehouse costs are the most visible and most contentious line item on a data platform&apos;s budget. Every query is metered. Every dashboard refresh cost...</description><pubDate>Sun, 24 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Warehouse costs are the most visible and most contentious line item on a data platform&apos;s budget. Every query is metered. Every dashboard refresh costs something. Engineering leaders who can&apos;t explain where costs are coming from can&apos;t make informed decisions about where to cut, where to invest, or how to set fair internal budgets by team.&lt;/p&gt;
&lt;p&gt;The problem has been interoperability. Snowflake exposes cost data in its own schema format. BigQuery provides cost information through the &lt;code&gt;JOBS_BY_PROJECT&lt;/code&gt; view and billing export to BigQuery. AWS surfaces it through Cost Explorer and billing exports. None of these use a common format, which means building a unified view requires custom ETL jobs for each provider, jobs that break when providers change their export schemas.&lt;/p&gt;
&lt;p&gt;The FOCUS specification (FinOps Open Cost and Usage Specification), addresses this by defining a standard schema for cloud and SaaS billing data. FOCUS 1.3, ratified in December 2025, added shared cost allocation, contract commitment datasets, and data recency signals. It&apos;s the first version of the spec that makes warehouse FinOps across multiple providers genuinely tractable.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;What FOCUS 1.3 Adds&lt;/h2&gt;
&lt;p&gt;The core FOCUS schema normalizes cloud billing across providers into a common set of fields: &lt;code&gt;BilledCost&lt;/code&gt;, &lt;code&gt;EffectiveCost&lt;/code&gt;, &lt;code&gt;ResourceId&lt;/code&gt;, &lt;code&gt;ServiceName&lt;/code&gt;, &lt;code&gt;SubAccountId&lt;/code&gt;, and &lt;code&gt;Tags&lt;/code&gt;. Every provider that implements FOCUS maps its billing data to these columns, allowing the same SQL queries to work across AWS, Azure, GCP, and SaaS providers that export FOCUS-formatted data.&lt;/p&gt;
&lt;p&gt;FOCUS 1.3 extends this with three important additions:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Shared cost allocation.&lt;/strong&gt; Previous FOCUS versions let you see costs per resource. 1.3 adds allocation columns that show how shared costs are split across workloads, the methodology behind the split, not just the result. For warehouse teams running shared compute across multiple user groups, this is the difference between &amp;quot;we spent $20K on shared virtual warehouses&amp;quot; and &amp;quot;here&apos;s how that $20K maps to each team&apos;s usage using the provider&apos;s allocation algorithm.&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Contract commitment datasets.&lt;/strong&gt; A separate dataset tracks committed-use contracts, reservation start and end dates, committed quantities, remaining units, and contract descriptions. This makes it possible to track how much of a committed purchase is actually consumed versus wasted, and to attribute waste to specific allocation decisions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data recency and completeness signals.&lt;/strong&gt; New metadata fields indicate when the billing dataset was last updated and whether it&apos;s complete. This prevents common cost attribution errors where a reporting pipeline runs against incomplete billing data and produces partial results that mislead budget holders.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Building a Warehouse FinOps Pipeline&lt;/h2&gt;
&lt;p&gt;The practical architecture for multi-warehouse FinOps normalizes each provider&apos;s billing data into FOCUS format, loads it into a FinOps mart, and builds chargeback and budget reporting on top.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/finops-warehouse-cost/focus-warehouse-finops-pipeline.png&quot; alt=&quot;FOCUS-based warehouse FinOps pipeline showing Snowflake Query Cost API, BigQuery JOBS_BY_PROJECT view, and AWS Cost Explorer API all flowing into FOCUS 1.3 normalization layer, then to warehouse FinOps mart for chargeback dashboard and budget alerts&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Snowflake cost ingestion:&lt;/strong&gt; Snowflake provides cost data through the &lt;code&gt;QUERY_ATTRIBUTION_HISTORY&lt;/code&gt; view (query-level costs), &lt;code&gt;METERING_HISTORY&lt;/code&gt; (virtual warehouse consumption by hour), and &lt;code&gt;RESOURCE_MONITOR_HISTORY&lt;/code&gt; (resource monitor usage against limits). For FOCUS normalization:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Snowflake FOCUS normalization query
SELECT
    start_time::DATE                                    AS ChargePeriodStart,
    end_time::DATE                                      AS ChargePeriodEnd,
    &apos;Snowflake&apos;                                         AS ServiceProvider,
    &apos;Compute&apos;                                           AS ServiceName,
    warehouse_name                                      AS ResourceId,
    credits_used * :credit_cost_usd                     AS BilledCost,
    credits_used * :credit_cost_usd                     AS EffectiveCost,
    OBJECT_CONSTRUCT(
        &apos;team&apos;, warehouse_tags:team::STRING,
        &apos;project&apos;, warehouse_tags:project::STRING
    )                                                   AS Tags
FROM snowflake.account_usage.metering_history
WHERE start_time &amp;gt;= :start_date
  AND start_time &amp;lt; :end_date;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;BigQuery cost ingestion:&lt;/strong&gt; BigQuery&apos;s &lt;code&gt;INFORMATION_SCHEMA.JOBS_BY_PROJECT&lt;/code&gt; view provides per-query cost estimates using &lt;code&gt;total_bytes_billed&lt;/code&gt; and the project&apos;s pricing tier. For chargeback, labels applied to queries or jobs serve as the team and project tags:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- BigQuery FOCUS normalization query
SELECT
    DATE(creation_time)                                   AS ChargePeriodStart,
    DATE(end_time)                                        AS ChargePeriodEnd,
    &apos;Google Cloud&apos;                                        AS ServiceProvider,
    &apos;BigQuery Compute&apos;                                    AS ServiceName,
    project_id                                            AS ResourceId,
    ROUND(total_bytes_billed / POW(10, 12) * 6.25, 4)   AS BilledCost,
    labels[&apos;team&apos;]                                        AS team_tag,
    labels[&apos;project&apos;]                                     AS project_tag
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time &amp;gt;= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = &apos;QUERY&apos;
  AND state = &apos;DONE&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h2&gt;Cost Attribution: The Tagging Problem&lt;/h2&gt;
&lt;p&gt;The most common failure mode in warehouse FinOps is unattributed queries, queries that run without metadata indicating which team or project owns them. As data platform usage grows, the fraction of unattributed costs tends to increase unless tagging is actively enforced.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/finops-warehouse-cost/warehouse-cost-attribution-by-team.png&quot; alt=&quot;Stacked bar chart showing warehouse cost attribution by team from January to May with unattributed queries growing from 15% to 25% of total cost, exceeding budget threshold in April&quot;&gt;&lt;/p&gt;
&lt;p&gt;The remediation is session-level tagging. In Snowflake, this means setting query tags at the session level for all tooling that runs queries:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Set query tag at session start (for Airflow, dbt, or custom tools)
ALTER SESSION SET QUERY_TAG = &apos;{&amp;quot;team&amp;quot;: &amp;quot;analytics_engineering&amp;quot;, &amp;quot;project&amp;quot;: &amp;quot;weekly_revenue_report&amp;quot;, &amp;quot;environment&amp;quot;: &amp;quot;production&amp;quot;}&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In BigQuery, job labels serve the same purpose. Any query submitted through the BigQuery API can include labels:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Python BigQuery client with labels for cost attribution
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    labels={
        &amp;quot;team&amp;quot;: &amp;quot;data_science&amp;quot;,
        &amp;quot;project&amp;quot;: &amp;quot;churn_model_training&amp;quot;,
        &amp;quot;environment&amp;quot;: &amp;quot;production&amp;quot;
    }
)

query_job = client.query(
    &amp;quot;SELECT * FROM analytics.training_features LIMIT 1000&amp;quot;,
    job_config=job_config
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Enforcing tagging at the framework level (in Airflow operators, dbt profiles, and internal query runners), produces consistent attribution without requiring individual analysts to remember to set tags manually.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Chargeback vs Showback&lt;/h2&gt;
&lt;p&gt;Showback and chargeback serve different organizational purposes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Showback&lt;/strong&gt; presents cost data to teams without billing them directly. Teams can see their consumption and compare it against budgets, but costs are absorbed by a central platform budget. Showback is appropriate for platforms where granular internal billing creates more friction than value, or where pricing complexity makes it difficult to fairly allocate shared resources.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Chargeback&lt;/strong&gt; bills teams directly for their consumption, either through internal transfers or budget adjustments. Chargeback creates accountability but requires careful handling of shared resources (warehouses, storage) where individual query attribution is imprecise.&lt;/p&gt;
&lt;p&gt;FOCUS 1.3&apos;s shared cost allocation methodology fields support chargeback by documenting how shared costs are split, which matters when teams dispute allocations. Being able to show that $5K of shared compute was allocated to a team based on their percentage of query hours, using a documented methodology, is more defensible than showing a number without explanation.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Commitment Discounts and Reserved Capacity Management&lt;/h2&gt;
&lt;p&gt;Most data warehouse providers offer commitment-based pricing that significantly reduces per-query or per-hour costs in exchange for minimum spend commitments. Snowflake&apos;s pre-purchased credits, Google BigQuery&apos;s flat-rate reservations, and AWS Athena&apos;s capacity reservations all operate on this model. Managing these commitments efficiently is one of the highest-leverage FinOps activities for mature data platforms.&lt;/p&gt;
&lt;p&gt;The challenge with commitment management is utilization. An organization that commits to $50K/month of Snowflake credits to access a 30% discount but only uses $35K of those credits is paying a 43% premium on its actual consumption. The discount evaporates if the commitment isn&apos;t fully consumed.&lt;/p&gt;
&lt;p&gt;FOCUS 1.3&apos;s contract commitment dataset tracks committed capacity against actual utilization, enabling a commitment health dashboard:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Commitment utilization rate:&lt;/strong&gt; Actual usage divided by committed quantity for the current period. Below 85% triggers investigation. Below 75% triggers a commitment renegotiation review.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Days remaining in commitment period:&lt;/strong&gt; How much time remains to consume the committed credits before the period ends.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Burn rate:&lt;/strong&gt; At the current daily consumption rate, will the commitment be consumed by period end?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For FinOps teams managing multiple warehouse commitments, a simple weekly report on these three metrics for each contract provides early warning before a period ends with significant unused commitment.&lt;/p&gt;
&lt;p&gt;The strategic decision is matching commitment size to anticipated usage with a safety margin. Committing to 90% of expected usage (rather than 100%) protects against consumption shortfalls at the cost of slightly higher per-unit pricing on the remaining 10%. Most organizations find that the risk-adjusted value of this buffer exceeds the cost savings of fully committing.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The FinOps Culture Problem&lt;/h2&gt;
&lt;p&gt;Technology is the easier half of warehouse FinOps. The harder half is organizational: creating a culture where teams are aware of and accountable for their data infrastructure costs.&lt;/p&gt;
&lt;p&gt;FinOps culture breaks down at two common failure points. The first is when showback data reaches teams that have never been aware of infrastructure costs and the immediate response is confusion rather than action, &amp;quot;we generated $30K in warehouse costs last month&amp;quot; without context about whether that&apos;s good, bad, expected, or avoidable. The second is when chargeback creates political conflict rather than shared accountability, particularly when teams feel that cost allocations are unfair or opaque.&lt;/p&gt;
&lt;p&gt;Building a successful FinOps culture requires three investments beyond the technical pipeline:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cost awareness education:&lt;/strong&gt; Teams that own data pipelines need enough context to interpret their cost reports. What does a BigQuery byte processed actually cost? What makes a query expensive? What&apos;s the difference between a cached result and a full scan? This doesn&apos;t require deep technical training, a one-hour workshop for analysts and data engineers on &amp;quot;how your queries turn into dollars&amp;quot; dramatically improves the quality of cost-aware behavior.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Shared optimization incentives:&lt;/strong&gt; If engineering teams are charged for warehouse costs but have no mechanism to benefit from reducing them, the rational response is to treat it as a fixed overhead and move on. Creating a shared savings model (where teams that reduce their attributed costs keep a portion of the savings in their platform budget), aligns engineering incentives with cost efficiency.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Executive visibility:&lt;/strong&gt; FinOps programs that exist only in platform team dashboards don&apos;t change organizational behavior. Monthly cost reporting that reaches department heads, with clear attribution to teams and projects, creates the organizational pressure for cost accountability that no internal platform campaign can generate alone.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The FOCUS 1.3 specification provides the interoperability layer that makes multi-cloud and multi-warehouse FinOps practical. Combined with native warehouse cost views in Snowflake and BigQuery, it enables a real-time cost attribution pipeline that doesn&apos;t require custom ETL per provider.&lt;/p&gt;
&lt;p&gt;The operational priority is tagging discipline. A technically excellent FOCUS normalization pipeline produces limited value if 25% of queries run without attribution metadata. Enforce session-level tagging in every framework that touches the warehouse, validate it in CI, and monitor the unattributed fraction as a platform health metric.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Automated Cost Optimization: Resource Monitors and Budget Alerts&lt;/h2&gt;
&lt;p&gt;Monitoring costs after the fact is useful for reporting but not for controlling spending. Automated budget enforcement prevents runaway costs before they accumulate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Snowflake Resource Monitors&lt;/strong&gt; allow administrators to set credit limits per virtual warehouse or account, with configurable actions when thresholds are reached:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Create a resource monitor for an analytics team&apos;s warehouse
CREATE RESOURCE MONITOR analytics_team_monitor
    WITH CREDIT_QUOTA = 500  -- 500 credits per month
    TRIGGERS ON 75 PERCENT DO NOTIFY
    TRIGGERS ON 90 PERCENT DO NOTIFY
    TRIGGERS ON 100 PERCENT DO SUSPEND;

-- Apply to a warehouse
ALTER WAREHOUSE analytics_warehouse
    SET RESOURCE_MONITOR = analytics_team_monitor;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When the analytics team reaches 75% of their monthly credit budget, the monitor sends a notification. At 100%, the warehouse is automatically suspended until manually resumed or the next billing period. This prevents a runaway dbt job or an analyst&apos;s inefficient query from exhausting the entire month&apos;s budget in a week.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;BigQuery Scheduled Queries for Budget Alerts&lt;/strong&gt; use the INFORMATION_SCHEMA to monitor burn rate in near-real-time:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- BigQuery: daily cost monitoring with burn rate projection
WITH daily_costs AS (
    SELECT
        DATE(creation_time) AS query_date,
        labels[&apos;team&apos;] AS team,
        SUM(total_bytes_billed) / POW(10, 12) * 6.25 AS daily_cost_usd
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time &amp;gt;= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    GROUP BY 1, 2
),
team_burn_rate AS (
    SELECT
        team,
        AVG(daily_cost_usd) AS avg_daily_cost,
        -- Project monthly cost based on last 7 days
        AVG(daily_cost_usd) FILTER (WHERE query_date &amp;gt;= DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)) * 30 AS projected_monthly_cost
    FROM daily_costs
    GROUP BY team
)
SELECT
    team,
    avg_daily_cost,
    projected_monthly_cost,
    CASE
        WHEN projected_monthly_cost &amp;gt; team_budget_usd * 0.9 THEN &apos;ALERT: Near budget limit&apos;
        WHEN projected_monthly_cost &amp;gt; team_budget_usd * 0.7 THEN &apos;WARNING: 70% of budget on track&apos;
        ELSE &apos;OK&apos;
    END AS budget_status
FROM team_burn_rate
JOIN team_budgets USING (team);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Scheduling this query to run hourly and alerting when &lt;code&gt;budget_status = &apos;ALERT&apos;&lt;/code&gt; provides proactive budget management that catches overspend early enough to take corrective action.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Cost Efficiency Metrics: Beyond Total Spend&lt;/h2&gt;
&lt;p&gt;Total spend is a useful metric but an incomplete one. A team that doubled their query volume while keeping costs flat has improved efficiency. A team that cut their queries in half but costs stayed the same has a performance problem.&lt;/p&gt;
&lt;p&gt;Cost efficiency metrics provide the denominator that makes spend numbers meaningful:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cost per query:&lt;/strong&gt; Total warehouse cost divided by number of queries. Declining cost per query indicates that query optimization, caching, or materialization is working.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cost per business outcome:&lt;/strong&gt; For analytical teams, this might be cost per report delivered, cost per dashboard view, or cost per data product refresh. This connects infrastructure spending to business value.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cache hit rate:&lt;/strong&gt; Dremio&apos;s Reflections, Snowflake result caches, and BigQuery BI Engine all provide query acceleration through caching and materialization. A high cache hit rate means the same compute is serving more queries. Track cache hit rate as a cost efficiency indicator.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Query efficiency ratio:&lt;/strong&gt; Bytes processed divided by bytes returned. High ratios (processing much more than returned) indicate opportunities for partition pruning, materialization, or query optimization. Snowflake and BigQuery both expose this ratio in their query metadata views.&lt;/p&gt;
&lt;p&gt;Building a simple cost efficiency dashboard (cost per query over time, cache hit rate, bytes processed ratio), gives platform teams the signal they need to identify optimization opportunities before they pursue spending cuts that might affect analytics quality.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Build a Financially Accountable Data Platform&lt;/h3&gt;
&lt;p&gt;For comprehensive guidance on modern data architecture, governance, and cost management, pick up &lt;a href=&quot;https://www.amazon.com/dp/B0GQNY21TD&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI: A Hands-On Practitioner&apos;s Guide to Modern Data Architecture, Open Table Formats, and Agentic AI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Browse Alex&apos;s other data engineering and analytics books at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Dremio provides unified query access across your lakehouse with query reflection caching that reduces warehouse compute costs. Try it free at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Designing Governed RAG on Data Products</title><link>https://iceberglakehouse.com/posts/2026-05-24-governed-rag-data-products/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-24-governed-rag-data-products/</guid><description>
The first generation of enterprise RAG deployments had a serious trust problem. Organizations gave AI assistants access to the data warehouse (or to ...</description><pubDate>Sun, 24 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The first generation of enterprise RAG deployments had a serious trust problem. Organizations gave AI assistants access to the data warehouse (or to a vector store filled with documents scraped from internal wikis and Confluence) and discovered that the answers came back authoritative-sounding but frequently wrong, stale, or based on data the querying user wasn&apos;t supposed to see.&lt;/p&gt;
&lt;p&gt;The &amp;quot;give the model warehouse access&amp;quot; approach conflates two separate problems: retrieval (finding relevant context) and governance (ensuring the retrieved context is accurate, fresh, and appropriate for the user). When these problems aren&apos;t separated architecturally, you get an AI system that confidently answers questions using data it shouldn&apos;t have accessed, or that retrieves stale snapshots from a document store that hasn&apos;t been updated in six months.&lt;/p&gt;
&lt;p&gt;Governed RAG on data products separates these concerns. The retrieval layer enforces access policies before context reaches the model. Retrieved data comes from data products with defined SLAs and freshness guarantees, not from unstructured document dumps. And a semantic layer ensures that structured data queries generate consistent, policy-compliant SQL rather than ad-hoc queries that might bypass governance controls.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Architecture&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/governed-rag-data-products/governed-rag-architecture.png&quot; alt=&quot;Governed RAG architecture showing user/AI agent flowing through query rewrite and intent classification, splitting into retrieval from vector store via data product catalog and policy check, and structured data query via policy check and semantic layer, both assembling context for LLM generation with governance layer alongside&quot;&gt;&lt;/p&gt;
&lt;p&gt;The architecture has three layers:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The retrieval layer&lt;/strong&gt; handles unstructured context: documents, policies, runbooks, product documentation. It uses a vector store indexed from data products (governed, SLA-backed datasets), rather than open-ended document crawls. Access policies at the retrieval layer filter returned chunks to those the requesting user is authorized to see.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The structured query layer&lt;/strong&gt; handles precise, quantitative questions that require SQL: &amp;quot;what was our revenue in Q1?&amp;quot;, &amp;quot;how many active users do we have?&amp;quot;, &amp;quot;what&apos;s the churn rate by region?&amp;quot;. This layer routes through a semantic layer (dbt Semantic Layer, Snowflake Cortex Analyst, or similar) that generates deterministic SQL using governed metric definitions, not ad-hoc LLM-generated SQL against raw schema.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The governance layer&lt;/strong&gt; enforces access control at both retrieval paths. A user asking about EMEA revenue who only has access to EMEA data gets EMEA metrics, not because the LLM magically knows the constraint, but because the policy check at the retrieval layer filtered the context and the row filter on the warehouse query enforced the regional restriction.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Data Products as Retrieval Sources&lt;/h2&gt;
&lt;p&gt;The fundamental governance improvement in governed RAG is sourcing retrieval context from data products rather than unstructured document stores.&lt;/p&gt;
&lt;p&gt;A data product is a curated dataset published by a domain team with explicit quality contracts: a defined schema, documented ownership, SLAs for freshness and availability, and access controls. When your RAG system retrieves from data products, it retrieves from sources that have owners who are accountable for their accuracy, update on known schedules, and apply to documented policies.&lt;/p&gt;
&lt;p&gt;Compare this to the alternative: a vector store populated by crawling internal wikis. Wiki documents have unknown freshness, no quality SLAs, no access control auditing, and no owner accountable for accuracy. When an AI assistant generates an answer from a two-year-old policy document that was superseded, nobody is responsible for that retrieval decision.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.vectorstores import Weaviate
from langchain.schema import Document
import weaviate

# Index only from governed data products with metadata
client = weaviate.Client(&amp;quot;http://localhost:8080&amp;quot;)

def index_data_product(
    product_name: str,
    version: str,
    freshness_timestamp: str,
    owner_team: str,
    access_tags: list[str],
    content_chunks: list[str],
    embeddings: list[list[float]]
):
    &amp;quot;&amp;quot;&amp;quot;
    Index a data product&apos;s content with governance metadata.
    Access control enforced at query time using access_tags.
    &amp;quot;&amp;quot;&amp;quot;
    for i, (chunk, embedding) in enumerate(zip(content_chunks, embeddings)):
        client.data_object.create(
            data_object={
                &amp;quot;content&amp;quot;: chunk,
                &amp;quot;product_name&amp;quot;: product_name,
                &amp;quot;version&amp;quot;: version,
                &amp;quot;freshness_ts&amp;quot;: freshness_timestamp,
                &amp;quot;owner_team&amp;quot;: owner_team,
                &amp;quot;access_tags&amp;quot;: access_tags  # Used to filter at retrieval time
            },
            class_name=&amp;quot;DataProductChunk&amp;quot;,
            vector=embedding
        )
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h2&gt;Access Policy Enforcement at Retrieval&lt;/h2&gt;
&lt;p&gt;The access policy check happens before content reaches the LLM context window. A user with ANALYST role and EMEA regional access should retrieve only chunks tagged for their role and region. This is enforced in the vector store query filter, not in a prompt instruction to the LLM.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def retrieve_with_policy(
    query_embedding: list[float],
    user_role: str,
    user_region: str,
    k: int = 10
) -&amp;gt; list[dict]:
    &amp;quot;&amp;quot;&amp;quot;
    Retrieve relevant chunks with access policy applied at query time.
    Policy is enforced in the vector store query, not in the LLM prompt.
    &amp;quot;&amp;quot;&amp;quot;
    # Policy-filtered retrieval: only return chunks the user can see
    results = client.query.get(
        &amp;quot;DataProductChunk&amp;quot;,
        [&amp;quot;content&amp;quot;, &amp;quot;product_name&amp;quot;, &amp;quot;freshness_ts&amp;quot;, &amp;quot;owner_team&amp;quot;]
    ).with_near_vector({
        &amp;quot;vector&amp;quot;: query_embedding
    }).with_where({
        &amp;quot;operator&amp;quot;: &amp;quot;And&amp;quot;,
        &amp;quot;operands&amp;quot;: [
            {
                &amp;quot;path&amp;quot;: [&amp;quot;access_tags&amp;quot;],
                &amp;quot;operator&amp;quot;: &amp;quot;ContainsAny&amp;quot;,
                &amp;quot;valueText&amp;quot;: [user_role, &amp;quot;PUBLIC&amp;quot;, f&amp;quot;REGION_{user_region}&amp;quot;]
            },
            {
                &amp;quot;path&amp;quot;: [&amp;quot;freshness_ts&amp;quot;],
                &amp;quot;operator&amp;quot;: &amp;quot;GreaterThan&amp;quot;,
                &amp;quot;valueDate&amp;quot;: &amp;quot;2025-01-01T00:00:00Z&amp;quot;  # Freshness gate
            }
        ]
    }).with_limit(k).do()

    return results[&amp;quot;data&amp;quot;][&amp;quot;Get&amp;quot;][&amp;quot;DataProductChunk&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The freshness gate is particularly important. Stale context is a common source of AI assistant errors in enterprise settings. Setting a maximum staleness threshold at the retrieval layer ensures the model never generates answers from outdated data, even if the data product temporarily falls behind its SLA.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Routing Quantitative Questions Through the Semantic Layer&lt;/h2&gt;
&lt;p&gt;Questions that require precise calculation (revenue, user counts, conversion rates), should not be answered from retrieved document chunks. They should route to the semantic layer for deterministic SQL generation.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from openai import OpenAI

def classify_and_route(user_question: str, user_context: dict) -&amp;gt; dict:
    &amp;quot;&amp;quot;&amp;quot;
    Classify the question and route to appropriate retrieval path.
    Returns context assembled from the appropriate source.
    &amp;quot;&amp;quot;&amp;quot;
    # Simple classification: metric questions vs knowledge questions
    metric_keywords = [&amp;quot;revenue&amp;quot;, &amp;quot;count&amp;quot;, &amp;quot;rate&amp;quot;, &amp;quot;percentage&amp;quot;, &amp;quot;average&amp;quot;, &amp;quot;total&amp;quot;]
    is_metric_question = any(kw in user_question.lower() for kw in metric_keywords)

    if is_metric_question:
        # Route to semantic layer for deterministic SQL
        sql = generate_metric_sql(
            question=user_question,
            user_region=user_context[&amp;quot;region&amp;quot;],
            user_role=user_context[&amp;quot;role&amp;quot;]
        )
        return {&amp;quot;type&amp;quot;: &amp;quot;structured&amp;quot;, &amp;quot;sql&amp;quot;: sql, &amp;quot;source&amp;quot;: &amp;quot;semantic_layer&amp;quot;}
    else:
        # Route to vector retrieval from data products
        chunks = retrieve_with_policy(
            query_embedding=embed(user_question),
            user_role=user_context[&amp;quot;role&amp;quot;],
            user_region=user_context[&amp;quot;region&amp;quot;]
        )
        return {&amp;quot;type&amp;quot;: &amp;quot;retrieval&amp;quot;, &amp;quot;chunks&amp;quot;: chunks, &amp;quot;source&amp;quot;: &amp;quot;data_products&amp;quot;}
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h2&gt;Evaluating RAG Response Quality&lt;/h2&gt;
&lt;p&gt;Governed retrieval doesn&apos;t automatically produce good answers. The access policy filters ensure users only see authorized context, but they don&apos;t ensure the retrieved context is relevant or that the LLM uses it accurately. Evaluation tooling is necessary.&lt;/p&gt;
&lt;p&gt;Key evaluation metrics for governed RAG pipelines:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context relevance.&lt;/strong&gt; For each retrieved chunk, how semantically similar is it to the question being answered? Low relevance scores indicate the vector index or filtering is returning technically authorized but topically off-target results.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Faithfulness.&lt;/strong&gt; Does the LLM&apos;s generated answer accurately reflect the information in the retrieved chunks? Hallucination detection compares the generated claims against the retrieved context, flagging answers that assert information not present in the context.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Answer relevance.&lt;/strong&gt; Is the generated answer actually responsive to the user&apos;s question? An answer that accurately summarizes retrieved context but doesn&apos;t address what was asked is technically faithful but practically useless.&lt;/p&gt;
&lt;p&gt;MLflow 3&apos;s evaluation framework supports these metrics for RAG pipelines:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import mlflow

with mlflow.start_run():
    eval_results = mlflow.evaluate(
        model=governed_rag_pipeline,
        data=eval_dataset,
        model_type=&amp;quot;question-answering&amp;quot;,
        evaluators=[&amp;quot;default&amp;quot;],
        extra_metrics=[
            mlflow.metrics.genai.faithfulness(model=&amp;quot;openai:/gpt-4o&amp;quot;),
            mlflow.metrics.genai.relevance(model=&amp;quot;openai:/gpt-4o&amp;quot;),
            mlflow.metrics.genai.answer_correctness(model=&amp;quot;openai:/gpt-4o&amp;quot;)
        ]
    )
    mlflow.log_metric(&amp;quot;avg_faithfulness&amp;quot;, eval_results.metrics[&amp;quot;faithfulness/v1/mean&amp;quot;])
    mlflow.log_metric(&amp;quot;avg_relevance&amp;quot;, eval_results.metrics[&amp;quot;relevance/v1/mean&amp;quot;])
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h2&gt;Snowflake Horizon AI Guardrails&lt;/h2&gt;
&lt;p&gt;Snowflake Horizon extends its governance framework to AI workloads through AI guardrails, policies that restrict what AI systems can access when using Cortex and AI-native features. For organizations using Snowflake as the data product backing for RAG, Horizon&apos;s AI guardrails add a policy layer limiting which subsets of authorized data can be included in AI context.&lt;/p&gt;
&lt;p&gt;A sensitive financial table might be accessible to ANALYST role for SQL queries but excluded from AI context by guardrail policy, defense in depth that goes beyond standard role-based access.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Restrict AI access to non-PII columns
CREATE OR REPLACE AI USAGE POLICY restrict_ai_pii
    BLOCK ENTITIES sensitive_columns_table
    ON COLUMN email, phone, ssn;

ALTER TABLE customers
    SET AI USAGE POLICY restrict_ai_pii;
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h2&gt;Building the Audit Trail&lt;/h2&gt;
&lt;p&gt;One of the underappreciated requirements of enterprise RAG is the audit trail. When a business decision is made based on an AI assistant&apos;s recommendation, auditors may need to know: what data did the AI see? What was retrieved for that specific query?&lt;/p&gt;
&lt;p&gt;The governed RAG architecture enables this audit trail by design:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Retrieval logs:&lt;/strong&gt; Every vector store query, including filter conditions applied (role, region, freshness), logged with user identity and timestamp.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Source attribution:&lt;/strong&gt; Generated responses include citations to specific data product chunks retrieved, with version and freshness information.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SQL query log:&lt;/strong&gt; For structured queries routed to the semantic layer, generated SQL and warehouse execution plan are logged alongside the natural language question.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LLM trace:&lt;/strong&gt; MLflow tracing captures the full prompt, context, and response for each generation.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This four-layer audit trail satisfies most enterprise audit requirements, far more complete than what an ungoverned RAG system can provide.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Choosing Retrieval Infrastructure for Production&lt;/h2&gt;
&lt;p&gt;Vector store selection for governed RAG requires evaluating filtering capabilities alongside retrieval performance. Not every vector store implements the kind of metadata-filtered search required for access-controlled retrieval.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Weaviate&lt;/strong&gt; supports filtering on arbitrary metadata fields at query time, making it a natural choice for governed RAG where access tags and freshness timestamps are part of the filter expression.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;pgvector&lt;/strong&gt; supports SQL WHERE clause filtering alongside similarity search, which makes access control filters natural extensions of existing PostgreSQL policy logic.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Milvus&lt;/strong&gt; supports partition-based access control where different access tiers can be stored in different partitions, reducing filter overhead for large-scale deployments.&lt;/p&gt;
&lt;p&gt;For most enterprise RAG deployments starting from scratch, the choice comes down to operational complexity tolerance. pgvector has the lowest operational overhead for teams already running PostgreSQL. Weaviate provides the richest native hybrid search and filter capabilities. Milvus provides the best scale-out path for deployments that expect to grow to billions of vectors.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Governed RAG on data products is the enterprise-grade version of retrieval-augmented generation. It&apos;s not the easiest path to a working demo; ungoverned RAG is faster to build. But it&apos;s the only version that produces responses an organization can trust, audit, and stand behind when someone asks how the AI came to a particular conclusion.&lt;/p&gt;
&lt;p&gt;The key disciplines: source retrieval from data products with explicit ownership and freshness SLAs, enforce access policies at the retrieval layer (not in LLM prompts), route quantitative questions through a governed semantic layer, evaluate response quality against faithfulness and relevance metrics, and maintain a four-layer audit trail for accountability. This architecture makes AI assistants into reliable tools rather than plausible-sounding liability generators.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Chunking Strategy: The Hidden RAG Variable&lt;/h2&gt;
&lt;p&gt;Retrieval quality in RAG systems is heavily influenced by how documents are chunked before embedding. The chunking strategy determines the granularity of retrieval, too coarse, and the retrieved chunks contain irrelevant content that adds noise to the LLM prompt; too fine, and the chunks lack sufficient context for the LLM to reason effectively.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fixed-size chunking&lt;/strong&gt; divides documents into equal-length windows (e.g., 512 tokens) with optional overlap. It&apos;s simple but semantically arbitrary, a chunk boundary might fall in the middle of a sentence or concept.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Semantic chunking&lt;/strong&gt; uses embedding similarity to detect natural breakpoints where the semantic content shifts. Chunks within the same section tend to have high cosine similarity; the similarity drops at section boundaries. This produces chunks that align with the document&apos;s conceptual structure rather than its character count.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hierarchical chunking&lt;/strong&gt; creates a two-level index: large parent chunks for broad context and small child chunks for precise retrieval. Retrieval uses the small chunks for semantic similarity search, but the LLM receives the full parent chunk as context. This preserves retrieval precision while giving the model adequate context window content.&lt;/p&gt;
&lt;p&gt;For enterprise document corpora (policy documents, technical manuals, internal knowledge bases), hierarchical chunking consistently outperforms fixed-size chunking on faithfulness metrics. The implementation requires storing parent-child chunk relationships in the vector store metadata:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def hierarchical_chunk(document: str, parent_size: int = 1500, child_size: int = 300) -&amp;gt; list[dict]:
    &amp;quot;&amp;quot;&amp;quot;
    Create hierarchical chunks with parent-child relationships.
    Returns a list of chunk records with parent references.
    &amp;quot;&amp;quot;&amp;quot;
    chunks = []
    parent_id = 0

    # Create parent chunks
    parent_windows = create_fixed_windows(document, parent_size, overlap=150)

    for parent_text in parent_windows:
        # Create child chunks from each parent
        child_windows = create_fixed_windows(parent_text, child_size, overlap=50)

        for child_text in child_windows:
            chunks.append({
                &amp;quot;text&amp;quot;: child_text,
                &amp;quot;parent_text&amp;quot;: parent_text,
                &amp;quot;parent_id&amp;quot;: parent_id,
                &amp;quot;embedding&amp;quot;: embed_text(child_text),  # Child embedding for retrieval
            })
        parent_id += 1

    return chunks
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At retrieval time, search returns child chunks, but the LLM receives the full parent text from the &lt;code&gt;parent_text&lt;/code&gt; field.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Context Window Management&lt;/h2&gt;
&lt;p&gt;Enterprise LLM deployments face a constraint that research demos ignore: the context window has a finite size. Claude 3.5 Sonnet supports 200K tokens; GPT-4o supports 128K. These seem large, but a RAG system that retrieves 10 chunks of 1500 tokens each, plus conversation history, plus the system prompt, plus structured data from the semantic layer can easily approach or exceed the limit.&lt;/p&gt;
&lt;p&gt;Context window budget management requires explicitly tracking what goes into the prompt:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def build_governed_rag_prompt(
    query: str,
    retrieved_chunks: list[dict],
    structured_data: dict | None,
    conversation_history: list[dict],
    max_context_tokens: int = 80_000  # Conservative budget below model limit
) -&amp;gt; str:
    &amp;quot;&amp;quot;&amp;quot;
    Build a prompt that fits within the context window budget.
    Prioritizes: system prompt &amp;gt; structured data &amp;gt; relevant chunks &amp;gt; history
    &amp;quot;&amp;quot;&amp;quot;
    system_prompt = load_system_prompt()
    system_tokens = count_tokens(system_prompt)

    # Reserve budget for response
    response_budget = 4_000
    available = max_context_tokens - system_tokens - response_budget

    # Structured data is highest priority (exact facts)
    structured_section = format_structured_data(structured_data) if structured_data else &amp;quot;&amp;quot;
    available -= count_tokens(structured_section)

    # History (most recent first, truncated to available budget)
    history_section = truncate_history_to_budget(conversation_history, available // 3)
    available -= count_tokens(history_section)

    # Fill remaining budget with retrieved chunks (most relevant first)
    chunks_section = fill_chunks_to_budget(retrieved_chunks, available)

    return f&amp;quot;{system_prompt}\n\n{structured_section}\n\n{chunks_section}\n\n{history_section}\n\nQuestion: {query}&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This explicit budget management prevents context overflow while ensuring the most reliable content (structured semantic layer data) gets priority over less reliable content (unstructured retrieved chunks).&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Dual-Pipeline Architecture for Production RAG&lt;/h2&gt;
&lt;p&gt;Production RAG systems serving enterprise users typically benefit from separating the retrieval and generation concerns into independent scaling dimensions. A query routing layer directs different question types to different backends:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Quantitative questions&lt;/strong&gt; → Semantic layer SQL generation → Database execution → Structured result formatting&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Qualitative questions about policies, procedures, documentation&lt;/strong&gt; → Vector retrieval → LLM generation with retrieved context&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mixed questions&lt;/strong&gt; → Both pipelines in parallel → LLM synthesis combining structured data and retrieved text&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This dual-pipeline design means the vector store and the SQL execution engine scale independently. When the volume of quantitative queries grows (because more users are asking &amp;quot;what&apos;s my team&apos;s budget status this quarter&amp;quot;), compute scales on the SQL path. When document corpus grows, storage and indexing scales on the vector path.&lt;/p&gt;
&lt;p&gt;The routing logic is itself a classification model, either a fine-tuned classifier or a smaller, fast LLM that categorizes the incoming question before routing:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def route_query(query: str) -&amp;gt; str:
    &amp;quot;&amp;quot;&amp;quot;Returns &apos;structured&apos;, &apos;unstructured&apos;, or &apos;hybrid&apos;&amp;quot;&amp;quot;&amp;quot;
    classifier = load_query_classifier()
    return classifier.predict(query)

def answer_question(query: str, user_context: dict) -&amp;gt; dict:
    route = route_query(query)

    if route == &amp;quot;structured&amp;quot;:
        sql = generate_governed_sql(query, user_context)
        result = execute_against_semantic_layer(sql)
        return format_structured_response(result, query)

    elif route == &amp;quot;unstructured&amp;quot;:
        chunks = retrieve_with_access_control(query, user_context)
        return generate_rag_response(query, chunks)

    else:  # hybrid
        sql = generate_governed_sql(query, user_context)
        structured = execute_against_semantic_layer(sql)
        chunks = retrieve_with_access_control(query, user_context)
        return generate_hybrid_response(query, structured, chunks)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The dual-pipeline architecture adds implementation complexity but significantly improves answer quality for mixed enterprise use cases where questions range from precise metric lookups to open-ended policy questions.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Build AI-Ready Data Platforms&lt;/h3&gt;
&lt;p&gt;For comprehensive guidance on agentic AI integration with lakehouse data architecture, pick up &lt;a href=&quot;https://www.amazon.com/dp/B0GQNY21TD&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI: A Hands-On Practitioner&apos;s Guide to Modern Data Architecture, Open Table Formats, and Agentic AI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Browse Alex&apos;s other data engineering and analytics books at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Dremio provides governed multi-engine query access to your Iceberg lakehouse, making it an ideal data product foundation for enterprise RAG. Try it free at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>What Iceberg V3 Advances Mean for CDC Pipelines</title><link>https://iceberglakehouse.com/posts/2026-05-24-iceberg-cdc-pipelines/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-24-iceberg-cdc-pipelines/</guid><description>
Change Data Capture pipelines expose one of Apache Iceberg&apos;s most persistent weaknesses: its original mechanism for handling updates and deletes. Whe...</description><pubDate>Sun, 24 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Change Data Capture pipelines expose one of Apache Iceberg&apos;s most persistent weaknesses: its original mechanism for handling updates and deletes. When you stream CDC events into Iceberg using merge-on-read semantics, you accumulate delete files. Each update or delete operation for a row creates a separate positional delete file that the query engine must reconcile against the original data file at read time. The delete files pile up between compaction runs. Read performance degrades. Compaction becomes a continuous obligation.&lt;/p&gt;
&lt;p&gt;Iceberg format version 3, ratified in 2024 and moving to full production stability with the Apache Iceberg 1.11.0 release in May 2026, replaces this design with binary deletion vectors. Combined with native row lineage tracking, these two changes reshape how CDC pipelines can be built and maintained. They don&apos;t eliminate all CDC complexity, but they remove the structural weaknesses that made Iceberg an awkward fit for high-churn streaming workloads.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Core Problem with Iceberg V2 and CDC&lt;/h2&gt;
&lt;p&gt;In Iceberg V2, updating or deleting a row doesn&apos;t rewrite the original data file. Instead, the engine writes a separate delete file, either a positional delete file that records the file path and row offset of each deleted row, or an equality delete file that records key values to be matched and removed at query time.&lt;/p&gt;
&lt;p&gt;This design was an improvement over full copy-on-write rewrites for individual row mutations. But it introduced a different problem: delete file accumulation. Every streaming CDC commit adds more delete files. A high-churn table receiving thousands of updates per second can accumulate thousands of delete files per hour. Queries must scan all relevant delete files and apply them to data files before returning results. Planning time increases with delete file count, not with data volume.&lt;/p&gt;
&lt;p&gt;The remediation is compaction. Running &lt;code&gt;RewriteDataFiles&lt;/code&gt; with the binpack strategy merges delete files back into data files, producing clean, fully materialized Parquet outputs. But compaction is expensive, and streaming pipelines produce mutations faster than compaction can clean up unless you dedicate substantial compute to maintenance jobs running continuously in parallel with your ingestion.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Iceberg V3: Deletion Vectors&lt;/h2&gt;
&lt;p&gt;The binary deletion vector (DV) mechanism in Iceberg V3 addresses the delete file accumulation problem at the format level. Instead of writing a separate delete file for each mutation, the engine writes a compact binary bitmap (stored in a Puffin statistics file), that marks which row positions in a data file are deleted.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/iceberg-cdc-pipelines/deletion-vectors-vs-positional-deletes.png&quot; alt=&quot;Architecture comparison showing Iceberg V2 creating separate positional delete Avro files for each delete operation versus Iceberg V3 using compact binary bitmaps stored in Puffin files alongside data files&quot;&gt;&lt;/p&gt;
&lt;p&gt;A Puffin file is Iceberg&apos;s extensible statistics file format. In V3, it also serves as the container for deletion vector bitmaps. Each data file has at most one associated DV bitmap. When a row is deleted, the engine sets the corresponding bit in the bitmap. When multiple rows in the same data file are deleted in separate commits, the bitmaps are merged (OR-ing the bits), rather than creating new files.&lt;/p&gt;
&lt;p&gt;The operational implications are significant:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No separate delete file per operation.&lt;/strong&gt; A table receiving thousands of individual row deletes per second produces one bitmap update per data file instead of thousands of individual delete files. The file count per partition remains stable regardless of deletion rate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Faster read-time reconciliation.&lt;/strong&gt; Applying a bitmap to filter out deleted rows is a vectorized operation. The query engine reads the bitmap, applies it as a bitmask to the row group during Parquet scan, and skips deleted rows without needing to scan a separate file and perform a join-like reconciliation. This is substantially faster than the equality delete join that V2 required.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Smaller metadata overhead.&lt;/strong&gt; Delete bitmaps are compact. A bitmap tracking deleted rows across a 128 MB Parquet file containing millions of rows takes kilobytes, not megabytes.&lt;/p&gt;
&lt;p&gt;The upgrade path from V2 to V3 is a one-way operation. You can enable V3 on a table using:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Upgrade an existing Iceberg table to format version 3
ALTER TABLE my_catalog.analytics.events
SET TBLPROPERTIES (
    &apos;format-version&apos; = &apos;3&apos;,
    &apos;write.delete.mode&apos; = &apos;merge-on-read&apos;,
    &apos;write.update.mode&apos; = &apos;merge-on-read&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once upgraded, the table uses deletion vectors for all subsequent delete and update operations. Existing V2-format data files and positional delete files from before the upgrade remain readable. V3-written files and bitmaps coexist with V2 files until compaction rewrites them.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Row Lineage: Native Incremental Processing&lt;/h2&gt;
&lt;p&gt;The second major V3 addition for CDC pipelines is row lineage. This feature adds two system-generated metadata fields to every Iceberg table: &lt;code&gt;_first_row_id&lt;/code&gt; and &lt;code&gt;_added_rows&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;_first_row_id&lt;/code&gt; assigns a monotonically increasing integer identifier to each row when it is first written. This identifier is stable across updates: if row A is written in snapshot 5 with &lt;code&gt;_first_row_id = 1001&lt;/code&gt;, and then updated in snapshot 12, the row still carries &lt;code&gt;_first_row_id = 1001&lt;/code&gt; in the updated version.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;_added_rows&lt;/code&gt; records how many rows were added to a data file in the snapshot that wrote it.&lt;/p&gt;
&lt;p&gt;Together, these fields provide a native mechanism for incremental reads that doesn&apos;t require engine-specific CDC implementations. Downstream systems can query for rows added since a known snapshot by filtering on &lt;code&gt;_first_row_id &amp;gt; last_known_max&lt;/code&gt;. They can identify exactly which rows changed between snapshots by comparing row IDs across the incremental range.&lt;/p&gt;
&lt;p&gt;Before row lineage, incremental reads from Iceberg required either full snapshot comparison (expensive) or engine-specific extension metadata that wasn&apos;t portable across different query engines. Row lineage makes this portable at the format level.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query rows added since a known snapshot using row lineage
SELECT *
FROM my_catalog.analytics.orders
WHERE _first_row_id &amp;gt; 5000000
  AND event_date &amp;gt;= CURRENT_DATE - INTERVAL &apos;1&apos; DAY
ORDER BY _first_row_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For audit and compliance use cases, row lineage provides a trail of every row&apos;s origin that persists even after updates. A row&apos;s &lt;code&gt;_first_row_id&lt;/code&gt; never changes, allowing you to trace when a piece of data first entered the system regardless of how many times it was subsequently updated.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;End-to-End CDC Pipeline Architecture with V3&lt;/h2&gt;
&lt;p&gt;The practical CDC architecture using Iceberg V3 looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/iceberg-cdc-pipelines/iceberg-v3-cdc-pipeline-architecture.png&quot; alt=&quot;End-to-end CDC pipeline from source database through Debezium and Kafka to Flink stream processor, then into Iceberg V3 tables with deletion vectors, row lineage metadata, and audit trail snapshots for downstream analytics&quot;&gt;&lt;/p&gt;
&lt;p&gt;A Debezium connector captures row-level changes from MySQL or PostgreSQL and publishes them as structured CDC events to Kafka. Each event contains the operation type (INSERT, UPDATE, DELETE), the before image of the row, and the after image.&lt;/p&gt;
&lt;p&gt;A Flink job consumes these events and applies them to the Iceberg table using the Iceberg Flink connector. For INSERT operations, new rows are written with fresh &lt;code&gt;_first_row_id&lt;/code&gt; values. For UPDATE operations, the old row is marked in the deletion vector bitmap and the new row image is written to a new data file. For DELETE operations, the row position is marked in the bitmap.&lt;/p&gt;
&lt;p&gt;This design avoids the separate delete file accumulation problem entirely. High-velocity UPDATE streams produce bitmap updates rather than ever-growing delete file collections.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Engine Support and Compatibility&lt;/h2&gt;
&lt;p&gt;Iceberg V3 requires engine support to use deletion vectors and row lineage effectively. As of mid-2026, support has stabilized across major engines:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/iceberg-cdc-pipelines/iceberg-v3-engine-support-matrix.png&quot; alt=&quot;Feature support matrix for Iceberg V3 showing deletion vectors, row lineage metadata, Parquet V3 statistics, and DV compaction support levels across Apache Spark, Apache Flink, Trino/Presto, and Dremio&quot;&gt;&lt;/p&gt;
&lt;p&gt;Apache Spark has the most complete V3 support, having driven much of the specification work. Apache Flink supports deletion vectors and DV compaction fully; row lineage support is in progress as of mid-2026. Trino supports deletion vectors for reads and writes but row lineage and DV-aware compaction are not yet fully available. Dremio has been adding comprehensive V3 support including row lineage in its query planning and result delivery.&lt;/p&gt;
&lt;p&gt;The backward compatibility story is solid. V3 tables can still be read by engines that support V2 Iceberg format, though those engines will fall back to treating deletion vector bitmaps as unknown statistics and may not apply deletes correctly. Before upgrading tables to V3, confirm that all query engines in your data platform support reading V3-format files.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;What V3 Doesn&apos;t Change&lt;/h2&gt;
&lt;p&gt;Iceberg V3 doesn&apos;t eliminate the need for compaction. Deletion vector bitmaps reduce the number of separate files per partition, but data files still fragment from streaming writes. Compaction remains necessary to merge small data files and rewrite deletion vector bitmaps into fully materialized clean files. The difference is that compaction frequency can be lower because bitmaps accumulate more gracefully than separate delete files.&lt;/p&gt;
&lt;p&gt;V3 also doesn&apos;t change the upgrade path complexity. Tables must be explicitly upgraded from V2 to V3. In environments with many active tables, this requires a coordinated migration plan; you can&apos;t upgrade all tables simultaneously without testing engine compatibility and validating query results.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Iceberg V3&apos;s deletion vectors and row lineage features are the most significant improvements to the format&apos;s CDC story since the introduction of merge-on-read semantics. Deletion vectors replace the separate delete file design that caused metadata bloat in high-mutation streaming environments. Row lineage provides a portable, engine-independent mechanism for incremental reads and audit trails.&lt;/p&gt;
&lt;p&gt;For CDC pipeline teams, the practical step is to test V3 deletion vectors on your most write-heavy tables, validate that your downstream query engines support V3 reads, and plan an incremental migration. Don&apos;t upgrade everything at once, start with the tables where delete file accumulation is currently causing compaction pressure and measure the improvement before rolling out broadly.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Practical Debezium Setup for Iceberg CDC&lt;/h2&gt;
&lt;p&gt;Getting a Debezium-to-Iceberg CDC pipeline working in practice involves several configuration choices that have significant downstream effects on your Iceberg table structure.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Choosing the Kafka topic structure.&lt;/strong&gt; Each Debezium connector produces events for a specific database table. The event schema includes the before and after row images and the operation type. For Iceberg pipelines, the most common pattern is one Kafka topic per source table, with a Flink consumer reading each topic and writing to a corresponding Iceberg table.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Snapshot mode.&lt;/strong&gt; The first time Debezium connects to a source database, it takes a full snapshot of existing data before streaming CDC events. For large tables (millions of rows), the snapshot can take hours. The Iceberg target table must be empty or handle idempotent writes from the snapshot before receiving streaming events. The &lt;code&gt;snapshot.mode&lt;/code&gt; configuration in Debezium controls this behavior, &lt;code&gt;initial&lt;/code&gt; snapshots first and then streams, while &lt;code&gt;never&lt;/code&gt; only streams ongoing changes.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;connector.class&amp;quot;: &amp;quot;io.debezium.connector.postgresql.PostgresConnector&amp;quot;,
  &amp;quot;database.hostname&amp;quot;: &amp;quot;postgres.prod.internal&amp;quot;,
  &amp;quot;database.port&amp;quot;: &amp;quot;5432&amp;quot;,
  &amp;quot;database.user&amp;quot;: &amp;quot;debezium_user&amp;quot;,
  &amp;quot;database.password&amp;quot;: &amp;quot;${file:/run/secrets/postgres-creds:password}&amp;quot;,
  &amp;quot;database.dbname&amp;quot;: &amp;quot;production&amp;quot;,
  &amp;quot;database.server.name&amp;quot;: &amp;quot;prod_postgres&amp;quot;,
  &amp;quot;table.include.list&amp;quot;: &amp;quot;public.orders,public.customers,public.products&amp;quot;,
  &amp;quot;plugin.name&amp;quot;: &amp;quot;pgoutput&amp;quot;,
  &amp;quot;slot.name&amp;quot;: &amp;quot;debezium_prod&amp;quot;,
  &amp;quot;snapshot.mode&amp;quot;: &amp;quot;initial&amp;quot;,
  &amp;quot;decimal.handling.mode&amp;quot;: &amp;quot;double&amp;quot;,
  &amp;quot;heartbeat.interval.ms&amp;quot;: &amp;quot;10000&amp;quot;,
  &amp;quot;publication.name&amp;quot;: &amp;quot;dbz_publication&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;heartbeat.interval.ms&lt;/code&gt; setting is important for tables with low write volume. Without heartbeats, a replication slot in PostgreSQL can accumulate WAL logs indefinitely if no changes occur, potentially filling disk. Regular heartbeat events keep the replication slot position advancing.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Schema Migration in CDC Pipelines&lt;/h2&gt;
&lt;p&gt;One of the most operationally challenging aspects of CDC pipelines is handling source database schema changes. When a developer adds a column to the &lt;code&gt;orders&lt;/code&gt; table in PostgreSQL, several things need to happen:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Debezium detects the schema change from the DDL event in the WAL&lt;/li&gt;
&lt;li&gt;The Kafka topic schema (if using Schema Registry) must be updated&lt;/li&gt;
&lt;li&gt;The Flink consumer must handle the new column in incoming events&lt;/li&gt;
&lt;li&gt;The Iceberg target table must be updated with the new column&lt;/li&gt;
&lt;li&gt;Historical rows in the Iceberg table will have NULL for the new column&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Iceberg&apos;s schema evolution makes step 4 non-destructive. Adding a new optional column to an Iceberg table is metadata-only; no data files are rewritten, and the column shows as NULL for all historical rows. This is the same guarantee that enables the CDC schema migration flow to be automated.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Automatically apply schema changes from Debezium to Iceberg
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType

catalog = load_catalog(&amp;quot;polaris&amp;quot;, **{&amp;quot;uri&amp;quot;: &amp;quot;https://catalog.example.com&amp;quot;})
table = catalog.load_table(&amp;quot;prod_replica.orders&amp;quot;)

# Add a new column without rewriting data files
with table.update_schema() as update:
    update.add_column(
        path=&amp;quot;shipping_carrier&amp;quot;,  # New column added to source
        field_type=StringType(),
        required=False  # Always optional for schema migration safety
    )
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This automated schema migration pattern (detecting DDL changes from Debezium, applying them to the Iceberg schema via the PyIceberg API), allows the CDC pipeline to self-heal after schema changes without manual intervention.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Flink Checkpoint Configuration for CDC Reliability&lt;/h2&gt;
&lt;p&gt;Flink checkpointing is the mechanism that makes CDC pipelines resumable after failures. Without proper checkpoint configuration, a Flink job failure requires reprocessing from the beginning of the Kafka topic, which for high-volume tables means hours of catch-up processing.&lt;/p&gt;
&lt;p&gt;The critical Flink checkpoint settings for Iceberg CDC pipelines:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;# flink-conf.yaml for Iceberg CDC reliability
execution.checkpointing.interval: 60s # Checkpoint every 60 seconds
execution.checkpointing.min-pause: 30s # Minimum time between checkpoints
execution.checkpointing.timeout: 300s # Checkpoint must complete within 5 minutes
execution.checkpointing.max-concurrent-checkpoints: 1
state.backend: rocksdb # RocksDB for large-state CDC
state.backend.incremental: true # Incremental RocksDB checkpoints
state.checkpoints.dir: s3://checkpoints/flink/ # Checkpoint storage in S3
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The Iceberg Flink connector commits data files to the Iceberg catalog at checkpoint time. This means that exactly-once semantics in the pipeline correspond to checkpoint frequency, a 60-second checkpoint interval means the pipeline can be at most 60 seconds behind the latest committed snapshot in Iceberg. This is typically acceptable for analytical workloads but may need tuning for near-real-time requirements.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Go Deeper on Iceberg and Lakehouse Architecture&lt;/h3&gt;
&lt;p&gt;For a comprehensive treatment of Apache Iceberg, open table format design, and modern lakehouse patterns, pick up &lt;a href=&quot;https://www.amazon.com/dp/B0GQNY21TD&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI: A Hands-On Practitioner&apos;s Guide to Modern Data Architecture, Open Table Formats, and Agentic AI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Browse Alex&apos;s other data engineering and analytics books at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Dremio provides native Iceberg V3 query support with automated reflection acceleration across your Iceberg tables. Try it free at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Kafka 4.0 Changes Streaming Platform Operations</title><link>https://iceberglakehouse.com/posts/2026-05-24-kafka-streaming-operations/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-24-kafka-streaming-operations/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-05-kafka-streaming-operations/).

A...</description><pubDate>Sun, 24 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-05-kafka-streaming-operations/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Apache Kafka 4.0 shipped on March 18, 2025, and it made one thing official: ZooKeeper is gone. Not deprecated, not optional, removed. Every new Kafka 4.0 cluster runs in KRaft mode. If your team still runs ZooKeeper-based brokers, you cannot do an in-place upgrade to 4.0. That&apos;s the short version of what changed.&lt;/p&gt;
&lt;p&gt;The longer version is more interesting. Kafka 4.0 also marks two other operational milestones. The new consumer rebalance protocol (KIP-848) is now generally available, replacing the &amp;quot;stop-the-world&amp;quot; rebalance behavior that has caused consumer lag spikes for years. And Queues for Kafka (KIP-932), which enables point-to-point messaging semantics on top of Kafka topics, entered early access.&lt;/p&gt;
&lt;p&gt;Together, these changes rewrite several operating model assumptions that platform teams have held since 2015. Here&apos;s what they mean in practice.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The End of ZooKeeper: What KRaft Actually Changes&lt;/h2&gt;
&lt;p&gt;ZooKeeper served Kafka as a distributed coordination service. Kafka brokers used it to store cluster metadata, conduct leader elections for partition controllers, and track consumer group state. Every Kafka operator knew the drill: you didn&apos;t just run Kafka, you ran Kafka plus a ZooKeeper ensemble, monitored both, and managed the dependency chain between them.&lt;/p&gt;
&lt;p&gt;KRaft, which stands for Kafka Raft, replaces ZooKeeper by embedding the consensus and metadata management directly in the Kafka broker process. A quorum of Kafka brokers act as controllers, storing metadata in an internal Raft-replicated log rather than in a separate ZooKeeper cluster. One broker holds the active controller role; the others replicate its log and are ready to take over if the controller fails.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/kafka-streaming-operations/kafka-zookeeper-vs-kraft-architecture.png&quot; alt=&quot;Architecture comparison showing ZooKeeper-based Kafka 3.x with separate ZooKeeper cluster versus KRaft-based Kafka 4.0 with integrated Raft quorum, reducing two systems to one&quot;&gt;&lt;/p&gt;
&lt;p&gt;The operational implications are substantial.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Infrastructure reduction.&lt;/strong&gt; A production ZooKeeper ensemble typically requires three or five nodes, each with separate monitoring, patching, and disk management. In KRaft mode, those nodes disappear. You manage one system instead of two. For teams running Kafka on Kubernetes, this removes several StatefulSet configurations, PersistentVolumeClaims, and service accounts from your deployment manifests.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Faster controller failover.&lt;/strong&gt; In ZooKeeper-based Kafka, a controller failover triggered a ZooKeeper session timeout, which could take 18 to 30 seconds under default configurations before the election completed and a new controller began serving metadata. KRaft uses a heartbeat-based leader detection mechanism with typical election times under 5 seconds in most environments.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Higher partition limits.&lt;/strong&gt; ZooKeeper stored partition metadata in memory, which capped practical cluster limits at around 200,000 partitions before memory pressure and election latency became problematic. KRaft&apos;s metadata log approach scales to millions of partitions on the same hardware. For teams running high-fanout event platforms with many small topics, this removes a hard architectural ceiling.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What doesn&apos;t change.&lt;/strong&gt; Your producer and consumer code still works the same way. Your topics, partitions, and consumer groups remain intact after migration. The client-facing API is backward compatible. The operational change is entirely on the broker and infrastructure side.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Upgrade Path: Why You Can&apos;t Jump Directly to 4.0&lt;/h2&gt;
&lt;p&gt;This is where most teams will need careful planning. If your cluster currently runs in ZooKeeper mode on Kafka 2.x or 3.x, you cannot upgrade directly to Kafka 4.0. The 4.0 broker binary includes no ZooKeeper client libraries at all. Attempting to start a 4.0 broker against a ZooKeeper-based cluster will fail on startup.&lt;/p&gt;
&lt;p&gt;The supported path has two stages. First, you migrate your existing cluster to KRaft mode while still on Kafka 3.7 or later. The 3.x releases include a built-in ZooKeeper-to-KRaft migration tool that converts an existing cluster&apos;s metadata in place while the cluster remains live. Second, once the cluster runs in KRaft mode, you upgrade broker versions to 4.0.&lt;/p&gt;
&lt;p&gt;For Amazon MSK users, the in-place conversion path is not available. AWS MSK does not support converting a ZooKeeper-based MSK cluster to KRaft. The supported migration approach is a parallel cluster strategy:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Provision a new MSK cluster in KRaft mode.&lt;/li&gt;
&lt;li&gt;Use MirrorMaker 2 to replicate your topic configurations and consumer group offsets to the new cluster.&lt;/li&gt;
&lt;li&gt;Update your producer and consumer applications to point to the new cluster&apos;s bootstrap brokers.&lt;/li&gt;
&lt;li&gt;Validate offset continuity and run both clusters in parallel for a burn-in period.&lt;/li&gt;
&lt;li&gt;Shift traffic progressively: 20%, then 50%, then 100%.&lt;/li&gt;
&lt;li&gt;Decommission the original cluster after validating no data loss.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For Confluent Cloud users, the managed platform handles KRaft internally. If you use Confluent Cloud, you&apos;re already running on KRaft. The upgrade path concern applies to self-managed clusters.&lt;/p&gt;
&lt;p&gt;One more constraint: Kafka 4.0 raises the minimum Java version requirements. Brokers, Kafka Connect workers, and command-line tools now require Java 17. Kafka client libraries and Kafka Streams applications require Java 11. If your application stack still runs on Java 8 or Java 11 for broker processes, that must be resolved before the upgrade.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/kafka-streaming-operations/kafka-40-migration-roadmap.png&quot; alt=&quot;Six-step Kafka 4.0 migration roadmap from auditing the current ZooKeeper setup through MirrorMaker 2 replication and gradual traffic cutover&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;KIP-848: The New Consumer Group Protocol&lt;/h2&gt;
&lt;p&gt;The second major change in Kafka 4.0 is the general availability of the new consumer rebalance protocol, KIP-848. To understand why this matters, you need to understand what was wrong with the old one.&lt;/p&gt;
&lt;p&gt;The classic rebalance protocol is sometimes called &amp;quot;stop-the-world&amp;quot; because that&apos;s what it does. When a consumer joins or leaves a group (or when a consumer&apos;s heartbeat times out), the group coordinator triggers a full group rebalance. Every consumer in the group stops processing, revokes its current partition assignments, and waits for the group leader (a client-side process) to compute a new assignment and distribute it to all members through the coordinator. Only after all members acknowledge the new assignment does processing resume.&lt;/p&gt;
&lt;p&gt;For a consumer group with ten members, adding an eleventh member pauses all ten existing members for the duration of the rebalance. In practice, this can pause processing for seconds to tens of seconds, depending on the number of partitions, the complexity of the assignment strategy, and network round-trip times between consumers and the broker.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/kafka-streaming-operations/kafka-classic-vs-kip848-rebalance.png&quot; alt=&quot;Timeline comparison showing classic protocol stop-the-world rebalance versus KIP-848 incremental heartbeat-based reassignment with minimal processing disruption&quot;&gt;&lt;/p&gt;
&lt;p&gt;KIP-848 moves the assignment logic from the client to the broker. Under the new protocol:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Consumers send heartbeats that describe their current partition assignments and capabilities.&lt;/li&gt;
&lt;li&gt;The broker-side group coordinator computes assignment changes incrementally without requiring all consumers to revoke their current partitions simultaneously.&lt;/li&gt;
&lt;li&gt;Only the partitions being reassigned are affected. Consumers holding partitions that don&apos;t need to move continue processing throughout the rebalance.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The practical result is that adding a consumer to a group with 100 partitions no longer pauses all 99 other partitions&apos; processing. The coordinator moves partitions incrementally, one or a few at a time, with no group-wide pause.&lt;/p&gt;
&lt;h3&gt;Enabling KIP-848&lt;/h3&gt;
&lt;p&gt;The new protocol is not enabled by default for existing consumer applications. To opt in, set this consumer configuration property:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-properties&quot;&gt;group.protocol=consumer
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The broker supports both the classic protocol and the new protocol simultaneously. A consumer group can mix consumers using both protocols during a rolling restart, which enables gradual migration without coordinated downtime. Once all consumers in a group are restarted with the new configuration, the group transitions fully to the KIP-848 protocol.&lt;/p&gt;
&lt;p&gt;Several configuration properties that consumers previously managed client-side are now handled by the broker under KIP-848. Specifically, &lt;code&gt;group.consumer.session.timeout.ms&lt;/code&gt;, &lt;code&gt;group.consumer.heartbeat.interval.ms&lt;/code&gt;, and the assignor configuration are now server-side settings. Clients provide &lt;code&gt;rebalance.timeout.ms&lt;/code&gt; (derived from &lt;code&gt;max.poll.interval.ms&lt;/code&gt;) to tell the broker how long they need to revoke partitions safely, but the assignment computation itself no longer happens on the client.&lt;/p&gt;
&lt;p&gt;If you use Kafka Streams, note that Streams has a separate roadmap (KIP-1071) for adopting the new protocol. Do not enable &lt;code&gt;group.protocol=consumer&lt;/code&gt; for Kafka Streams applications until the Streams version you&apos;re running explicitly supports it.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Removed APIs and Protocol Changes&lt;/h2&gt;
&lt;p&gt;Kafka 4.0 also removes several legacy components that were deprecated in earlier releases. This is the category most likely to break existing integrations without warning if you skip the compatibility check.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Removed message formats.&lt;/strong&gt; Message formats v0 and v1, deprecated in Kafka 3.0, are no longer present in 4.0. These were the original binary formats from Kafka&apos;s earliest versions. Clients using these formats will fail to produce or consume messages against a 4.0 broker.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Removed old API versions.&lt;/strong&gt; Kafka&apos;s protocol uses versioned RPCs. Old API versions deprecated in 3.x are removed in 4.0. Most modern clients (librdkafka 1.9+, Java client 3.0+, Python confluent-kafka 1.9+) handle protocol version negotiation automatically and will work fine. Clients that hardcode specific API versions may fail.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Log4j to Log4j2.&lt;/strong&gt; Kafka&apos;s internal logging framework migrated from Log4j 1.x to Log4j2. Custom logging configurations in &lt;code&gt;log4j.properties&lt;/code&gt; format need to be migrated to the Log4j2 XML or YAML format. The old file is ignored silently, which means logging configuration changes don&apos;t take effect unless you notice and convert the file.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Queues for Kafka (KIP-932).&lt;/strong&gt; This feature, in early access, enables point-to-point queue semantics where a message is consumed by exactly one consumer, rather than being broadcast to all subscribers in a partition-based group. For teams building task queue patterns on top of Kafka, this removes the need for external workarounds like single-partition topics or external coordination. Early access means the API may change before it stabilizes.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;What This Means for Platform Operations&lt;/h2&gt;
&lt;p&gt;The combination of KRaft, KIP-848, and removed ZooKeeper dependency simplifies the operational surface of a streaming platform in meaningful ways.&lt;/p&gt;
&lt;p&gt;Your monitoring stack needs to change. ZooKeeper-specific metrics (&lt;code&gt;/brokers/ids&lt;/code&gt;, ensemble latency, leader election counts) disappear. KRaft introduces its own metrics through the &lt;code&gt;kafka.controller&lt;/code&gt; and &lt;code&gt;kafka.raft&lt;/code&gt; metric namespaces, which track Raft quorum health, metadata log lag, and controller election timing. Update your Prometheus scrapers, Grafana dashboards, or DataDog monitors before upgrading.&lt;/p&gt;
&lt;p&gt;Kafka Connect workers are largely unaffected by the KRaft transition. Connect uses consumer groups internally and the broker API; the ZooKeeper removal doesn&apos;t change the Connect worker&apos;s behavior. The Connect REST API remains the same. The only Connect-related change is the Java 17 requirement for the worker process itself.&lt;/p&gt;
&lt;p&gt;Schema Registry, ksqlDB, and other Confluent Platform components that store metadata in Kafka topics (rather than ZooKeeper) are also mostly unaffected from an operational standpoint. Check version compatibility tables for each component before upgrading the broker.&lt;/p&gt;
&lt;p&gt;The clearest near-term win from 4.0 for most teams is not the architectural change but the operational simplification. Running one fewer distributed system (with its own leader election, connection pooling, Jute serialization format, and 4-letter word commands), reduces the number of things that can fail at 2 a.m.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Kafka 4.0 marks the end of a decade-long dependency on ZooKeeper. The migration path is well-defined, but it requires advance planning: you cannot jump directly from a ZooKeeper-based cluster to 4.0 without first converting to KRaft on 3.x. On MSK, that means a parallel cluster migration. On self-managed clusters, the built-in migration tool in 3.7+ handles the conversion.&lt;/p&gt;
&lt;p&gt;Plan for three specific compatibility items before starting: Java version requirements, removed legacy API versions, and the log4j configuration format change. None of them are blockers, but all three will cause silent failures if you don&apos;t check them ahead of time.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Producer Configuration for Throughput vs. Reliability&lt;/h2&gt;
&lt;p&gt;Kafka producer configuration involves a fundamental tradeoff: higher throughput settings reduce reliability guarantees, and higher reliability settings reduce throughput. Understanding which side of this tradeoff your workload needs is essential for correct configuration.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;High-throughput, tolerant of some data loss (metrics, telemetry):&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-properties&quot;&gt;acks=1                          # Leader acknowledges only
batch.size=131072               # 128 KB batches
linger.ms=20                    # Wait up to 20ms to fill batch
compression.type=lz4            # Fast compression
enable.idempotence=false        # No dedup overhead
max.in.flight.requests.per.connection=5
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Exactly-once semantics (financial transactions, CDC events):&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-properties&quot;&gt;acks=all                        # All in-sync replicas acknowledge
enable.idempotence=true         # Deduplicate retries at the broker
transactional.id=my-producer-001  # Enable transactions
transaction.timeout.ms=60000    # 60 second transaction window
max.in.flight.requests.per.connection=5  # Required with idempotence
batch.size=65536               # 64 KB batches (smaller for latency)
linger.ms=5                    # Short linger for latency
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;enable.idempotence=true&lt;/code&gt; setting ensures that retried producer sends don&apos;t create duplicate messages. The broker assigns each producer a unique PID (Producer ID) and sequence numbers to each message, allowing it to detect and discard duplicates. For CDC pipelines and financial event streams, idempotent producers are essential.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Consumer Lag Monitoring in Production&lt;/h2&gt;
&lt;p&gt;Consumer lag (the gap between the latest offset in a Kafka partition and the consumer group&apos;s current committed offset), is the most important operational metric for streaming platforms. Growing consumer lag means the pipeline is falling behind incoming data. Without monitoring and alerting on consumer lag, you may not discover a falling pipeline until business logic downstream has been starved for hours.&lt;/p&gt;
&lt;p&gt;The standard tools for consumer lag monitoring:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prometheus JMX Exporter + Kafka Exporter:&lt;/strong&gt; The &lt;code&gt;kafka_consumergroup_lag&lt;/code&gt; metric exposed through the Kafka Exporter is the simplest path. Set alerts when lag exceeds thresholds per consumer group and topic:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;# Alertmanager rule for consumer lag
groups:
  - name: kafka_consumer_lag
    rules:
      - alert: KafkaConsumerGroupHighLag
        expr: kafka_consumergroup_lag{consumergroup=&amp;quot;analytics-events-processor&amp;quot;} &amp;gt; 100000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: &amp;quot;Consumer group {{ $labels.consumergroup }} is lagging&amp;quot;
          description: &amp;quot;Lag of {{ $value }} messages on topic {{ $labels.topic }}, partition {{ $labels.partition }}&amp;quot;

      - alert: KafkaConsumerGroupCriticalLag
        expr: kafka_consumergroup_lag{consumergroup=&amp;quot;analytics-events-processor&amp;quot;} &amp;gt; 1000000
        for: 5m
        labels:
          severity: critical
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Lag-in-seconds vs. lag-in-messages:&lt;/strong&gt; Message count lag is misleading when message sizes vary significantly. A lag of 100,000 small metrics events is very different from a lag of 100,000 10-KB transaction records. When possible, combine message lag with throughput metrics to estimate time-to-recovery:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def estimate_catchup_time(current_lag_messages, consumer_throughput_msgs_per_sec, producer_throughput_msgs_per_sec):
    &amp;quot;&amp;quot;&amp;quot;Estimate time for a consumer to catch up with a given lag.&amp;quot;&amp;quot;&amp;quot;
    net_catchup_rate = consumer_throughput_msgs_per_sec - producer_throughput_msgs_per_sec
    if net_catchup_rate &amp;lt;= 0:
        return float(&apos;inf&apos;)  # Consumer can&apos;t catch up at current rates
    return current_lag_messages / net_catchup_rate  # Returns seconds to catch up
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If &lt;code&gt;estimate_catchup_time()&lt;/code&gt; returns infinity (the consumer isn&apos;t keeping up with the producer even without the backlog), the issue isn&apos;t lag, it&apos;s consumer throughput. Adding more consumer instances or optimizing the processing logic is the correct intervention, not simply monitoring the lag number.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Kafka and the Streaming Lakehouse Stack&lt;/h2&gt;
&lt;p&gt;Kafka rarely operates in isolation. It&apos;s typically one component in a streaming data pipeline that ends in a lakehouse: Kafka → Flink/Spark Structured Streaming → Iceberg tables → BI and ML workloads.&lt;/p&gt;
&lt;p&gt;The reliability properties at each layer interact. Kafka provides ordered, persistent event streams with configurable retention. Flink provides exactly-once stateful processing with Kafka offset checkpointing. Iceberg provides ACID table commits with snapshot isolation. Together, these three systems provide an end-to-end exactly-once guarantee from source events to lakehouse tables.&lt;/p&gt;
&lt;p&gt;Understanding which component is the bottleneck at each scale level helps teams debug and optimize:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Kafka throughput limited:&lt;/strong&gt; Add partitions, scale brokers, tune producer batching&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flink processing limited:&lt;/strong&gt; Scale task slots, parallelize operators, optimize state backends&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Iceberg commit limited:&lt;/strong&gt; Increase checkpoint intervals, reduce micro-batch frequency, tune file sizes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;S3/GCS I/O limited:&lt;/strong&gt; Use multipart uploads, tune buffer sizes, consider S3 Tables for managed I/O&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Operational observability across this entire stack is what OpenLineage (for lineage) and Prometheus/Grafana (for metrics) enable, providing a unified view of where the streaming pipeline stands at any given moment.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Go Deeper on Streaming Data Architecture&lt;/h3&gt;
&lt;p&gt;For a comprehensive treatment of streaming lakehouses, open table formats, and real-time pipelines, pick up &lt;a href=&quot;https://www.amazon.com/dp/B0GQNY21TD&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI: A Hands-On Practitioner&apos;s Guide to Modern Data Architecture, Open Table Formats, and Agentic AI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Browse Alex&apos;s other data engineering and analytics books at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To query Kafka-sourced Iceberg tables with sub-second performance and automated query acceleration, try Dremio Cloud free at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Lance and Iceberg for Multimodal AI Data</title><link>https://iceberglakehouse.com/posts/2026-05-24-lance-iceberg-multimodal/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-24-lance-iceberg-multimodal/</guid><description>
Apache Iceberg was designed for analytical workloads: columnar scans, partition pruning, SQL aggregations. It&apos;s excellent at returning the answer to ...</description><pubDate>Sun, 24 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Apache Iceberg was designed for analytical workloads: columnar scans, partition pruning, SQL aggregations. It&apos;s excellent at returning the answer to &amp;quot;what was the average revenue by region for the last 30 days?&amp;quot; and poor at answering &amp;quot;give me the 500 training images most similar to this query image.&amp;quot;&lt;/p&gt;
&lt;p&gt;The second question is random access retrieval from an embedding index, a fundamentally different access pattern. Columnar storage optimized for scan performance is inefficient for retrieving arbitrary rows by vector similarity. Iceberg tables store Parquet files, and Parquet files are optimized for column projection and predicate pushdown, not random row access.&lt;/p&gt;
&lt;p&gt;This is where LanceDB and the Lance format fill a gap. Lance is a columnar format designed for both scan-efficient analytics (like Parquet) and random-access retrieval (unlike Parquet). It builds IVF-PQ vector indexes natively on disk, without requiring vectors to fit in RAM. Combined with Iceberg for structured metadata and SQL analytics, Lance enables a complete multimodal AI data architecture.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Two Patterns That Don&apos;t Fit Together&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Scan-heavy analytical queries&lt;/strong&gt; (aggregations, group-bys, time-window analytics, joins), are what Iceberg is built for. The underlying Parquet files store column values contiguously, enabling vectorized scan operations that process entire columns in cache-friendly chunks. Partition pruning eliminates entire file groups based on metadata. Predicate pushdown moves filters into the file reading layer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Random-access retrieval&lt;/strong&gt; (fetching 500 specific rows from a 10-million-row dataset based on vector similarity), breaks the columnar scan model. To retrieve a specific row in a Parquet file, you must read at minimum the row group containing that row, even if you only want one record. At scale, random access across an Iceberg table degrades into many expensive small reads.&lt;/p&gt;
&lt;p&gt;ML training workloads require random access at scale: sample 256 images from 10 million for a training batch, read specific samples from disk without materializing the full dataset in memory, and iterate over shuffled samples across epochs without loading everything into RAM.&lt;/p&gt;
&lt;p&gt;The Lance format was designed for exactly this workload. Its on-disk layout supports random access to individual rows with low read amplification. Combined with its IVF-PQ vector index (which is disk-native and doesn&apos;t require vectors to be in RAM) Lance is the format of choice for embedding storage and training data retrieval.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Complementary Architecture&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/lance-iceberg-multimodal/lance-iceberg-multimodal-architecture.png&quot; alt=&quot;Lance and Iceberg multimodal AI lakehouse architecture showing object store for raw blobs, Iceberg tables for structured metadata with SQL analytics, and LanceDB for embeddings with vector search, all fed by a multimodal ingestion pipeline using CLIP, Whisper, and custom encoders&quot;&gt;&lt;/p&gt;
&lt;p&gt;The production architecture uses Iceberg and Lance together:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Iceberg tables&lt;/strong&gt; hold structured metadata about each media asset: content ID, source URL, creation timestamp, labels, annotation status, split assignment (train/val/test), and any tabular features. SQL queries against this metadata are fast: find all training images from source X with label Y added after date Z, joining against annotation tables.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LanceDB / Lance tables&lt;/strong&gt; hold the embeddings and enable vector retrieval: given a query image embedding, find the 50 most semantically similar training examples. The Lance table stores the embedding alongside a pointer to the object store location (S3 URL or file path) so the actual image bytes can be fetched directly after retrieval.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Object store (S3/GCS/ABS)&lt;/strong&gt; holds the raw media files. Neither Iceberg nor Lance tries to store raw images or video; object storage is the right layer for blobs. Both table formats store references to the object store.&lt;/p&gt;
&lt;p&gt;The multimodal ingestion pipeline ties these together: when new media arrives, it gets stored in object storage, its embedding is computed (CLIP for images and video, Whisper for audio), the embedding is written to the Lance table, and the structured metadata is written to the Iceberg table.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Working with Lance and DuckDB&lt;/h2&gt;
&lt;p&gt;LanceDB integrates with DuckDB, allowing SQL queries against Lance tables. This enables joining Lance embedding data with Iceberg metadata without separate ETL:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import lancedb
import duckdb

# Connect to LanceDB
db = lancedb.connect(&amp;quot;s3://my-bucket/lancedb/&amp;quot;)
images_table = db.open_table(&amp;quot;training_images&amp;quot;)

# Query: find similar images to a reference image
similar_images = (
    images_table.search(reference_embedding)
    .metric(&amp;quot;cosine&amp;quot;)
    .limit(100)
    .to_pandas()
)

# Join with Iceberg metadata via DuckDB for filtered retrieval
conn = duckdb.connect()
conn.execute(&amp;quot;INSTALL iceberg; LOAD iceberg;&amp;quot;)

# Register the pandas DataFrame (from Lance) for DuckDB querying
conn.register(&amp;quot;similar_images&amp;quot;, similar_images)

# Join with Iceberg metadata to filter by label and split
annotated_similar = conn.execute(&amp;quot;&amp;quot;&amp;quot;
    SELECT s.content_id, s.s3_uri, m.label, m.split
    FROM similar_images s
    JOIN iceberg_scan(&apos;s3://my-bucket/iceberg/image_metadata/&apos;) m
        ON s.content_id = m.content_id
    WHERE m.label IN (&apos;cat&apos;, &apos;dog&apos;)
      AND m.split = &apos;train&apos;
&amp;quot;&amp;quot;&amp;quot;).fetchdf()
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h2&gt;Workload Fit: When to Use Each&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/lance-iceberg-multimodal/lance-vs-iceberg-workload-fit-matrix.png&quot; alt=&quot;Lance vs Iceberg workload fit matrix comparing analytical SQL queries, training data retrieval, embedding/vector search, and structured+unstructured joins across both formats&quot;&gt;&lt;/p&gt;
&lt;p&gt;The choice between Lance and Iceberg is not either/or. The complementary architecture uses both:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SQL analytics on metadata:&lt;/strong&gt; Iceberg&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Random-access training sample retrieval:&lt;/strong&gt; Lance&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Similarity search / nearest neighbor:&lt;/strong&gt; Lance (with IVF-PQ index)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data quality monitoring, annotation tracking:&lt;/strong&gt; Iceberg&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cross-format queries:&lt;/strong&gt; DuckDB joining both&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The operational complexity of running both formats is lower than it appears. Lance tables can be co-located in S3 alongside Iceberg tables. Both use object storage as the persistence layer. Catalog management for Lance tables can use LanceDB&apos;s own catalog API or integrate with Polaris/Nessie for unified catalog visibility.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;LanceDB: Beyond Embeddings&lt;/h2&gt;
&lt;p&gt;LanceDB&apos;s longer-term positioning is as an AI-native multimodal lakehouse, not just an embedding store. It supports storing raw blobs alongside vectors in the same table, enabling truly unified storage for AI datasets.&lt;/p&gt;
&lt;p&gt;In this model, a Lance table for a vision model training dataset might store: &lt;code&gt;image_bytes&lt;/code&gt; (raw PNG/JPEG), &lt;code&gt;embedding&lt;/code&gt; (1536-dim float vector), &lt;code&gt;label&lt;/code&gt;, &lt;code&gt;source_id&lt;/code&gt;, and &lt;code&gt;created_at&lt;/code&gt;. Retrieval is a single operation that returns both the embedding neighborhood and the raw image bytes, without an additional object storage fetch.&lt;/p&gt;
&lt;p&gt;For large-scale training datasets, this architecture offers better cache locality and simpler pipeline management than the separate Iceberg + object store + Lance design, at the cost of storing raw bytes in the table format rather than object storage.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Apache Iceberg and LanceDB/Lance serve different access patterns in a multimodal AI lakehouse. Iceberg handles structured analytics, SQL governance, and metadata management. Lance handles random-access retrieval, vector search, and training data pipelines. The optimal architecture uses both, with Iceberg and Lance tables co-located in object storage and DuckDB providing the SQL bridge between them.&lt;/p&gt;
&lt;p&gt;For teams building AI training infrastructure in 2026, defaulting to &amp;quot;Iceberg for everything&amp;quot; creates unnecessary performance bottlenecks in the training data retrieval path. Adding Lance tables for embedding and blob storage is low-friction and high-impact.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Building the Multimodal Ingestion Pipeline&lt;/h2&gt;
&lt;p&gt;The ingestion pipeline that populates both Iceberg and Lance tables needs to handle several concerns simultaneously:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Blob storage:&lt;/strong&gt; Upload raw media (images, audio, video) to object storage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Embedding computation:&lt;/strong&gt; Run the media through the appropriate encoder (CLIP for vision, Whisper for audio) to produce vector embeddings&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lance write:&lt;/strong&gt; Append the embedding and metadata to the Lance table&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Iceberg write:&lt;/strong&gt; Append structured metadata (labels, source, split, timestamps) to the Iceberg table&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Ensure that a content_id written to Lance is also written to Iceberg in the same ingestion batch&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import lancedb
import pyarrow as pa
import open_clip
import torch
from pyiceberg.catalog import load_catalog

def ingest_image_batch(image_paths: list[str], labels: list[str]):
    &amp;quot;&amp;quot;&amp;quot;
    Ingest a batch of images into the multimodal AI lakehouse.
    Writes embeddings to Lance and metadata to Iceberg.
    &amp;quot;&amp;quot;&amp;quot;
    # Load CLIP model for embedding computation
    model, _, preprocess = open_clip.create_model_and_transforms(&amp;quot;ViT-B-32&amp;quot;)

    # Compute embeddings
    embeddings = []
    content_ids = []
    for path in image_paths:
        image = preprocess(Image.open(path)).unsqueeze(0)
        with torch.no_grad():
            embedding = model.encode_image(image)
        content_id = compute_content_hash(path)
        embeddings.append(embedding.numpy().flatten())
        content_ids.append(content_id)

    # Upload to object storage and get S3 URIs
    s3_uris = upload_to_s3(image_paths, content_ids)

    # Write to Lance table
    lance_db = lancedb.connect(&amp;quot;s3://ai-lake/lancedb/&amp;quot;)
    lance_table = lance_db.open_table(&amp;quot;training_images&amp;quot;)
    lance_records = [
        {&amp;quot;content_id&amp;quot;: cid, &amp;quot;embedding&amp;quot;: emb, &amp;quot;s3_uri&amp;quot;: uri}
        for cid, emb, uri in zip(content_ids, embeddings, s3_uris)
    ]
    lance_table.add(lance_records)

    # Write metadata to Iceberg
    catalog = load_catalog(&amp;quot;polaris&amp;quot;, **{&amp;quot;uri&amp;quot;: &amp;quot;https://catalog.example.com&amp;quot;})
    iceberg_table = catalog.load_table(&amp;quot;ai_datasets.image_metadata&amp;quot;)

    metadata_records = pa.table({
        &amp;quot;content_id&amp;quot;: content_ids,
        &amp;quot;s3_uri&amp;quot;: s3_uris,
        &amp;quot;label&amp;quot;: labels,
        &amp;quot;split&amp;quot;: assign_split(content_ids),  # train/val/test assignment
        &amp;quot;ingested_at&amp;quot;: [datetime.utcnow().isoformat()] * len(content_ids),
        &amp;quot;embedding_model&amp;quot;: [&amp;quot;ViT-B-32&amp;quot;] * len(content_ids)
    })
    iceberg_table.append(metadata_records)
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h2&gt;Versioning Training Datasets&lt;/h2&gt;
&lt;p&gt;One of the critical properties for reproducible ML training is dataset versioning: the ability to recreate the exact training set used for a specific model version. Iceberg provides this naturally through its snapshot mechanism.&lt;/p&gt;
&lt;p&gt;When you&apos;re ready to lock a training dataset for a specific model run, record the Iceberg snapshot ID:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import mlflow
from pyiceberg.catalog import load_catalog

catalog = load_catalog(&amp;quot;polaris&amp;quot;, **{&amp;quot;uri&amp;quot;: &amp;quot;https://catalog.example.com&amp;quot;})
iceberg_table = catalog.load_table(&amp;quot;ai_datasets.image_metadata&amp;quot;)

# Record current snapshot ID before training
current_snapshot = iceberg_table.current_snapshot()
snapshot_id = current_snapshot.snapshot_id

# Log to MLflow for training reproducibility
with mlflow.start_run() as run:
    mlflow.log_param(&amp;quot;training_dataset_table&amp;quot;, &amp;quot;ai_datasets.image_metadata&amp;quot;)
    mlflow.log_param(&amp;quot;training_dataset_snapshot_id&amp;quot;, snapshot_id)

    # Load training data from the specific snapshot for reproducibility
    training_metadata = iceberg_table.scan(snapshot_id=snapshot_id).to_arrow()
    training_content_ids = training_metadata[&amp;quot;content_id&amp;quot;].to_pylist()

    # Retrieve embeddings from Lance using the content IDs
    lance_db = lancedb.connect(&amp;quot;s3://ai-lake/lancedb/&amp;quot;)
    lance_table = lance_db.open_table(&amp;quot;training_images&amp;quot;)

    # Filter Lance table to only the content IDs in the Iceberg snapshot
    training_data = lance_table.search() \
        .where(f&amp;quot;content_id IN {tuple(training_content_ids[:100])}&amp;quot;) \
        .to_pandas()

    # Train model
    model = train_vision_model(training_data)
    mlflow.pytorch.log_model(model, &amp;quot;model&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Six months later, a team investigating why model v5 had better performance than model v7 can retrieve the exact training data composition for each run using the recorded snapshot IDs.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Fine-Tuning Workflow Patterns&lt;/h2&gt;
&lt;p&gt;The Iceberg + Lance architecture particularly shines for fine-tuning workflows, where you start from a pretrained model and adapt it to a specific domain using a curated subset of your training data.&lt;/p&gt;
&lt;p&gt;The fine-tuning dataset selection query uses Iceberg&apos;s SQL capabilities:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;```sql
-- Select high-quality fine-tuning examples from Iceberg metadata
SELECT content_id, s3_uri, label
FROM iceberg.ai_datasets.image_metadata
WHERE label IN (&apos;product_photo&apos;, &apos;lifestyle_photo&apos;)
  AND annotation_quality_score &amp;gt;= 4, Expert-annotated examples only
  AND split = &apos;train&apos;
  AND ingested_at &amp;gt;= &apos;2024-01-01&apos;, Recent, high-quality additions only
LIMIT 50000;
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
The query results identify which content IDs to retrieve from Lance for embedding-based curriculum learning (training on the hardest examples first, then easy examples), or for diverse sampling (using vector clustering in Lance to ensure diverse coverage of the fine-tuning distribution).

This SQL-to-Lance bridge (using Iceberg SQL to select training example metadata, then using Lance vector retrieval to access the embedding and raw data), is the core pattern of a multimodal fine-tuning pipeline that doesn&apos;t require loading tens of millions of embeddings into memory.

---

## LanceDB in Production: Cloud and Self-Hosted Options

LanceDB operates in two deployment modes. The embedded mode runs the entire database in-process, no separate server, no network overhead. This is the mode used in the code examples throughout this post and is appropriate for single-machine workloads like a model training server or a batch embedding pipeline.

For production systems that need shared access from multiple processes or distributed environments, LanceDB Cloud provides a managed serverless option. The client API is identical to the embedded mode; you point the connection URI at the cloud endpoint instead of a local path:

```python
import lancedb

# Embedded mode (local development, training nodes)
db = lancedb.connect(&amp;quot;s3://my-bucket/lancedb/&amp;quot;)

# LanceDB Cloud (shared multi-process access)
db = lancedb.connect(
    &amp;quot;db://my-org-name&amp;quot;,
    api_key=&amp;quot;lancedb_api_key_here&amp;quot;,
    region=&amp;quot;us-east-1&amp;quot;
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The operational difference is significant for ML infrastructure. In embedded mode on S3, multiple training jobs reading the same Lance table simultaneously can conflict on file access. LanceDB Cloud provides the coordination layer that makes concurrent read and write safe. For training pipelines where several GPU nodes read from the same embedding store during distributed training, LanceDB Cloud is the appropriate deployment target.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Lance vs. Dedicated Vector Databases&lt;/h2&gt;
&lt;p&gt;Teams evaluating the Lance/Iceberg combination often ask how it compares to dedicated vector databases like Pinecone, Weaviate, or Qdrant. The comparison depends entirely on what you&apos;re optimizing for.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dedicated vector databases&lt;/strong&gt; (Pinecone, Qdrant, Weaviate) are built specifically for vector similarity search and optimize aggressively for low-latency single-vector retrieval. They typically offer hosted APIs, built-in metadata filtering, and management dashboards that reduce operational overhead. For production RAG systems where the primary workload is real-time question answering with sub-100ms retrieval latency requirements, dedicated vector databases have proven operational track records.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LanceDB&lt;/strong&gt; optimizes for the training data use case. Its columnar storage model, Arrow-native memory format, and S3-compatible storage make it efficient for batch retrieval patterns, retrieving thousands to millions of embeddings at once for training, evaluation, or similarity analysis. The trade-off is that real-time query latency is not its primary design target.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The practical decision rule:&lt;/strong&gt; If your primary use case is serving a production chatbot or search API where individual queries need sub-50ms vector lookup, a dedicated vector database or managed vector service (Vertex AI Matching Engine, Azure AI Search vector fields) is the operationally simpler choice. If your primary use case is training data management, embedding storage at scale, and dataset versioning for model development, LanceDB&apos;s native integration with the Python ML ecosystem and its Arrow-based columnar model make it the better fit.&lt;/p&gt;
&lt;p&gt;Many organizations end up using both: a dedicated vector database for production retrieval serving and LanceDB or Lance files for training data management. These aren&apos;t competing choices; they serve different points in the ML lifecycle.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Versioning Training Datasets with Lance and Iceberg Snapshots&lt;/h2&gt;
&lt;p&gt;One of the most valuable operational features of the Lance/Iceberg combination for ML teams is reproducible dataset versioning. Reproducibility in model training requires being able to reconstruct the exact dataset used to train any given model version.&lt;/p&gt;
&lt;p&gt;Iceberg&apos;s snapshot ID is the natural version anchor. Every model training run can record the Iceberg snapshot ID of each table it queried at training start:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import mlflow
from pyiceberg.catalog import load_catalog

catalog = load_catalog(&amp;quot;polaris&amp;quot;, **catalog_config)
annotations = catalog.load_table(&amp;quot;training.multimodal_annotations&amp;quot;)

# Record the snapshot ID used for this training run
training_snapshot_id = annotations.current_snapshot().snapshot_id
mlflow.log_param(&amp;quot;iceberg_snapshot_id&amp;quot;, training_snapshot_id)

# Dataset construction proceeds from this specific snapshot
selected_ids = spark.read.format(&amp;quot;iceberg&amp;quot;) \
    .option(&amp;quot;snapshot-id&amp;quot;, training_snapshot_id) \
    .table(&amp;quot;training.multimodal_annotations&amp;quot;) \
    .filter(&amp;quot;annotation_quality_score &amp;gt;= 4 AND split = &apos;train&apos;&amp;quot;) \
    .select(&amp;quot;content_id&amp;quot;) \
    .collect()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Six months later, when a production model regression is reported, the training team can load the same snapshot and reconstruct the exact training set that produced the model, enabling them to compare against the current data distribution and identify what changed.&lt;/p&gt;
&lt;p&gt;Lance files are versioned implicitly through their S3 paths and the LanceDB table versions. Recording both the Iceberg snapshot ID and the LanceDB table version in the experiment metadata creates a complete, reproducible reference to the training dataset.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Go Deeper on AI-Native Data Architecture&lt;/h3&gt;
&lt;p&gt;For comprehensive guidance on lakehouse architecture for AI workloads, open table formats, and agentic AI integration, pick up &lt;a href=&quot;https://www.amazon.com/dp/B0GQNY21TD&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI: A Hands-On Practitioner&apos;s Guide to Modern Data Architecture, Open Table Formats, and Agentic AI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Browse Alex&apos;s other data engineering and analytics books at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Dremio provides unified SQL access to your Iceberg lakehouse with query acceleration and multi-engine governance. Try it free at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Bringing MLflow and Data Pipelines Closer Together</title><link>https://iceberglakehouse.com/posts/2026-05-24-mlflow-data-pipelines/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-24-mlflow-data-pipelines/</guid><description>
The boundary between data engineering and ML engineering has always been somewhat artificial. A model degrades in production. Is it a model problem? ...</description><pubDate>Sun, 24 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The boundary between data engineering and ML engineering has always been somewhat artificial. A model degrades in production. Is it a model problem? The data feeding it changed. Is it a data pipeline problem? The features it receives don&apos;t match what it was trained on. Is it a feature store problem? These questions point to the same underlying issue: the observability tools for data pipelines and the observability tools for ML models are separate, making cross-boundary diagnosis difficult.&lt;/p&gt;
&lt;p&gt;MLflow 3, released in 2025, moved toward addressing this by expanding its scope beyond experiment tracking into GenAI tracing, agent evaluation, and closer integration with data quality monitoring. Databricks&apos; Data Quality Monitoring feature provides a framework for applying model-style monitoring (drift detection, statistical distribution tracking), to datasets and pipeline outputs, not just model inference results.&lt;/p&gt;
&lt;p&gt;Together, these capabilities push toward a vision where data lineage, feature freshness, model performance, and inference quality are visible through a single observability surface rather than four separate tools.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;MLflow 3: What Changed&lt;/h2&gt;
&lt;p&gt;MLflow&apos;s original value proposition was experiment tracking for classic ML: log parameters, metrics, and artifacts for each training run, compare runs across experiments, promote the best run to a registered model. This remains the core, and MLflow 3 doesn&apos;t break it.&lt;/p&gt;
&lt;p&gt;What MLflow 3 adds is a significantly expanded observability scope:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;GenAI tracing.&lt;/strong&gt; MLflow&apos;s &lt;code&gt;mlflow.tracing&lt;/code&gt; API captures the full execution trace of LLM calls, including prompts, completions, tool invocations, and latency at each step. For RAG pipelines and multi-agent systems, tracing shows exactly which retrieval steps and which LLM calls contributed to a final response:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import mlflow

# Enable automatic tracing for LangChain or LlamaIndex
mlflow.langchain.autolog()

# Or manual tracing for custom pipelines
with mlflow.start_span(name=&amp;quot;document_retrieval&amp;quot;) as span:
    docs = vector_store.similarity_search(query, k=5)
    span.set_attribute(&amp;quot;num_docs_retrieved&amp;quot;, len(docs))
    span.set_attribute(&amp;quot;query&amp;quot;, query)

with mlflow.start_span(name=&amp;quot;llm_generation&amp;quot;) as span:
    response = llm.invoke(prompt)
    span.set_attribute(&amp;quot;model&amp;quot;, &amp;quot;gpt-4o&amp;quot;)
    span.set_attribute(&amp;quot;input_tokens&amp;quot;, response.usage.prompt_tokens)
    span.set_attribute(&amp;quot;output_tokens&amp;quot;, response.usage.completion_tokens)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Agent evaluation.&lt;/strong&gt; MLflow provides &lt;code&gt;mlflow.evaluate()&lt;/code&gt; with built-in metrics for RAG and agent workflows: answer relevance, faithfulness (does the answer reflect the retrieved context?), context recall, and hallucination detection. This brings the same experiment comparison discipline that works for classical ML metrics to GenAI quality evaluation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dataset tracking.&lt;/strong&gt; MLflow 3 extends the &lt;code&gt;mlflow.log_dataset()&lt;/code&gt; API to record not just the name and version of training datasets, but their statistical properties: row counts, column distributions, null rates. This creates a traceable link from training data quality to model performance.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Training Data Lineage in Practice&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/mlflow-data-pipelines/mlflow3-training-to-inference-lineage.png&quot; alt=&quot;MLflow 3 training data to inference lineage diagram showing Iceberg training features table to XGBoost model training tracked by MLflow run, to registered model, to production inference endpoint, with MLflow tracing layer below showing data quality alerts, experiment comparison, A/B testing, and drift detection&quot;&gt;&lt;/p&gt;
&lt;p&gt;The data pipeline integration starts at training time. When a model training run references a specific version of a training dataset, that reference should be recorded in MLflow alongside the model parameters and metrics. MLflow&apos;s dataset logging API connects to data sources including Iceberg tables, Delta tables, and pandas DataFrames:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import mlflow
import pandas as pd

with mlflow.start_run(run_name=&amp;quot;churn_v12_training&amp;quot;):
    # Log the training dataset with metadata
    training_data = load_from_iceberg(&amp;quot;training_features&amp;quot;, snapshot_id=102345)

    dataset = mlflow.data.from_pandas(
        training_data,
        source=&amp;quot;s3://data-lake/iceberg/training_features/&amp;quot;,
        name=&amp;quot;training_features&amp;quot;,
        targets=&amp;quot;is_churned&amp;quot;
    )
    mlflow.log_input(dataset, context=&amp;quot;training&amp;quot;)

    # Train model
    model = train_xgboost(training_data)

    # Log metrics and model
    mlflow.log_metric(&amp;quot;auc&amp;quot;, evaluate_auc(model, validation_data))
    mlflow.xgboost.log_model(model, &amp;quot;model&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now when investigating a model degradation, you can trace the MLflow run for the current production model, check which dataset snapshot it was trained on, compare the statistical profile of that snapshot against the current training features table, and identify whether the training distribution has drifted.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Data Quality Monitoring for ML Inputs&lt;/h2&gt;
&lt;p&gt;Databricks&apos; Data Quality Monitoring applies model-style monitoring to datasets: statistical distribution tracking across time windows, drift detection against a baseline, and alerting when distributions change beyond a threshold.&lt;/p&gt;
&lt;p&gt;For ML platform teams, this means data quality metrics and model performance metrics can be tracked in the same observability framework. A drop in model AUC correlates with a detected distribution drift in the &lt;code&gt;user_recency&lt;/code&gt; feature, the data quality monitor fired three days before the model quality dropped, which is the right time to investigate and retrain.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Databricks Data Quality Monitor configuration (Python SDK)
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import MonitorMetric, MonitorSpec

client = WorkspaceClient()

# Create a data quality monitor for a feature table
client.quality_monitors.create(
    table_name=&amp;quot;prod_catalog.features.user_activity_features&amp;quot;,
    assets_dir=f&amp;quot;/Shared/monitors/user_activity_features&amp;quot;,
    output_schema_name=&amp;quot;prod_catalog.data_quality_metrics&amp;quot;,
    time_series=MonitorTimeSeries(
        timestamp_col=&amp;quot;feature_timestamp&amp;quot;,
        granularities=[&amp;quot;1 day&amp;quot;]
    ),
    baseline=MonitorBaseline(
        table_name=&amp;quot;prod_catalog.features.user_activity_features_baseline&amp;quot;
    )
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The generated quality metrics (column-level drift scores, null rate changes, distribution summaries), are written to a Unity Catalog table. They can be joined with MLflow experiment data to correlate data quality events with model performance changes.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Unified Observability Architecture&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/mlflow-data-pipelines/mlflow3-unified-observability.png&quot; alt=&quot;MLflow 3 unified ML and AI observability layer showing training data through feature engineering to model training to registered model to inference/GenAI agent, all covered by MLflow 3 observability with data quality monitoring, experiment tracking, model registry, and tracing&quot;&gt;&lt;/p&gt;
&lt;p&gt;The practical implementation connects three systems:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;OpenLineage or Databricks Unity Catalog&lt;/strong&gt; provides dataset-level lineage: which jobs read and wrote which tables, and when.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;MLflow&lt;/strong&gt; provides model-level lineage: which dataset version trained which model, what the metrics were, and what the current production model&apos;s trace looks like.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Quality Monitoring&lt;/strong&gt; provides feature-level drift signals: when the statistical properties of model inputs change.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;An on-call engineer investigating a model quality alert can navigate this chain: check the model&apos;s MLflow run to see which dataset snapshot it was trained on → check the data quality monitor to see if the current feature distribution matches the training distribution → check lineage to find which upstream jobs modified the features → check the Airflow run history to find the failing job.&lt;/p&gt;
&lt;p&gt;This investigation path is possible today with a combination of tools. The direction of both MLflow 3 and Databricks&apos; monitoring features is to reduce the manual connection between these layers.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Distinguishing Model Drift from Data Drift&lt;/h2&gt;
&lt;p&gt;One of the most practically useful applications of unified observability is separating model drift (the model&apos;s predictions are degrading because the model itself is outdated) from data drift (the data feeding the model has changed, but the model&apos;s logic would still work correctly if the data were as expected).&lt;/p&gt;
&lt;p&gt;Without integrated observability, every model performance degradation alert requires the same investigation: pull the production predictions, compare against ground truth labels (where available), check feature distributions, check data pipeline logs. This process takes hours even for experienced teams.&lt;/p&gt;
&lt;p&gt;With unified observability:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data drift detection fires first.&lt;/strong&gt; The data quality monitor detects that &lt;code&gt;user_session_count_7d&lt;/code&gt; has a mean of 2.1 versus the training baseline of 4.3. This statistical anomaly is logged three days before model performance begins to degrade visibly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Lineage traces the cause.&lt;/strong&gt; The lineage graph shows that &lt;code&gt;user_session_count_7d&lt;/code&gt; is computed from the &lt;code&gt;user_sessions&lt;/code&gt; table, which was modified by a pipeline run on a specific date. The pipeline run log shows a schema migration that silently changed the session window from 7 days to 1 day.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The model is not broken.&lt;/strong&gt; Reverting the session window computation restores the feature to expected distribution, and model performance recovers. A full model retrain was not necessary.&lt;/p&gt;
&lt;p&gt;This scenario (data pipeline bug causing model degradation that looks like model drift), is common enough that teams without integrated observability routinely retrain models unnecessarily. The retrain is expensive (compute cost, team time, validation process) and doesn&apos;t fix the underlying data pipeline issue.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Example: correlate data quality alerts with model metrics
import mlflow
import pandas as pd

# Load data quality monitor output
drift_metrics = pd.read_parquet(
    &amp;quot;s3://monitoring/data_quality_metrics/user_activity_features/&amp;quot;
)

# Load MLflow experiment runs for the production model
client = mlflow.MlflowClient()
runs = client.search_runs(
    experiment_ids=[&amp;quot;churn_prediction&amp;quot;],
    filter_string=&amp;quot;tags.env = &apos;production&apos;&amp;quot;,
    order_by=[&amp;quot;start_time DESC&amp;quot;]
)

# Check correlation between feature drift and model AUC
for run in runs:
    run_date = pd.Timestamp(run.info.start_time, unit=&amp;quot;ms&amp;quot;).date()
    feature_drift = drift_metrics[
        drift_metrics[&amp;quot;date&amp;quot;] == str(run_date)
    ][&amp;quot;session_count_drift_score&amp;quot;].values

    if len(feature_drift) &amp;gt; 0:
        print(f&amp;quot;Date: {run_date}, AUC: {run.data.metrics.get(&apos;auc&apos;, &apos;N/A&apos;)}, &amp;quot;
              f&amp;quot;Session Count Drift: {feature_drift[0]:.3f}&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h2&gt;CI/CD for ML Pipelines: Where MLflow Fits&lt;/h2&gt;
&lt;p&gt;CI/CD for machine learning requires validation gates that check not just whether code compiles, but whether the trained model meets quality thresholds before it can be promoted to production.&lt;/p&gt;
&lt;p&gt;MLflow&apos;s model registry provides the infrastructure for this gate. A CI pipeline that trains a new model version can use MLflow&apos;s API to check whether the candidate model meets minimum metrics before allowing a production deployment:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import mlflow

def validate_and_promote_model(run_id: str, min_auc: float = 0.90) -&amp;gt; bool:
    &amp;quot;&amp;quot;&amp;quot;
    Validate a model run and promote to production if metrics pass.
    Used in CI/CD pipeline gate.
    &amp;quot;&amp;quot;&amp;quot;
    client = mlflow.MlflowClient()
    run = client.get_run(run_id)

    auc = run.data.metrics.get(&amp;quot;auc&amp;quot;, 0.0)
    if auc &amp;lt; min_auc:
        print(f&amp;quot;FAIL: AUC {auc:.3f} below threshold {min_auc}&amp;quot;)
        return False

    # Check data quality metrics from the training run
    training_dataset = client.get_run(run_id).inputs.dataset_inputs[0]

    # Promote to candidate stage if metrics pass
    model_version = client.create_model_version(
        name=&amp;quot;churn_predictor&amp;quot;,
        source=f&amp;quot;runs:/{run_id}/model&amp;quot;,
        run_id=run_id
    )

    client.transition_model_version_stage(
        name=&amp;quot;churn_predictor&amp;quot;,
        version=model_version.version,
        stage=&amp;quot;Staging&amp;quot;
    )

    print(f&amp;quot;PASS: Model v{model_version.version} promoted to Staging&amp;quot;)
    return True
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The CI pipeline calls &lt;code&gt;validate_and_promote_model()&lt;/code&gt; after each training run. If the model passes, it enters Staging for integration testing. Human approval then promotes it to Production, MLflow&apos;s stage transitions support this workflow directly.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;MLflow with Airflow: Scheduling Training and Monitoring Together&lt;/h2&gt;
&lt;p&gt;Airflow DAGs that combine model training, data quality monitoring, and model validation create a fully automated retraining workflow. When a data quality monitor detects significant feature drift, it can trigger an Airflow DAG that retrains the model on the latest data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.base import BaseSensorOperator
from datetime import datetime, timedelta

def check_feature_drift():
    &amp;quot;&amp;quot;&amp;quot;Check if feature drift exceeds retraining threshold.&amp;quot;&amp;quot;&amp;quot;
    drift_scores = load_drift_metrics(&amp;quot;user_activity_features&amp;quot;, days_back=1)
    max_drift = max(drift_scores.values())
    if max_drift &amp;gt; 0.15:  # PSI threshold for significant drift
        raise ValueError(f&amp;quot;Feature drift detected: {max_drift:.3f}&amp;quot;)

def retrain_model():
    &amp;quot;&amp;quot;&amp;quot;Retrain and log to MLflow.&amp;quot;&amp;quot;&amp;quot;
    with mlflow.start_run(run_name=f&amp;quot;auto_retrain_{datetime.now().date()}&amp;quot;):
        training_data = load_latest_training_data()
        model = train_xgboost(training_data)
        mlflow.log_metric(&amp;quot;auc&amp;quot;, evaluate_auc(model, validation_data))
        mlflow.xgboost.log_model(model, &amp;quot;model&amp;quot;)

with DAG(&amp;quot;ml_retraining_pipeline&amp;quot;, schedule_interval=&amp;quot;@daily&amp;quot;) as dag:
    check_drift = PythonOperator(
        task_id=&amp;quot;check_feature_drift&amp;quot;,
        python_callable=check_feature_drift
    )

    retrain = PythonOperator(
        task_id=&amp;quot;retrain_model&amp;quot;,
        python_callable=retrain_model
    )

    check_drift &amp;gt;&amp;gt; retrain
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This pattern (drift-triggered retraining with MLflow experiment logging), creates a self-maintaining ML system that responds to data pipeline changes without manual intervention from the ML team.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The MLflow Model Registry and Production Deployment&lt;/h2&gt;
&lt;p&gt;MLflow&apos;s Model Registry is where experimentation transitions to production. The registry tracks model versions, their lifecycle stage (Staging, Production, Archived), and the training metadata (runs, datasets, and lineage), associated with each version.&lt;/p&gt;
&lt;p&gt;The lifecycle stage system enables controlled promotions. A data scientist trains a new model version that achieves better performance on validation metrics. They register it in the Model Registry, and it enters the Staging stage. A model review process (which might include automated evaluation against a holdout dataset, human review of the training data and feature set, and comparison against the current production model), gates the promotion to Production.&lt;/p&gt;
&lt;p&gt;For regulated industries, this controlled promotion process with full MLflow run metadata creates the audit trail that compliance teams require: exactly which training data snapshot, which code version, and which hyperparameter configuration produced the model that was promoted to production.&lt;/p&gt;
&lt;p&gt;MLflow&apos;s Model Registry also integrates with feature stores. When a model is registered, the registry can record which feature view and which point-in-time cutoff was used to generate training features. This integration is critical for detecting training-serving skew, if the feature engineering logic changes between training and serving, the model inputs no longer match what the model was trained on, often causing silent performance degradation without triggering obvious errors.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Experiment Tracking at Scale: Managing Model Development Velocity&lt;/h2&gt;
&lt;p&gt;Individual experiment tracking is straightforward. Experiment tracking across a team of 20 data scientists working on multiple concurrent projects requires deliberate organizational design.&lt;/p&gt;
&lt;p&gt;The key decisions for team-scale experiment tracking:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Experiment and run naming conventions.&lt;/strong&gt; MLflow organizes runs within experiments. Without naming conventions, experiments become &amp;quot;Untitled&amp;quot; and runs become &amp;quot;run*1234&amp;quot;, unintelligible after a week. Standardize experiment names as &lt;code&gt;{project}/{model_type}/{feature_set}&lt;/code&gt; and run names as &lt;code&gt;{date}*{developer}\_{brief_description}&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hyperparameter tagging.&lt;/strong&gt; Log all hyperparameters (not just the ones you&apos;re tuning), to enable filtering runs by architecture, optimizer, or data configuration months later. Teams frequently revisit experiments to understand why a particular approach was abandoned. Complete parameter logging makes this retrospective possible.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Model performance baselines.&lt;/strong&gt; Track a &lt;code&gt;baseline_metric&lt;/code&gt; tag on all runs that records the current production model&apos;s performance on the same evaluation set. This makes every experiment run immediately interpretable: is this run better or worse than what&apos;s deployed today?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Artifact storage discipline.&lt;/strong&gt; Every run that crosses a performance threshold logs its serialized model artifact to MLflow&apos;s artifact store. Every run below threshold does not, to avoid artifact store bloat. Define the threshold in a shared config file so the decision is consistent across the team.&lt;/p&gt;
&lt;p&gt;MLflow&apos;s search API enables fleet-level analysis of experiment runs:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import mlflow

# Find all runs from the last 30 days that beat the baseline
runs = mlflow.search_runs(
    experiment_names=[&amp;quot;customer-churn/gradient-boosting/v2-features&amp;quot;],
    filter_string=&amp;quot;metrics.val_auc &amp;gt; 0.89 AND tags.dataset_version = &apos;v2025-05&apos;&amp;quot;,
    order_by=[&amp;quot;metrics.val_auc DESC&amp;quot;],
    max_results=50
)

# Compare against the current production model&apos;s metrics
production_auc = 0.876
candidates = runs[runs[&amp;quot;metrics.val_auc&amp;quot;] &amp;gt; production_auc * 1.02]  # 2% improvement threshold
print(f&amp;quot;Found {len(candidates)} candidates exceeding the promotion threshold&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This programmatic experiment comparison enables automated model evaluation pipelines that trigger promotion reviews when new training runs exceed the promotion threshold.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;MLflow 3&apos;s expansion into GenAI tracing, dataset tracking, and evaluation makes it a more complete observability platform for modern ML systems that blend classical models and generative AI. The integration with data quality monitoring closes the gap between pipeline observability and model observability.&lt;/p&gt;
&lt;p&gt;The practical work for data engineering teams is instrumenting pipelines to emit the lineage and quality signals that make this chain navigable: OpenLineage events from Airflow, Spark, and Flink; data quality monitors on feature tables; and MLflow dataset logging in training pipelines.&lt;/p&gt;
&lt;p&gt;The most valuable operational improvement from integrated observability is faster root cause analysis for model degradations. Distinguishing data drift from model drift, correlating feature distribution changes with model performance drops, and tracing data quality issues back to specific pipeline runs: all of this is possible today with MLflow 3, Databricks Data Quality Monitoring, and OpenLineage. The tools exist and are stable enough for production use.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Build Observable ML Platforms&lt;/h3&gt;
&lt;p&gt;For comprehensive guidance on AI-native data architecture, MLOps, and lakehouse integration, pick up &lt;a href=&quot;https://www.amazon.com/dp/B0GQNY21TD&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI: A Hands-On Practitioner&apos;s Guide to Modern Data Architecture, Open Table Formats, and Agentic AI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Browse Alex&apos;s other data engineering and analytics books at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Dremio provides query-accelerated access to your Iceberg training data and feature tables, reducing the compute cost of feature engineering pipelines. Try it free at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Modern Feature Stores Beyond Batch Pipelines</title><link>https://iceberglakehouse.com/posts/2026-05-24-modern-feature-stores/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-24-modern-feature-stores/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-05-modern-feature-stores/).

The or...</description><pubDate>Sun, 24 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-05-modern-feature-stores/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The original value proposition of a feature store was straightforward: define features once, use them in both training and serving. The feature engineering logic that computed the &lt;code&gt;user_30d_purchase_count&lt;/code&gt; feature for training data would be the same logic that computed it for inference, no more training-serving skew where the model trains on slightly different features than it receives in production.&lt;/p&gt;
&lt;p&gt;That problem is real and important. But the batch-only feature store has a significant limitation: the features it can provide are as fresh as the batch pipeline that computes them. For models that make real-time decisions (fraud detection, recommendation ranking, dynamic pricing), batch features computed every hour or every day are not fresh enough.&lt;/p&gt;
&lt;p&gt;The next generation of feature stores adds streaming feature views: feature computations that run continuously against Kafka or Kinesis event streams, populating an online store with features that are seconds-fresh rather than hours-fresh. Feast, the most widely-used open-source feature store, supports streaming feature views natively, with event sources from Kafka and Kinesis and online store backends including Redis, DynamoDB, and Bigtable.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Two-Store Model&lt;/h2&gt;
&lt;p&gt;The feature store architecture separates two concerns: &lt;strong&gt;historical features for training&lt;/strong&gt; (offline store) and &lt;strong&gt;fresh features for inference&lt;/strong&gt; (online store).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/modern-feature-stores/feast-streaming-feature-view-architecture.png&quot; alt=&quot;Feast streaming feature view architecture showing Kafka and Kinesis event sources feeding Feast Feature Server with streaming feature views and point-in-time correct joins, routing to Redis online store for &amp;lt;10ms real-time serving and Parquet/Iceberg offline store for model training&quot;&gt;&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;offline store&lt;/strong&gt; is a data warehouse or lakehouse table that holds historical feature values. When training a model, you retrieve a training dataset by joining entity keys (user IDs, product IDs) with their historical feature values at the correct point in time, a technique called point-in-time correct joining. This prevents training data leakage, where future feature values contaminate historical training examples.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;online store&lt;/strong&gt; is a low-latency key-value store (Redis, DynamoDB) that holds the most recent feature values for each entity. At inference time, the model server retrieves features by entity key with sub-10ms latency, fast enough for synchronous real-time scoring in a serving API.&lt;/p&gt;
&lt;p&gt;The feature registry is the central registry of feature definitions. The same YAML or Python definition that configures how a feature is computed in the offline store also configures how it&apos;s computed in streaming. This is the mechanism that ensures training-serving consistency: there is one definition of &lt;code&gt;user_30d_purchase_count&lt;/code&gt;, and both stores compute it the same way.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Streaming Feature Views in Feast&lt;/h2&gt;
&lt;p&gt;A streaming feature view in Feast connects a Kafka or Kinesis source to a feature computation that runs continuously:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from feast import Entity, FeatureView, Field, KafkaSource, RedisOnlineStore
from feast.types import Float64, Int64
from datetime import timedelta

# Define an entity (the key for feature lookup)
user = Entity(name=&amp;quot;user_id&amp;quot;, join_keys=[&amp;quot;user_id&amp;quot;])

# Define a streaming source from Kafka
kafka_source = KafkaSource(
    name=&amp;quot;user_actions_kafka&amp;quot;,
    kafka_bootstrap_servers=&amp;quot;localhost:9092&amp;quot;,
    topic=&amp;quot;user_actions&amp;quot;,
    batch_source=FileSource(path=&amp;quot;s3://features/user_actions/&amp;quot;),  # Fallback for training
    message_format=JsonFormat(
        schema_json=&apos;{&amp;quot;user_id&amp;quot;: &amp;quot;string&amp;quot;, &amp;quot;action_type&amp;quot;: &amp;quot;string&amp;quot;, &amp;quot;amount&amp;quot;: &amp;quot;double&amp;quot;}&apos;
    ),
    timestamp_field=&amp;quot;event_timestamp&amp;quot;
)

# Define a streaming feature view with window aggregations
user_activity_fv = FeatureView(
    name=&amp;quot;user_activity_features&amp;quot;,
    entities=[user],
    ttl=timedelta(days=7),
    schema=[
        Field(name=&amp;quot;purchase_count_1h&amp;quot;, dtype=Int64),
        Field(name=&amp;quot;purchase_amount_1h&amp;quot;, dtype=Float64),
        Field(name=&amp;quot;session_count_24h&amp;quot;, dtype=Int64)
    ],
    online=True,  # Materialize to online store
    source=kafka_source,
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The streaming processor computes window aggregations (purchase count in the last hour, purchase amount in the last hour) from the event stream and writes results to the online store continuously. The offline store receives a batch version of the same computation for training data generation.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Training-Serving Skew Problem&lt;/h2&gt;
&lt;p&gt;Training-serving skew is the core problem feature stores solve. Consider a churn prediction model trained with a feature &lt;code&gt;days_since_last_purchase&lt;/code&gt;. During training, this is computed as &lt;code&gt;today - max(purchase_date)&lt;/code&gt; from a historical purchase table. During inference, the same feature might be computed from a different table, with a different filter, or with a different date cutoff, producing a value that&apos;s close to but not exactly what the model was trained on.&lt;/p&gt;
&lt;p&gt;At scale, small inconsistencies in feature computation compound. A model trained with specific feature distributions performs worse in production because the production feature distributions don&apos;t match training. Debugging requires comparing feature computation code across pipeline boundaries; often owned by different teams.&lt;/p&gt;
&lt;p&gt;The feature store eliminates this by centralizing feature definitions. The feature registry holds the canonical computation logic. Training pipelines use the registry to generate training data. Serving infrastructure uses the registry to compute inference features. Both use the same code path for the same feature definition.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Training: retrieve historical features for a training dataset
from feast import FeatureStore

store = FeatureStore(repo_path=&amp;quot;./feature_repo&amp;quot;)

training_df = store.get_historical_features(
    entity_df=pd.DataFrame({
        &amp;quot;user_id&amp;quot;: user_ids,
        &amp;quot;event_timestamp&amp;quot;: training_cutoff_dates
    }),
    features=[
        &amp;quot;user_activity_features:purchase_count_1h&amp;quot;,
        &amp;quot;user_activity_features:session_count_24h&amp;quot;
    ]
).to_df()

# Inference: retrieve online features for real-time scoring
online_features = store.get_online_features(
    features=[
        &amp;quot;user_activity_features:purchase_count_1h&amp;quot;,
        &amp;quot;user_activity_features:session_count_24h&amp;quot;
    ],
    entity_rows=[{&amp;quot;user_id&amp;quot;: user_id}]
).to_dict()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The same feature names, the same registry, the same underlying computation logic, just different execution contexts (historical scan vs online store lookup).&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Modern Feature Store Architecture&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/modern-feature-stores/feature-store-offline-online-architecture.png&quot; alt=&quot;Feature store offline and online architecture with batch features from S3/Iceberg via Spark and real-time features from Kafka via Flink, connected through Feast feature registry for training-serving consistency&quot;&gt;&lt;/p&gt;
&lt;p&gt;The integration of Feast into Kubeflow MLOps pipelines has made feature stores more visible to the broader ML platform community. Kubeflow&apos;s model training and serving components reference Feast as the recommended feature management layer, which means teams building Kubeflow-based ML platforms have a clear integration point for feature governance.&lt;/p&gt;
&lt;p&gt;For data engineering teams, the practical implication is that feature engineering is no longer purely a data pipeline concern. Features that were previously computed ad-hoc in Spark notebooks for training and re-implemented in application code for serving now have a shared definition layer. Data engineering involvement in feature definition and maintenance is expected, not optional.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Using Apache Iceberg as the Feature Offline Store&lt;/h2&gt;
&lt;p&gt;One of the most significant recent developments in feature store architecture is the use of Apache Iceberg tables as the offline store backend. Iceberg brings several properties to offline feature storage that traditional Parquet-on-S3 approaches lack.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Time travel for training data generation.&lt;/strong&gt; Iceberg&apos;s snapshot-based versioning means you can generate training data using a specific snapshot of the feature table, ensuring that the training data snapshot is stable and reproducible even if the feature table continues to be updated.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Schema evolution without migration cost.&lt;/strong&gt; Adding new features to the offline store doesn&apos;t require rewriting existing Parquet files. Iceberg&apos;s schema evolution handles column additions and type widening without data movement.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ACID writes for concurrent feature computation.&lt;/strong&gt; When multiple Spark jobs are writing different feature groups to the same offline store table, Iceberg&apos;s ACID transaction support prevents partial writes and read-after-write inconsistencies.&lt;/p&gt;
&lt;p&gt;Feast supports Iceberg as an offline store backend through its plugin system:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# feast/feature_store.yaml
project: my_feature_store
registry: s3://feature-registry/registry.db
provider: local

offline_store:
    type: feast_iceberg.IcebergOfflineStore
    catalog_name: my_catalog
    catalog_type: rest
    uri: https://polaris.example.com/api/catalog
    warehouse: s3://features/iceberg-warehouse/
    token: &amp;lt;token&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With Iceberg as the offline store, feature retrieval for training uses Iceberg&apos;s predicate pushdown for efficient scan performance, and the feature history is preserved through Iceberg snapshots rather than requiring separate time-partitioned Parquet directories.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Feature Governance and Discovery&lt;/h2&gt;
&lt;p&gt;A feature store without governance becomes a feature graveyard. Features are added for specific models and then abandoned when those models are deprecated. Without ownership tracking, the registry fills with stale feature definitions that nobody maintains.&lt;/p&gt;
&lt;p&gt;Effective feature governance requires:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ownership tracking.&lt;/strong&gt; Every feature view in the registry should have a named owner (an individual or a team), who is accountable for keeping the computation logic current and the feature quality above threshold. Feast&apos;s registry supports tagging feature views with ownership metadata.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Quality thresholds.&lt;/strong&gt; Features have expected statistical properties. A &lt;code&gt;days_since_last_purchase&lt;/code&gt; feature shouldn&apos;t have null rates above 5% for active user entities. Setting expected statistics and alerting when feature distributions shift prevents degraded models from silently serving incorrect predictions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Deprecation workflows.&lt;/strong&gt; When a model that uses a feature is retired, the feature itself might have no remaining consumers. A deprecation workflow that checks consumer count before allowing feature deletion prevents accidental removal of features still used by secondary models.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cross-team discoverability.&lt;/strong&gt; The feature registry is only valuable if teams can find features they need instead of recomputing them. A searchable registry with human-readable descriptions, entity types, and sample values reduces duplicate feature engineering work across teams.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;On-Demand Features and Real-Time Transformations&lt;/h2&gt;
&lt;p&gt;Feast supports on-demand feature views: transformations that are computed at request time rather than pre-computed and stored. This is useful for features that depend on the request context, for example, the distance between a user&apos;s current location and a candidate restaurant in a recommendation system.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from feast import OnDemandFeatureView, Field, RequestSource
from feast.types import Float64
import pandas as pd

# Define request data schema (available at inference time from the request context)
request_source = RequestSource(
    name=&amp;quot;request_features&amp;quot;,
    schema=[
        Field(name=&amp;quot;user_lat&amp;quot;, dtype=Float64),
        Field(name=&amp;quot;user_lon&amp;quot;, dtype=Float64),
    ]
)

@on_demand_feature_view(
    sources=[request_source, restaurant_feature_view],
    schema=[Field(name=&amp;quot;distance_km&amp;quot;, dtype=Float64)]
)
def compute_distance(inputs: pd.DataFrame) -&amp;gt; pd.DataFrame:
    &amp;quot;&amp;quot;&amp;quot;Haversine distance from user location to restaurant.&amp;quot;&amp;quot;&amp;quot;
    R = 6371  # Earth radius in km
    lat1 = inputs[&amp;quot;user_lat&amp;quot;].values
    lon1 = inputs[&amp;quot;user_lon&amp;quot;].values
    lat2 = inputs[&amp;quot;restaurant_lat&amp;quot;].values
    lon2 = inputs[&amp;quot;restaurant_lon&amp;quot;].values

    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (dlat/2).map(lambda x: x**2) + (dlon/2).map(lambda x: x**2)
    # Simplified haversine
    c = 2 * (a**0.5).map(lambda x: min(x, 1.0))
    outputs = pd.DataFrame()
    outputs[&amp;quot;distance_km&amp;quot;] = R * c
    return outputs
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On-demand features combine with pre-computed online features in a single retrieval call, giving model serving both the pre-computed aggregations (purchase history, session count) and the request-time computations (current distance, real-time price delta) in one unified feature vector.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Enterprise Feature Stores: Databricks Feature Store and Vertex AI Feature Store&lt;/h2&gt;
&lt;p&gt;Beyond open-source Feast, major cloud platforms provide managed feature store services with tighter integration into their ML ecosystems.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Databricks Feature Store&lt;/strong&gt; (now part of Unity Catalog) integrates feature storage and governance with the broader Unity Catalog data governance layer. Feature tables are Iceberg tables registered in Unity Catalog, with the same lineage tracking, access control, and discoverability that apply to any other Unity Catalog dataset. The tight integration with Databricks Model Registry means model cards automatically record which feature tables a model depends on.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Vertex AI Feature Store&lt;/strong&gt; on Google Cloud provides online/offline feature serving with BigQuery as the offline store backend and Bigtable for low-latency online serving. Vertex Feature Store 2.0 (released 2024) added streaming ingest directly to the online store via BigQuery continuous queries, reducing the engineering overhead of maintaining separate batch and streaming feature computation pipelines.&lt;/p&gt;
&lt;p&gt;Both managed services trade flexibility for operational simplicity. Teams that are deeply invested in a single cloud provider benefit from the reduced operational overhead. Teams that need cross-cloud model serving, or that use Feast for portability, should weigh the managed service benefits against the coupling to a single provider.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;When to Introduce a Feature Store&lt;/h2&gt;
&lt;p&gt;Feature stores solve real problems, but they add complexity. Teams should consider introducing a feature store when they encounter:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Training-serving skew in production.&lt;/strong&gt; When model performance in production consistently lags behind offline evaluation metrics and the root cause is feature computation inconsistency, a feature store addresses the problem directly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Redundant feature engineering.&lt;/strong&gt; When multiple teams are independently computing the same features (30-day purchase count, days since last login) from the same raw data, centralizing feature computation in a registry reduces duplicate work and ensures consistency.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Real-time model serving requirements.&lt;/strong&gt; When models need sub-second feature retrieval for synchronous API scoring, a feature store with an online store backend (Redis, DynamoDB) provides the access pattern that batch feature pipelines can&apos;t match.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For small teams with a handful of models and no real-time serving requirements, the overhead of running a feature store may not be justified. A well-organized set of dbt models producing feature tables can serve many of the same purposes with less infrastructure complexity.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Feature stores in 2026 have moved from a niche ML platform tool to a standard component in teams that build real-time predictive models. The combination of streaming feature views (for freshness), point-in-time correct historical features (for accurate training data), and a central feature registry (for training-serving consistency) addresses the three most common sources of ML model degradation in production.&lt;/p&gt;
&lt;p&gt;Using Apache Iceberg as the offline store backend brings additional benefits: time travel for reproducible training datasets, schema evolution without migration cost, and ACID writes for concurrent feature computation. The combination of Feast with an Iceberg offline store and Redis online store represents a production-proven architecture for teams that need both training reproducibility and real-time inference performance.&lt;/p&gt;
&lt;p&gt;For data engineering teams, the operational responsibility is maintaining the streaming infrastructure that feeds streaming feature views (Kafka topics with the right schemas, Flink or Feast streaming processors), the batch pipelines that populate the offline store for training data generation, and the governance discipline that keeps the feature registry current and discoverable.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Feature Store Governance: The Metadata Layer&lt;/h2&gt;
&lt;p&gt;A feature store without governance creates a new category of technical debt. As the feature registry grows to hundreds of features defined by dozens of teams, discoverability degrades and duplicate feature definitions accumulate. The solution is treating the feature registry as a governed data catalog, not just a configuration file.&lt;/p&gt;
&lt;p&gt;Effective feature store governance requires:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Feature ownership.&lt;/strong&gt; Every feature view should have a designated owner, a team or individual responsible for its correctness, freshness, and documentation. Ownership is enforced through the registry metadata, not through social convention. When a consumer discovers that a feature is stale, they know immediately who to contact.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Freshness SLAs.&lt;/strong&gt; Features consumed for real-time inference have latency requirements. A feature that should be updated every 5 minutes but hasn&apos;t been updated in 3 hours is serving stale values. Freshness SLAs (defined in the feature view metadata and monitored through platform alerts), catch staleness before it silently degrades model performance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Deprecation workflows.&lt;/strong&gt; Feature registries accumulate technical debt as models are retrained with better features and old feature views become unused. Without deprecation workflows, unused features continue consuming compute resources for their materialization jobs. A deprecation workflow identifies unused features (by tracking which models consume which features), marks them as deprecated with a sunset date, and eventually removes their materialization jobs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cross-team discoverability.&lt;/strong&gt; The feature registry&apos;s primary value proposition is reuse, a fraud detection team&apos;s transaction velocity features might also be valuable for a credit risk team&apos;s model. This reuse only happens if teams can discover what features exist. Investing in feature documentation (including example values, distributions, and known caveats) dramatically improves cross-team reuse rates.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Training-Serving Skew Problem&lt;/h2&gt;
&lt;p&gt;Training-serving skew is one of the most common and most damaging failure modes in production ML systems. It occurs when the feature values used to train a model differ systematically from the feature values served to the model during inference. The result is a model that performs well in evaluation (against held-out training data) but poorly in production (against actual inference-time features).&lt;/p&gt;
&lt;p&gt;The causes of training-serving skew:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Different feature computation logic.&lt;/strong&gt; The offline batch job that computes features for training runs different code than the online service that computes features for inference. A subtle difference (perhaps a different treatment of missing values, or a different time window for an aggregation), produces different values for the same raw input.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Different data sources.&lt;/strong&gt; Training features are computed from the offline store (Iceberg tables, data warehouse). Serving features are read from the online store (Redis, DynamoDB). If the pipelines that populate these two stores have different latency characteristics or different data cleaning logic, the values diverge.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Point-in-time retrieval errors.&lt;/strong&gt; Training requires point-in-time correct feature values, the values that were available at the time of each training example&apos;s label, not the current values. Failing to implement proper point-in-time retrieval introduces future data leakage into training features, causing artificially high training performance that doesn&apos;t generalize.&lt;/p&gt;
&lt;p&gt;Feature stores address training-serving skew by centralizing feature computation in a single pipeline that serves both offline and online stores. When both the training dataset generation and the real-time inference path read features from the same computation logic, the risk of skew from code divergence is eliminated. Point-in-time retrieval is handled natively by the feature store&apos;s training dataset generation API.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Build ML-Ready Data Platforms&lt;/h3&gt;
&lt;p&gt;For comprehensive guidance on AI-native data architecture and agentic ML systems, pick up &lt;a href=&quot;https://www.amazon.com/dp/B0GQNY21TD&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI: A Hands-On Practitioner&apos;s Guide to Modern Data Architecture, Open Table Formats, and Agentic AI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Browse Alex&apos;s other data engineering and analytics books at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For unified lakehouse access to your Iceberg-backed feature offline store with query acceleration, try Dremio Cloud free at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>OpenLineage as the Spine of Data Observability</title><link>https://iceberglakehouse.com/posts/2026-05-24-openlineage-observability/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-24-openlineage-observability/</guid><description>
Data platform incidents follow a predictable pattern. A pipeline fails or a dashboard goes stale. Someone opens Slack and asks which table feeds that...</description><pubDate>Sun, 24 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Data platform incidents follow a predictable pattern. A pipeline fails or a dashboard goes stale. Someone opens Slack and asks which table feeds that dashboard. Someone else checks the Airflow UI and traces it to a Spark job. A third person pulls up the dbt DAG and realizes the issue is three steps upstream in a staging model that reads from an Iceberg table that failed due to a schema change two days ago. The entire investigation takes hours of manual archaeology.&lt;/p&gt;
&lt;p&gt;OpenLineage was built to make this archaeology unnecessary. It provides a standardized API for tools across the data stack (Airflow, Spark, Flink, dbt), to emit structured lineage events as pipelines run. Those events flow to a lineage backend (Marquez, DataHub, or similar) that assembles them into a searchable, queryable dependency graph. When something breaks, the graph shows exactly which downstream assets are affected without requiring a human to trace the dependency tree manually.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;What OpenLineage Actually Is&lt;/h2&gt;
&lt;p&gt;OpenLineage is a specification, not a tool. It defines a JSON event format for describing pipeline runs, the datasets they read, and the datasets they write. Any tool that emits events in this format is an OpenLineage producer. Any backend that ingests and stores these events is a compatible consumer.&lt;/p&gt;
&lt;p&gt;The core event types are:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;RunEvent:&lt;/strong&gt; Records the start, complete, or failure of a pipeline run (a DAG task, a Spark job, a dbt model execution). Contains a unique run ID, a job name, and arrays of input and output datasets.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DatasetEvent:&lt;/strong&gt; Records a change to a dataset outside the context of a specific job, for example, a schema change applied via DDL.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;JobEvent:&lt;/strong&gt; Records changes to a job definition, for example, a DAG being updated in Airflow.&lt;/p&gt;
&lt;p&gt;Each event carries &lt;strong&gt;facets&lt;/strong&gt;, structured metadata payloads that extend the base event with specific information. The &lt;code&gt;SchemaDatasetFacet&lt;/code&gt; records column names and types for a dataset. The &lt;code&gt;DataQualityMetricsInputDatasetFacet&lt;/code&gt; records row counts and null rates for input datasets. The &lt;code&gt;SqlJobFacet&lt;/code&gt; records the SQL query associated with a job. Facets are extensible: you can define custom facets for organization-specific metadata without breaking the standard format.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Integration Across the Stack&lt;/h2&gt;
&lt;p&gt;The value of OpenLineage comes from its coverage. A single tool emitting lineage events gives you partial visibility. When Airflow, Spark, Flink, and dbt all emit events, the resulting graph shows complete end-to-end provenance.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/openlineage-observability/openlineage-integration-ecosystem.png&quot; alt=&quot;OpenLineage integration ecosystem hub-and-spoke showing Apache Airflow, Spark, Flink, dbt, Trino, Great Expectations, and custom producers all connected to central OpenLineage JSON Events API, routing to Marquez, DataHub, and other catalog APIs&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Apache Airflow:&lt;/strong&gt; The &lt;code&gt;apache-airflow-providers-openlineage&lt;/code&gt; package integrates at the operator level. It intercepts task execution events and automatically emits OpenLineage run events for each task in a DAG. It propagates parent run IDs so downstream backends can reconstruct the full orchestration hierarchy.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Apache Spark:&lt;/strong&gt; The Spark integration uses a Spark listener, a JVM agent that intercepts read and write operations at the execution plan level. No code changes to Spark jobs are required. The listener reads the physical plan, identifies input and output datasets, and emits run events with schema facets at job start and completion.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Apache Flink:&lt;/strong&gt; Similar to Spark, the Flink integration operates as an agent that captures streaming lineage without modifying job code. This is particularly valuable for streaming pipelines where the data flow is complex and documentation is often out of date.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;dbt:&lt;/strong&gt; The &lt;code&gt;dbt-ol&lt;/code&gt; integration captures lineage from dbt model executions. Each &lt;code&gt;dbt run&lt;/code&gt; emits events for every model that runs, recording the &lt;code&gt;ref()&lt;/code&gt; and &lt;code&gt;source()&lt;/code&gt; dependencies as dataset relationships, and the compiled SQL as a facet on the job event.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Blast Radius Analysis in Practice&lt;/h2&gt;
&lt;p&gt;The most immediate operational benefit of a live lineage graph is blast radius analysis, understanding what breaks when something changes.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/openlineage-observability/openlineage-blast-radius-lineage-graph.png&quot; alt=&quot;OpenLineage lineage graph showing raw events from S3 and Kafka flowing through Spark ETL to Iceberg tables, then to dbt models and Tableau dashboards, with red AFFECTED highlighting propagating from a column drop in stg_events&quot;&gt;&lt;/p&gt;
&lt;p&gt;A practical scenario: an analyst drops a column from the &lt;code&gt;stg_events&lt;/code&gt; Iceberg table, perhaps cleaning up an obsolete field that was supposed to be unused. With a complete lineage graph, you can run a lineage query against the OpenLineage backend before making the change:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Query Marquez for downstream consumers of stg_events
curl -X GET &amp;quot;http://marquez:5000/api/v1/datasets/my_namespace/stg_events/lineage?depth=3&amp;quot; \
  | jq &apos;.graph | [.[] | select(.type == &amp;quot;DATASET&amp;quot;)] | map(.id)&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The response shows every downstream dataset and job that reads from &lt;code&gt;stg_events&lt;/code&gt;, up to three hops downstream. You discover that &lt;code&gt;fct_sessions&lt;/code&gt;, &lt;code&gt;dashboard_kpis&lt;/code&gt;, and the ML training Airflow DAG all have direct or indirect dependencies on the column you planned to drop. What looked like a safe cleanup is now a breaking change that requires coordinating with three teams before executing.&lt;/p&gt;
&lt;p&gt;This is not a theoretical benefit. In organizations running OpenLineage at scale, the difference between this query and manual archaeology is measured in hours.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The OpenLineage Event Model in Practice&lt;/h2&gt;
&lt;p&gt;Emitting custom OpenLineage events from a Python pipeline:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from openlineage.client import OpenLineageClient
from openlineage.client.run import RunEvent, RunState, Run, Job
from openlineage.client.facet import SchemaDatasetFacet, SchemaField
from openlineage.client.dataset import Dataset
import uuid
from datetime import datetime, timezone

client = OpenLineageClient.from_environment()

# Start event when job begins
run_id = str(uuid.uuid4())
job_name = &amp;quot;my_custom_etl&amp;quot;
namespace = &amp;quot;production&amp;quot;

client.emit(
    RunEvent(
        eventType=RunState.START,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=run_id),
        job=Job(namespace=namespace, name=job_name),
        inputs=[
            Dataset(namespace=namespace, name=&amp;quot;raw_events&amp;quot;,
                   facets={&amp;quot;schema&amp;quot;: SchemaDatasetFacet(
                       fields=[SchemaField(&amp;quot;event_id&amp;quot;, &amp;quot;BIGINT&amp;quot;),
                               SchemaField(&amp;quot;event_type&amp;quot;, &amp;quot;STRING&amp;quot;),
                               SchemaField(&amp;quot;ts&amp;quot;, &amp;quot;TIMESTAMP&amp;quot;)]
                   )})
        ],
        outputs=[
            Dataset(namespace=namespace, name=&amp;quot;stg_events&amp;quot;)
        ]
    )
)
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h2&gt;Backends: Marquez and DataHub&lt;/h2&gt;
&lt;p&gt;Marquez is the reference implementation backend for OpenLineage, an open-source metadata service that provides an API and basic UI for storing and querying lineage events. It&apos;s the easiest path to get started: run it with Docker Compose, point your OpenLineage producers at it, and the lineage graph starts populating immediately.&lt;/p&gt;
&lt;p&gt;DataHub provides a more complete data catalog experience, integrating lineage with search, ownership, documentation, and data quality signals. Its OpenLineage integration allows lineage events to flow into the DataHub graph alongside metadata collected from other sources.&lt;/p&gt;
&lt;p&gt;For organizations that need enterprise governance features alongside lineage (access control, data classification, business glossary) DataHub or commercial platforms like Atlan or Alation that support OpenLineage ingestion are the better fit.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;OpenLineage addresses the observability gap that has made data platform incidents disproportionately expensive to diagnose. A data pipeline fails in ways that aren&apos;t visible to the tools monitoring individual components, Airflow knows the task failed, but doesn&apos;t know what the downstream effects are. Spark knows the job completed, but doesn&apos;t know what business dashboard depends on the table it wrote.&lt;/p&gt;
&lt;p&gt;By standardizing the lineage event format across tools, OpenLineage makes it possible to build a single, authoritative dependency graph that spans the entire platform. That graph makes blast radius analysis instant and root cause investigation tractable without manual archaeology.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Column-Level Lineage: The Next Frontier&lt;/h2&gt;
&lt;p&gt;Dataset-level lineage (&amp;quot;job X reads from table A and writes to table B&amp;quot;) is the baseline. Column-level lineage (&amp;quot;column C in table B is derived from columns D and E in table A through transformation F&amp;quot;) is substantially more powerful for impact analysis.&lt;/p&gt;
&lt;p&gt;Column-level lineage makes it possible to answer: if we change the data type of &lt;code&gt;user_id&lt;/code&gt; in the &lt;code&gt;users&lt;/code&gt; source table from INTEGER to BIGINT, which downstream columns are affected? Without column-level lineage, the answer requires reading every downstream pipeline&apos;s SQL. With column-level lineage in the graph, the query returns the complete affected column set in milliseconds.&lt;/p&gt;
&lt;p&gt;The Spark OpenLineage integration captures column-level lineage from the physical plan for SQL operations, including joins, aggregations, and transformations. The dbt integration captures column-level lineage from the compiled SQL of each model using SQL parsing.&lt;/p&gt;
&lt;p&gt;For custom Python pipelines, column-level lineage requires explicitly declaring column mappings in the &lt;code&gt;ColumnLineageDatasetFacet&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from openlineage.client.facet import (
    ColumnLineageDatasetFacet,
    ColumnLineageDatasetFacetFieldsAdditional,
    ColumnLineageDatasetFacetFieldsAdditionalInputFields
)

# Declare column-level lineage for a custom transformation
column_lineage = ColumnLineageDatasetFacet(
    fields={
        &amp;quot;total_spend&amp;quot;: ColumnLineageDatasetFacetFieldsAdditional(
            inputFields=[
                ColumnLineageDatasetFacetFieldsAdditionalInputFields(
                    namespace=&amp;quot;production&amp;quot;,
                    name=&amp;quot;raw_orders&amp;quot;,
                    field=&amp;quot;order_amount&amp;quot;
                )
            ],
            transformationDescription=&amp;quot;SUM over 90-day window&amp;quot;,
            transformationType=&amp;quot;AGGREGATE&amp;quot;
        ),
        &amp;quot;customer_segment&amp;quot;: ColumnLineageDatasetFacetFieldsAdditional(
            inputFields=[
                ColumnLineageDatasetFacetFieldsAdditionalInputFields(
                    namespace=&amp;quot;production&amp;quot;,
                    name=&amp;quot;raw_orders&amp;quot;,
                    field=&amp;quot;total_spend&amp;quot;
                )
            ],
            transformationDescription=&amp;quot;CASE WHEN segment classification&amp;quot;,
            transformationType=&amp;quot;CONDITIONAL&amp;quot;
        )
    }
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Column-level lineage is more expensive to collect and store than dataset-level lineage, but for high-value data products where understanding transformation provenance is critical, the investment is justified.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;SLA Tracking with OpenLineage&lt;/h2&gt;
&lt;p&gt;Pipeline SLAs define when a dataset should be ready: &amp;quot;The &lt;code&gt;fct_daily_revenue&lt;/code&gt; table must be available by 6 AM UTC for the morning executive dashboard.&amp;quot; Tracking SLA compliance requires knowing when each pipeline run completed and comparing it against the expected completion time.&lt;/p&gt;
&lt;p&gt;OpenLineage events carry completion timestamps that backends can use for SLA monitoring. Marquez&apos;s API supports SLA queries directly:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Check recent runs of a job for SLA compliance
curl &amp;quot;http://marquez:5000/api/v1/jobs/production/fct_daily_revenue_etl/runs?limit=10&amp;quot; | \
  jq &apos;.runs[] | {
    run_id: .id,
    started_at: .startedAt,
    ended_at: .endedAt,
    state: .state,
    duration_minutes: (
      ((.endedAt | fromdateiso8601) - (.startedAt | fromdateiso8601)) / 60 | round
    )
  }&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Building SLA monitoring on top of OpenLineage data requires three components:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;SLA definitions:&lt;/strong&gt; A configuration file or catalog record that defines the expected completion time for each dataset.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Completion detection:&lt;/strong&gt; A listener or scheduled query that checks when the relevant &lt;code&gt;COMPLETE&lt;/code&gt; event was emitted for each monitored job.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alert delivery:&lt;/strong&gt; A notification when a pipeline hasn&apos;t emitted its &lt;code&gt;COMPLETE&lt;/code&gt; event before the SLA deadline.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This approach treats SLA monitoring as a metadata query problem rather than an infrastructure monitoring problem, you&apos;re checking the lineage graph for expected events, not polling health endpoints.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Integrating OpenLineage with Data Quality Tools&lt;/h2&gt;
&lt;p&gt;The data quality facets in the OpenLineage spec allow quality monitoring tools like Great Expectations, Soda, and dbt tests to emit their results alongside pipeline events. This creates an integrated observability surface where you can see not just &amp;quot;the pipeline ran&amp;quot; but &amp;quot;the pipeline ran and the output data met quality expectations.&amp;quot;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Great Expectations integration: emit quality results as OpenLineage facets
from openlineage.client.facet import DataQualityMetricsInputDatasetFacet

quality_facet = DataQualityMetricsInputDatasetFacet(
    rowCount=1_542_783,
    bytes=2_847_291_024,
    columnMetrics={
        &amp;quot;user_id&amp;quot;: {
            &amp;quot;nullCount&amp;quot;: 0,
            &amp;quot;distinctCount&amp;quot;: 1_542_783,
            &amp;quot;quantiles&amp;quot;: {&amp;quot;0.1&amp;quot;: 1000, &amp;quot;0.5&amp;quot;: 500000, &amp;quot;0.9&amp;quot;: 1400000}
        },
        &amp;quot;order_amount&amp;quot;: {
            &amp;quot;nullCount&amp;quot;: 127,
            &amp;quot;min&amp;quot;: 0.01,
            &amp;quot;max&amp;quot;: 49999.99,
            &amp;quot;sum&amp;quot;: 87_293_441.50
        }
    }
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When a data quality check fails, the OpenLineage backend shows the failure alongside the pipeline run event. Downstream SLA monitoring can check not just whether the pipeline completed but whether it completed with passing quality metrics.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Building a Data Observability Culture&lt;/h2&gt;
&lt;p&gt;OpenLineage provides the technical foundation for data observability, but tooling alone doesn&apos;t create an observable data platform. The cultural and organizational practices around using lineage data matter as much as the implementation.&lt;/p&gt;
&lt;p&gt;The most effective data observability practices share common patterns. The first is establishing a lineage review process for any pipeline change that touches a widely-consumed dataset. Before a data engineer renames a column, drops a table, or changes a transformation, a lineage query shows which downstream assets are affected. This makes the impact assessment step fast and routine rather than slow and anxious.&lt;/p&gt;
&lt;p&gt;The second is using lineage data in incident postmortems. When a pipeline failure or data quality incident is resolved, the postmortem includes a lineage analysis: which upstream changes contributed to the incident? Which downstream assets were affected? What would have been visible in the lineage graph that could have detected the issue earlier? Postmortems that include lineage analysis produce actionable improvements to monitoring configurations.&lt;/p&gt;
&lt;p&gt;The third is making lineage visible to data consumers, not just platform engineers. When a data analyst can open the data catalog, click on the dashboard they use daily, and trace its lineage back to source systems (seeing what pipelines feed it, when those pipelines last ran successfully, and whether any upstream quality checks are failing), they develop intuitions about data trustworthiness that improve their analytical work. Analysts who understand that the revenue dashboard reads from Iceberg tables that were last updated four hours ago ask better questions about data freshness than analysts who have no visibility into their data&apos;s provenance.&lt;/p&gt;
&lt;p&gt;The barrier to this visibility is often not technical but organizational. Platform teams that treat lineage data as internal infrastructure rather than a consumer-facing feature miss the organizational benefit. The goal is a culture where &amp;quot;check the lineage&amp;quot; is a natural first response to data questions, the same way &amp;quot;check the logs&amp;quot; is a natural first response to software incidents.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;OpenLineage and Data Catalog Integration&lt;/h2&gt;
&lt;p&gt;OpenLineage&apos;s most powerful use is when lineage events are consumed by a data catalog that provides a searchable, visual interface for the lineage graph. Tools like OpenMetadata, DataHub, and Atlan consume OpenLineage events through their integrations and build navigable lineage graphs in their catalogs.&lt;/p&gt;
&lt;p&gt;The integration pattern is straightforward: the Marquez or other OpenLineage backend stores events, and the data catalog either reads from Marquez or receives OpenLineage events directly through its own API. The catalog then presents lineage as a feature of each data asset&apos;s profile, alongside description, schema, quality metrics, and access information.&lt;/p&gt;
&lt;p&gt;This integration enables impact analysis at catalog query time. When a data engineer needs to make a change to a source table, they can open the table in the catalog, click on the lineage tab, and immediately see all downstream assets that depend on it, dbt models, Spark jobs, dashboards, and reports. Impact analysis that previously required searching through code repositories and asking colleagues is now a self-service catalog operation.&lt;/p&gt;
&lt;p&gt;Automated impact notification takes this further. Governance platforms that integrate with OpenLineage can automatically notify owners of downstream assets when an upstream table&apos;s schema changes. The Iceberg &lt;code&gt;SchemaChange&lt;/code&gt; event, emitted through OpenLineage when a column is added or type is changed, triggers notifications to every team that owns a downstream asset consuming that schema. This replaces informal Slack notifications and runbook checklist items with automated, reliable communication.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Build Reliable Data Platforms&lt;/h3&gt;
&lt;p&gt;For comprehensive guidance on data platform observability, governance, and lakehouse architecture, pick up &lt;a href=&quot;https://www.amazon.com/dp/B0GQNY21TD&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI: A Hands-On Practitioner&apos;s Guide to Modern Data Architecture, Open Table Formats, and Agentic AI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Browse Alex&apos;s other data engineering and analytics books at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Dremio provides native query lineage and integrated observability for your Iceberg lakehouse. Try it free at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>When Paimon Beats Iceberg for Mutable Streams</title><link>https://iceberglakehouse.com/posts/2026-05-24-paimon-vs-iceberg-mutable-streams/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-24-paimon-vs-iceberg-mutable-streams/</guid><description>
Most lakehouse format comparisons skip the part that actually matters for streaming teams: how the format handles mutations. Apache Iceberg is excell...</description><pubDate>Sun, 24 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Most lakehouse format comparisons skip the part that actually matters for streaming teams: how the format handles mutations. Apache Iceberg is excellent for append-heavy analytics, schema evolution, and multi-engine compatibility. But feed a high-churn CDC stream of updates and deletes into Iceberg using merge-on-read (MoR), and you&apos;re managing a growing pile of delete files that accumulate between compaction runs.&lt;/p&gt;
&lt;p&gt;Apache Paimon takes a different approach. Its Log-Structured Merge-tree (LSM-tree) architecture is designed from the ground up for continuous upserts. For the right workload (high-frequency mutations, Flink-native execution, real-time table freshness requirements) Paimon produces a cleaner operational profile than Iceberg. For the wrong workload, it&apos;s an unnecessary complexity burden.&lt;/p&gt;
&lt;p&gt;This post defines the specific conditions where Paimon wins, where Iceberg remains the better default, and what you actually need to configure to use either effectively.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;What Paimon Is and Where It Came From&lt;/h2&gt;
&lt;p&gt;Apache Paimon graduated to a Top-Level Project at the Apache Software Foundation in 2024 and has since progressed through several production-ready releases. It grew out of the Flink Table Store project, which explains why its design assumptions are so tightly aligned with Apache Flink.&lt;/p&gt;
&lt;p&gt;Paimon distinguishes itself from Iceberg and Delta Lake through its choice of storage data structure. While Iceberg stores data as immutable Parquet snapshots and Hudi uses a record-level index for updates, Paimon uses an LSM-tree, the same family of data structures that underlies systems like LevelDB, RocksDB, and Apache Cassandra.&lt;/p&gt;
&lt;p&gt;The choice of LSM-tree is not incidental. It&apos;s a direct response to the specific access pattern of high-frequency updates.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;How the LSM-Tree Handles Updates Differently&lt;/h2&gt;
&lt;p&gt;In an LSM-tree, incoming writes land in an in-memory buffer, which is periodically flushed to sorted files on disk. These sorted files are organized into levels (L0, L1, L2, and so on). Background compaction processes merge smaller files from lower levels into larger files at higher levels, resolving conflicts between earlier and later writes to the same key.&lt;/p&gt;
&lt;p&gt;For a CDC stream where the same row might be updated many times per minute, this has concrete benefits. New CDC events always write to the in-memory buffer and are flushed to the lowest level. The query engine reads the merged view across levels. The background compaction consolidates updates at rest without blocking writes. At no point does a query need to scan separate delete files and reconcile them against data files, the LSM-tree&apos;s merge logic handles this intrinsically.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/paimon-vs-iceberg-mutable-streams/paimon-lsm-vs-iceberg-mor-architecture.png&quot; alt=&quot;Architecture diagram comparing Paimon LSM-tree with in-memory buffer, SSTable file levels, and background compaction against Iceberg merge-on-read with separate data files and delete files requiring query-time merge&quot;&gt;&lt;/p&gt;
&lt;p&gt;Iceberg&apos;s approach is different. Under MoR semantics, an update to a row in Iceberg does not rewrite the data file. Instead, it writes a delete file that records the position or equality of the row to be removed, and writes a new data file containing the updated row. Queries must read both the original data files and the corresponding delete files, then apply the delete records to produce the correct result.&lt;/p&gt;
&lt;p&gt;This works fine when updates are infrequent. When a table receives thousands of updates per second across a high-cardinality key space, delete files accumulate faster than compaction can remove them. Query performance degrades because the engine must read and reconcile more and more file pairs. The recommended remediation is more frequent compaction, which adds operational overhead and resource contention.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Paimon&apos;s Primary Key Table: The Core Streaming Primitive&lt;/h2&gt;
&lt;p&gt;The central concept in Paimon for streaming workloads is the primary key table. When you define a table with a primary key, Paimon routes all writes for that key through the LSM-tree, resolving conflicts using the configured merge engine.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Create a Paimon primary key table for CDC ingestion from MySQL
CREATE TABLE customer_orders (
    order_id    BIGINT PRIMARY KEY NOT ENFORCED,
    customer_id BIGINT,
    status      STRING,
    amount      DECIMAL(10, 2),
    updated_at  TIMESTAMP(3)
) WITH (
    &apos;connector&apos;                 = &apos;paimon&apos;,
    &apos;path&apos;                      = &apos;s3://data-lake/paimon/customer_orders&apos;,
    &apos;bucket&apos;                    = &apos;8&apos;,
    &apos;changelog-producer&apos;        = &apos;lookup&apos;,
    &apos;merge-engine&apos;              = &apos;deduplicate&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;changelog-producer&lt;/code&gt; property controls how Paimon generates downstream changelog records for consumers of this table:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;input&lt;/code&gt;&lt;/strong&gt;: Assumes the input stream already contains full changelog events (+-I, -U, +U, -D). Use this when consuming from a Debezium CDC source.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;lookup&lt;/code&gt;&lt;/strong&gt;: Generates changelogs by performing a point query on the existing table state before each write. This ensures accurate before-and-after pairs even when the input stream doesn&apos;t carry them.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;full-compaction&lt;/code&gt;&lt;/strong&gt;: Generates changelogs by comparing table states across full compaction cycles. Produces the most accurate changelogs but with higher latency.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;code&gt;merge-engine&lt;/code&gt; property controls how conflicts for the same primary key are resolved. &lt;code&gt;deduplicate&lt;/code&gt; keeps the last write. For aggregation use cases (such as a running balance or session counter) &lt;code&gt;aggregation&lt;/code&gt; allows you to define column-level merge functions like &lt;code&gt;sum&lt;/code&gt;, &lt;code&gt;max&lt;/code&gt;, or &lt;code&gt;last_non_null&lt;/code&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Changelog Stream Feature: Why Paimon Tables Are Different from Iceberg Tables&lt;/h2&gt;
&lt;p&gt;One of Paimon&apos;s most distinctive capabilities is that primary key tables can serve as both a batch-readable lakehouse table and a changelog source for downstream Flink jobs.&lt;/p&gt;
&lt;p&gt;When a Flink job reads a Paimon primary key table as a streaming source, it doesn&apos;t read static snapshots. It reads the changelog stream, a continuous stream of &lt;code&gt;+I&lt;/code&gt;, &lt;code&gt;-U&lt;/code&gt;, &lt;code&gt;+U&lt;/code&gt;, and &lt;code&gt;-D&lt;/code&gt; records that represent every mutation to the table. This means you can chain Flink jobs where the output of one job becomes the changelog input of the next, building multi-stage stateful streaming pipelines that maintain real-time derived tables.&lt;/p&gt;
&lt;p&gt;Iceberg can serve as a streaming source in Flink through incremental reads, but its model is snapshot-based. Flink reads successive Iceberg snapshots and emits new or deleted rows detected between them. This works for append-only tables and bounded update patterns, but doesn&apos;t produce the full changelog semantics that Paimon emits natively. Building an accurate changelog from Iceberg incremental reads requires additional logic to handle updates that touch the same row across multiple snapshots.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Where Paimon Falls Short&lt;/h2&gt;
&lt;p&gt;Paimon&apos;s ecosystem reach is its most significant constraint. As of mid-2025, the query engines with strong Paimon read support are Apache Flink and Apache Spark. Trino has a community-maintained Paimon connector, but it lacks some of the more advanced Paimon table features. Engines like Dremio, DuckDB, and Snowflake don&apos;t have native Paimon integration.&lt;/p&gt;
&lt;p&gt;If your architecture requires ad-hoc SQL from multiple query engines, particularly BI tools that connect through Trino or Presto, Iceberg&apos;s ecosystem compatibility is a clear advantage. Nearly every modern query engine supports Iceberg out of the box.&lt;/p&gt;
&lt;p&gt;Paimon does offer an Iceberg-compatible read path, which exposes Paimon tables as Iceberg tables to engines that support the Iceberg REST Catalog API. This compatibility layer allows engines like Spark and Trino to read Paimon data without native Paimon support. However, the compatibility layer is read-only and doesn&apos;t expose Paimon&apos;s changelog semantics. You get table data but not the streaming mutation stream.&lt;/p&gt;
&lt;p&gt;Another constraint is operational maturity. Iceberg has a larger user base, more documented failure patterns, and more tooling for maintenance, governance, and catalog integration. Teams evaluating Paimon for production use should plan for less community documentation on edge cases and a steeper learning curve on tuning the LSM-tree parameters.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Workload Decision Matrix&lt;/h2&gt;
&lt;p&gt;The decision between Paimon and Iceberg narrows to two dimensions: how frequently your tables receive updates and how many different engines need to read the data.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/paimon-vs-iceberg-mutable-streams/paimon-vs-iceberg-workload-decision-matrix.png&quot; alt=&quot;Two-by-two decision matrix showing Paimon as best choice for high-frequency CDC workloads with narrow Flink-focused ecosystem, and Iceberg as best for broad multi-engine access regardless of update frequency&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;High-frequency CDC, Flink-native execution, real-time freshness required:&lt;/strong&gt; This is Paimon&apos;s optimal use case. Examples include order management systems where rows update dozens of times per order lifecycle, inventory tracking with near-continuous SKU-level updates, and user session tables where states change multiple times per minute. The LSM-tree handles the churn cleanly without delete file accumulation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Batch ETL with broad BI access:&lt;/strong&gt; Iceberg wins here without competition. If your primary workload is appending daily partitions and querying from Trino, Dremio, Snowflake, and Spark, Iceberg&apos;s multi-engine support and mature governance features make it the clear choice.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mixed workload, multi-engine access, moderate update frequency:&lt;/strong&gt; Both work. For teams with existing Iceberg infrastructure and moderate CDC volume, tuning Iceberg&apos;s compaction settings and using copy-on-write (CoW) for large-batch updates is often simpler than introducing a second table format. Adopt Paimon selectively for the tables where it demonstrably helps, rather than as a wholesale platform replacement.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Practical Configuration for High-Churn Paimon Tables&lt;/h2&gt;
&lt;p&gt;When tuning a Paimon primary key table for a high-churn CDC source, three settings matter most.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Bucket count:&lt;/strong&gt; Paimon distributes data by primary key across a fixed number of buckets. Each bucket holds an LSM-tree. More buckets allow more write parallelism but increase the number of small files at low data volumes. For a table with millions of rows and hundreds of thousands of updates per minute, 16 to 64 buckets is a reasonable starting range.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Compaction trigger:&lt;/strong&gt; Paimon triggers compaction when the number of sorted files at level 0 exceeds a threshold. The default threshold is 5. For very high write rates, reducing this to 3 keeps the LSM-tree shallow and maintains consistent read performance. For lower write rates, increasing the threshold to 8 or 10 reduces compaction frequency and I/O overhead.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Full-compaction interval:&lt;/strong&gt; For tables serving changelog consumers, schedule periodic full compaction to ensure that changelog events are complete and accurate. Lookup-mode changelog producers generate accurate changelogs on individual writes, but full compaction provides a consistency checkpoint that catches any drift between levels.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Configure a high-churn Paimon table with aggressive compaction settings
ALTER TABLE customer_orders SET (
    &apos;num-sorted-run.compaction-trigger&apos; = &apos;3&apos;,
    &apos;full-compaction.delta-commits&apos;     = &apos;20&apos;,
    &apos;write.merge-engine&apos;                = &apos;deduplicate&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Paimon is not a general-purpose Iceberg replacement. It&apos;s a purpose-built tool for a specific problem: high-frequency mutations in streaming lakehouse architectures, particularly where Apache Flink is the primary compute engine and real-time table freshness is a hard requirement.&lt;/p&gt;
&lt;p&gt;For append-heavy pipelines, mixed-engine analytics, or organizations that have already invested in Iceberg governance tooling, Iceberg remains the better default. The format choice should follow the workload, not the other way around.&lt;/p&gt;
&lt;p&gt;The clearest signal that Paimon is worth evaluating is mounting operational complexity around Iceberg compaction on high-churn tables. If you&apos;re spending more time managing delete file accumulation and compaction schedules than building pipeline features, Paimon&apos;s LSM-tree model is worth testing against your specific throughput numbers.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Paimon Tags: Batch-Compatible Snapshots for CDC Tables&lt;/h2&gt;
&lt;p&gt;One of Paimon&apos;s useful operational features is the Tag system. Unlike Iceberg&apos;s snapshot-based time travel (which is tied to all historical snapshots), Paimon Tags allow you to mark specific points in a table&apos;s history for long-term retention.&lt;/p&gt;
&lt;p&gt;Tags are particularly valuable for CDC tables where you want to support both the streaming changelog use case and the batch analytics use case simultaneously:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Create a daily tag for batch processing access
CALL sys.create_tag(&apos;my_catalog.default.customer_orders&apos;, &apos;2025-05-24&apos;, 2 /*snapshot-id*/);

-- Read from a tagged version for batch analytics
SELECT * FROM customer_orders /*+ OPTIONS(&apos;scan.tag-name&apos;=&apos;2025-05-24&apos;) */;

-- Expire snapshots while retaining tags
CALL sys.expire_snapshots(&apos;my_catalog.default.customer_orders&apos;, &apos;2025-05-24 00:00:00&apos;, 10 /*retain-latest*/);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tags persist independently from snapshots. You can expire Paimon snapshots aggressively to control storage costs while retaining daily or weekly tags for historical analytical access. This gives CDC tables the same time-travel capability that makes Iceberg valuable for audit use cases, without the storage cost of retaining every intermediate snapshot.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Streaming Aggregations with Paimon&apos;s Aggregation Merge Engine&lt;/h2&gt;
&lt;p&gt;One of Paimon&apos;s most distinctive features is its native support for streaming aggregations, running computations that accumulate over time directly in the table format.&lt;/p&gt;
&lt;p&gt;The aggregation merge engine allows defining column-level merge functions that resolve conflicts for the same primary key:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Paimon table for session-level aggregations
CREATE TABLE user_sessions (
    user_id             BIGINT PRIMARY KEY NOT ENFORCED,
    session_count       INT,
    total_purchase_amt  DOUBLE,
    last_active         TIMESTAMP(3),
    active_days         BIGINT
) WITH (
    &apos;connector&apos;     = &apos;paimon&apos;,
    &apos;path&apos;          = &apos;s3://data-lake/paimon/user_sessions&apos;,
    &apos;merge-engine&apos;  = &apos;aggregation&apos;,
    &apos;fields.session_count.aggregate-function&apos;         = &apos;sum&apos;,
    &apos;fields.total_purchase_amt.aggregate-function&apos;    = &apos;sum&apos;,
    &apos;fields.last_active.aggregate-function&apos;           = &apos;last_non_null&apos;,
    &apos;fields.active_days.aggregate-function&apos;           = &apos;count&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Incoming events contain partial updates: a new session event increments &lt;code&gt;session_count&lt;/code&gt; by 1, adds the purchase amount to &lt;code&gt;total_purchase_amt&lt;/code&gt;, and updates &lt;code&gt;last_active&lt;/code&gt; to the event timestamp. Paimon&apos;s LSM-tree merge logic applies these aggregations at compaction time, accumulating the correct running totals without requiring a stateful Flink operator to maintain the aggregation in RocksDB.&lt;/p&gt;
&lt;p&gt;This pattern is particularly efficient for analytics tables that are updated continuously from streaming sources but queried on a batch schedule. The aggregation merge engine handles the incremental state in the table format itself, rather than requiring complex stateful stream processing.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Monitoring Paimon Tables in Production&lt;/h2&gt;
&lt;p&gt;Paimon doesn&apos;t have the same ecosystem of monitoring tooling as Iceberg (which benefits from tools like PyIceberg&apos;s table introspection and Spark&apos;s &lt;code&gt;DESCRIBE HISTORY&lt;/code&gt;). But Paimon exposes sufficient system tables for building operational monitoring:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Check LSM-tree file count across buckets
SELECT bucket, level, count(*) as file_count
FROM customer_orders$files
GROUP BY bucket, level
ORDER BY bucket, level;

-- Check snapshot history
SELECT snapshot_id, schema_id, commit_time, total-size
FROM customer_orders$snapshots
ORDER BY commit_time DESC
LIMIT 20;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Metrics to watch in production Paimon environments:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;L0 file count per bucket:&lt;/strong&gt; High L0 file counts (&amp;gt;5) indicate compaction is falling behind write throughput. This degrades read performance as the query engine must merge more sorted runs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compaction lag time:&lt;/strong&gt; How recently did the last full compaction complete? For changelog-producing tables, stale compaction creates gaps in downstream changelog accuracy.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage size per bucket:&lt;/strong&gt; Uneven distribution across buckets indicates poor key distribution. Hot buckets receive disproportionate writes and compact more frequently than cold buckets, creating performance inconsistency.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Paimon&apos;s Flink integration also exposes JVM metrics for compaction thread pool saturation, which can be monitored through Prometheus/Grafana for operational alerting.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Go Further with Lakehouse Architecture&lt;/h3&gt;
&lt;p&gt;For a comprehensive guide to modern data architectures including open table format comparisons, streaming lakehouse design, and AI-native data platforms, pick up &lt;a href=&quot;https://www.amazon.com/dp/B0GQNY21TD&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI: A Hands-On Practitioner&apos;s Guide to Modern Data Architecture, Open Table Formats, and Agentic AI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Browse Alex&apos;s other data engineering and analytics books at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To query both Paimon and Iceberg tables with unified sub-second performance and automated reflection caching, try Dremio Cloud free at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Policy as Code for Lakehouse Governance</title><link>https://iceberglakehouse.com/posts/2026-05-24-policy-as-code-governance/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-24-policy-as-code-governance/</guid><description>
The traditional approach to data access governance relies on role-based access control: you define roles, assign users to roles, and grant roles acce...</description><pubDate>Sun, 24 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The traditional approach to data access governance relies on role-based access control: you define roles, assign users to roles, and grant roles access to specific tables or schemas. For a team of ten analysts and a handful of sensitive tables, this is manageable. For an organization with hundreds of analysts, dozens of data domains, and fine-grained sensitivity classifications across thousands of tables, RBAC becomes a maintenance burden that governance teams can&apos;t keep current.&lt;/p&gt;
&lt;p&gt;The role explosion problem is real. When access is controlled purely by roles, every new combination of &amp;quot;user group + sensitivity level + regional constraint&amp;quot; requires a new role. Governance teams spend more time managing role assignments than thinking about policy intent. Access reviews become a bureaucratic exercise because nobody can actually read the policy from the role hierarchy.&lt;/p&gt;
&lt;p&gt;Attribute-based access control (ABAC) with policy-as-code addresses this by making policy intent explicit and composable. Instead of managing roles that encode every possible permission combination, you write policies that express rules like &amp;quot;anyone with the ANALYST attribute can see aggregate-level metrics but not individual user records&amp;quot; or &amp;quot;any column tagged PII is masked for all roles except DATA_OWNER.&amp;quot;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Policy Layers&lt;/h2&gt;
&lt;p&gt;Modern lakehouse governance operates at three levels, each serving a different granularity of control.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Column-level masking&lt;/strong&gt; controls what individual fields a user sees. A column tagged &lt;code&gt;PII&lt;/code&gt; might return &lt;code&gt;SHA256(email)&lt;/code&gt; for analysts and the raw value for data owners. Column masks are SQL expressions that evaluate per-user at query time, no materialized copies of masked data are required.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Row-level filters&lt;/strong&gt; control which rows a user can see. A table with a &lt;code&gt;region_code&lt;/code&gt; column might filter results to &lt;code&gt;WHERE region_code = current_user_attribute(&apos;region&apos;)&lt;/code&gt;, ensuring regional managers only see data for their assigned regions without requiring separate materialized views per region.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Object-level policies&lt;/strong&gt; control whether a user can access a table, schema, or catalog at all. These are the coarser-grained permissions that sit above column and row controls.&lt;/p&gt;
&lt;p&gt;The key architectural property of all three layers: policy evaluation happens at query time against live data. No data copies for different user groups, no materialized views per region, no separate tables for masked versus unmasked data.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Policy Hierarchy: From Organization to Table&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/policy-as-code-governance/policy-hierarchy-lakehouse.png&quot; alt=&quot;Three-level policy hierarchy showing organization-level OPA tag-based policies at top flowing down to data domain policies and then to table-level row filters and column masks&quot;&gt;&lt;/p&gt;
&lt;p&gt;An effective governance architecture organizes policies in a hierarchy:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Organization-level policies&lt;/strong&gt; express intent that applies everywhere: &amp;quot;PII-tagged columns are always masked for external roles.&amp;quot; These live in OPA or in the lakehouse catalog&apos;s tag governance layer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Domain-level policies&lt;/strong&gt; refine organization policies for specific data domains: &amp;quot;The finance domain allows ANALYST role to see &lt;code&gt;revenue_total&lt;/code&gt; because it&apos;s not PII, even though it&apos;s CONFIDENTIAL.&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table-level policies&lt;/strong&gt; apply specific row filters and column masks based on the table&apos;s data and the query user&apos;s attributes.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;OPA: General-Purpose Policy Engine&lt;/h2&gt;
&lt;p&gt;Open Policy Agent (OPA) provides a general-purpose policy engine using the Rego policy language. For data governance use cases, OPA typically sits as a policy decision point that data access layers query before allowing data retrieval.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-rego&quot;&gt;# OPA policy: column access by tag and role
package data.access

default allow_column = false

allow_column {
    # Allow if the column is not tagged PII
    not column_is_pii
}

allow_column {
    # Allow if user has DATA_OWNER role even for PII
    input.user.roles[_] == &amp;quot;DATA_OWNER&amp;quot;
}

column_is_pii {
    # Check if this column has a PII tag
    input.column.tags[_] == &amp;quot;PII&amp;quot;
}

# Generate a mask expression for partially visible columns
mask_expression = mask {
    column_is_pii
    not allow_column
    mask := sprintf(&amp;quot;SHA256(%s)&amp;quot;, [input.column.name])
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;OPA works well for governance logic that needs to span multiple data platforms. If your organization uses both Databricks and Snowflake, OPA can serve as the authoritative policy decision point for both, with each platform&apos;s governance layer consulting OPA at query time.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Databricks: Row Filters and Column Masks&lt;/h2&gt;
&lt;p&gt;Databricks implements ABAC-style governance through Unity Catalog&apos;s row filter functions and column masking functions, applied at the table level.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Create a column masking function for PII
CREATE OR REPLACE FUNCTION masks.email_masker(email STRING)
RETURNS STRING
RETURN CASE
    WHEN is_member(&apos;data_owners&apos;) THEN email
    ELSE CONCAT(LEFT(email, 2), &apos;****@&apos;, SPLIT_PART(email, &apos;@&apos;, 2))
END;

-- Apply the mask to a table column
ALTER TABLE users
ALTER COLUMN email SET MASK masks.email_masker;

-- Create a row filter function for regional access
CREATE OR REPLACE FUNCTION filters.region_filter(region_code STRING)
RETURNS BOOLEAN
RETURN CASE
    WHEN is_member(&apos;global_analysts&apos;) THEN TRUE
    ELSE region_code = current_user_attribute(&apos;region&apos;)
END;

-- Apply the row filter to a table
ALTER TABLE orders
SET ROW FILTER filters.region_filter ON (region_code);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These functions evaluate at query time using the current user&apos;s session context, their roles and attributes. The SQL for the function is stored in Unity Catalog and version-controlled alongside other table metadata.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Snowflake Horizon: Cross-Engine Policy Enforcement&lt;/h2&gt;
&lt;p&gt;Snowflake Horizon extends governance beyond Snowflake to Iceberg tables managed by Snowflake&apos;s Iceberg integration. When external engines (Spark, Trino, Dremio) access Iceberg tables through Snowflake&apos;s REST Catalog endpoint, the same row and column policies that apply to Snowflake-native queries apply to external engine queries.&lt;/p&gt;
&lt;p&gt;This cross-engine policy enforcement is architecturally significant. It means your column masking policies for PII fields apply regardless of whether the query originates from Snowflake SQL, a Spark job, or a BI tool connecting through Trino, all enforced through the catalog layer, not duplicated in each engine.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;BigQuery: Tag-Based Row and Column Security&lt;/h2&gt;
&lt;p&gt;BigQuery implements governance through a combination of data classification tags, row-level access policies, and dynamic data masking:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Apply BigQuery row-level access policy
CREATE OR REPLACE ROW ACCESS POLICY regional_access_policy
ON my_dataset.orders
GRANT TO (&amp;quot;group:us-analysts@company.com&amp;quot;)
FILTER USING (region = &apos;US&apos;);

-- Apply column-level masking policy for sensitive data
CREATE OR REPLACE DATA POLICY email_masking_policy
ON my_dataset.users
USING (MASKING POLICY RULE
    WHEN CURRENT_GROUPS() NOT IN UNNEST([&apos;group:data-owners@company.com&apos;])
    THEN SHA256(email)
);

-- Assign masking policy to column
ALTER TABLE my_dataset.users
ALTER COLUMN email SET DATA POLICY email_masking_policy;
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h2&gt;Tag-Driven Policy Evaluation&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/policy-as-code-governance/policy-as-code-tag-driven-access-flow.png&quot; alt=&quot;Policy-as-code tag-driven access control flow showing user query request flowing through policy engine consulting column tags and row filter policies, evaluating to allow (full result) or deny/mask, with audit log&quot;&gt;&lt;/p&gt;
&lt;p&gt;The most scalable governance implementations use tags as the bridge between data classification and policy rules. Data engineers tag columns and tables during creation using the catalog API. Governance policies reference tags rather than specific column names. This means adding a new table with properly tagged columns automatically inherits all relevant policies without requiring governance team intervention for each new dataset.&lt;/p&gt;
&lt;p&gt;The tagging workflow in practice:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Tag columns during table creation using Unity Catalog
catalog_client.apply_column_tags(
    catalog=&amp;quot;production&amp;quot;,
    schema=&amp;quot;customer_data&amp;quot;,
    table=&amp;quot;users&amp;quot;,
    column_tags={
        &amp;quot;email&amp;quot;: [&amp;quot;PII&amp;quot;, &amp;quot;GDPR_PERSONAL_DATA&amp;quot;],
        &amp;quot;phone&amp;quot;: [&amp;quot;PII&amp;quot;, &amp;quot;GDPR_PERSONAL_DATA&amp;quot;],
        &amp;quot;user_id&amp;quot;: [&amp;quot;IDENTIFIER&amp;quot;],
        &amp;quot;region&amp;quot;: [&amp;quot;OPERATIONAL&amp;quot;],
        &amp;quot;total_spend&amp;quot;: [&amp;quot;FINANCIAL&amp;quot;, &amp;quot;CONFIDENTIAL&amp;quot;]
    }
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once columns are tagged, governance policies evaluate dynamically based on tags, no policy updates required when new columns are added to existing tables, as long as they&apos;re tagged correctly.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Policy-as-code governance replaces a role explosion problem with a classification and expression problem. The discipline required is maintaining accurate column and table tags, writing clear policy expressions, and validating that policies evaluate correctly for each relevant user persona.&lt;/p&gt;
&lt;p&gt;The operational benefit is governance at scale: hundreds of tables with complex sensitivity classifications, regional requirements, and user attribute constraints, all managed through composable policies rather than an unmaintainable role hierarchy.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;CI/CD for Governance Policies&lt;/h2&gt;
&lt;p&gt;One of the most underappreciated benefits of policy-as-code is that policies can go through the same CI/CD workflows as application code. A governance policy change (extending PII masking to a new column, adding a regional restriction, creating a new attribute for a third-party analytics role), is a pull request that gets reviewed, approved, and deployed through the same process as software changes.&lt;/p&gt;
&lt;p&gt;This process change matters for compliance. When access policy changes are version-controlled and require approval, there&apos;s an audit trail of every policy change, who proposed it, who approved it, and when it was deployed. This audit trail is the kind of evidence SOC 2, GDPR data processing audits, and HIPAA compliance reviews require.&lt;/p&gt;
&lt;p&gt;A practical governance CI/CD workflow:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;# .github/workflows/governance-policy.yml
name: Governance Policy CI/CD

on:
  pull_request:
    paths:
      - &amp;quot;governance/policies/**&amp;quot;
      - &amp;quot;governance/tags/**&amp;quot;

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install OPA
        run: |
          curl -L -o opa https://openpolicyagent.org/downloads/latest/opa_linux_amd64
          chmod +x opa

      - name: Validate Rego syntax
        run: ./opa check governance/policies/

      - name: Run policy unit tests
        run: ./opa test governance/policies/ governance/tests/ -v

      - name: Simulate policy against test users
        run: |
          python scripts/simulate_policy_evaluation.py \
            --policy-dir governance/policies/ \
            --test-users test-fixtures/user-personas.json \
            --expected-access test-fixtures/expected-access.json

  deploy:
    needs: validate
    runs-on: ubuntu-latest
    if: github.ref == &apos;refs/heads/main&apos;
    steps:
      - name: Apply policies to Unity Catalog
        run: python scripts/apply_policies.py --env production
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Policy unit tests validate that specific user personas get the expected access decisions:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# governance/tests/test_pii_masking.py
import subprocess
import json

def test_analyst_sees_masked_email():
    &amp;quot;&amp;quot;&amp;quot;Analyst role should see masked email, not raw PII.&amp;quot;&amp;quot;&amp;quot;
    input_data = {
        &amp;quot;user&amp;quot;: {&amp;quot;roles&amp;quot;: [&amp;quot;ANALYST&amp;quot;], &amp;quot;region&amp;quot;: &amp;quot;US&amp;quot;},
        &amp;quot;column&amp;quot;: {&amp;quot;name&amp;quot;: &amp;quot;email&amp;quot;, &amp;quot;tags&amp;quot;: [&amp;quot;PII&amp;quot;]},
        &amp;quot;resource&amp;quot;: {&amp;quot;table&amp;quot;: &amp;quot;users&amp;quot;}
    }
    result = subprocess.run(
        [&amp;quot;./opa&amp;quot;, &amp;quot;eval&amp;quot;, &amp;quot;--data&amp;quot;, &amp;quot;governance/policies/&amp;quot;,
         &amp;quot;--input&amp;quot;, &amp;quot;/dev/stdin&amp;quot;, &amp;quot;data.access.allow_column&amp;quot;],
        input=json.dumps(input_data), capture_output=True, text=True
    )
    assert json.loads(result.stdout)[&amp;quot;result&amp;quot;][0][&amp;quot;expressions&amp;quot;][0][&amp;quot;value&amp;quot;] == False

def test_data_owner_sees_raw_email():
    &amp;quot;&amp;quot;&amp;quot;Data owner role should see raw email.&amp;quot;&amp;quot;&amp;quot;
    input_data = {
        &amp;quot;user&amp;quot;: {&amp;quot;roles&amp;quot;: [&amp;quot;DATA_OWNER&amp;quot;], &amp;quot;region&amp;quot;: &amp;quot;US&amp;quot;},
        &amp;quot;column&amp;quot;: {&amp;quot;name&amp;quot;: &amp;quot;email&amp;quot;, &amp;quot;tags&amp;quot;: [&amp;quot;PII&amp;quot;]},
        &amp;quot;resource&amp;quot;: {&amp;quot;table&amp;quot;: &amp;quot;users&amp;quot;}
    }
    result = subprocess.run(
        [&amp;quot;./opa&amp;quot;, &amp;quot;eval&amp;quot;, &amp;quot;--data&amp;quot;, &amp;quot;governance/policies/&amp;quot;,
         &amp;quot;--input&amp;quot;, &amp;quot;/dev/stdin&amp;quot;, &amp;quot;data.access.allow_column&amp;quot;],
        input=json.dumps(input_data), capture_output=True, text=True
    )
    assert json.loads(result.stdout)[&amp;quot;result&amp;quot;][0][&amp;quot;expressions&amp;quot;][0][&amp;quot;value&amp;quot;] == True
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These tests run in CI for every policy pull request, catching policy regressions before they reach production.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Governance in Practice: Common Patterns and Pitfalls&lt;/h2&gt;
&lt;p&gt;Teams implementing policy-as-code governance consistently encounter the same patterns and pitfalls:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Start with classification, not policies.&lt;/strong&gt; The foundation of effective policy-as-code is a well-maintained taxonomy of sensitivity tags. If columns aren&apos;t consistently tagged, policies can&apos;t apply consistently. Invest in automated tagging during table creation (inference from column names, schema patterns) and manual review workflows for classification confirmation before writing complex policies.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Test every user persona.&lt;/strong&gt; Policies have a way of having unintended consequences for edge case user types. A policy that correctly restricts external partners might accidentally also restrict internal read-only service accounts that need full data access for operational purposes. Test matrices covering all significant user personas (not just the typical analyst and data owner), catch these edge cases before they become incidents.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Avoid policy logic in application code.&lt;/strong&gt; When data access restrictions are duplicated in application code (for example, a BI dashboard that adds a WHERE clause for the current user&apos;s region), governance drifts: the policy in the catalog and the logic in the application can diverge. Centralize all access restrictions in the catalog&apos;s policy layer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Monitor for policy failures, not just audit logs.&lt;/strong&gt; Audit logs show what queries ran. Policy failure monitoring shows when queries were blocked or returned masked data, and for what reason. Governance teams need both views: audit for compliance evidence, failure monitoring for diagnosing access configuration problems.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Policy-as-Code for AI Agents and Automated Systems&lt;/h2&gt;
&lt;p&gt;One governance challenge that has grown rapidly with the adoption of AI agents is access control for non-human principals. A traditional RBAC model assumes a human user logs in and queries data through a BI tool or SQL client. In 2025, the reality includes AI agents that query data, generate reports, train on datasets, and make data-driven decisions autonomously.&lt;/p&gt;
&lt;p&gt;Policy-as-code frameworks are well-suited to governing AI agent access because they can express intent-based access control, not just &amp;quot;does this principal have read access to this table&amp;quot; but &amp;quot;is this query consistent with the stated purpose of this agent.&amp;quot; For example, an agent that is authorized to answer customer support questions should not be able to query the aggregate financial metrics table, even if that table is technically accessible to the service account the agent runs under.&lt;/p&gt;
&lt;p&gt;Attribute-based access control extends naturally to AI agents. An agent can carry attributes in its authentication token that describe its purpose, its associated team, and its approved data domains:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;principal_type&amp;quot;: &amp;quot;ai_agent&amp;quot;,
  &amp;quot;agent_name&amp;quot;: &amp;quot;customer_support_assistant&amp;quot;,
  &amp;quot;allowed_domains&amp;quot;: [&amp;quot;customer&amp;quot;, &amp;quot;order&amp;quot;, &amp;quot;product&amp;quot;],
  &amp;quot;purpose&amp;quot;: &amp;quot;customer_issue_resolution&amp;quot;,
  &amp;quot;approval_scope&amp;quot;: &amp;quot;customer_facing_data_only&amp;quot;,
  &amp;quot;team&amp;quot;: &amp;quot;support_engineering&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Governance policies evaluate these agent attributes the same way they evaluate human user attributes. An agent with &lt;code&gt;allowed_domains: [&amp;quot;customer&amp;quot;, &amp;quot;order&amp;quot;]&lt;/code&gt; is blocked from querying the &lt;code&gt;finance.revenue_summary&lt;/code&gt; table by the same domain restriction policy that applies to human users.&lt;/p&gt;
&lt;p&gt;This design future-proofs the governance layer. As agentic AI systems multiply and become embedded in more data workflows, the policy-as-code framework already handles them, no new access control mechanism is needed, just additional principal types.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Getting Started: A Practical Roadmap&lt;/h2&gt;
&lt;p&gt;For teams moving from RBAC to policy-as-code governance, the transition is best done incrementally rather than all at once. A complete governance overhaul that touches all tables and all roles simultaneously creates significant risk of access disruptions and policy mistakes that affect production workflows.&lt;/p&gt;
&lt;p&gt;A phased approach works better. In the first phase, implement tagging for the highest-sensitivity data assets: PII columns in customer tables, financial data columns, and any data covered by external regulatory requirements. Write and test the policies for those tags. Deploy in shadow mode (logging policy decisions without enforcing them) to validate that the policies produce the expected decisions for all user groups.&lt;/p&gt;
&lt;p&gt;In the second phase, enable enforcement for the policies covering high-sensitivity data. This is the phase where the governance team needs to be actively engaged, answering questions from teams whose data access patterns change. Expect surprises: analysts who had broader access than they needed, service accounts whose access patterns weren&apos;t documented, and edge cases that the policy test suite didn&apos;t cover.&lt;/p&gt;
&lt;p&gt;In the third phase, extend the tagging taxonomy to cover the full data asset inventory and develop policies for the broader classification tiers. By this point, the policy authoring, CI/CD, and validation workflows are established, the third phase is primarily a coverage expansion rather than a capability development exercise.&lt;/p&gt;
&lt;p&gt;The transition from ad-hoc RBAC to systematic policy-as-code governance is a multi-quarter project for most organizations. The investment pays off in governance scalability: when the data platform adds fifty new tables next quarter, the governance burden is tag application (fast) rather than role design and assignment (slow and error-prone).&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Build Governed Data Access Patterns&lt;/h3&gt;
&lt;p&gt;For comprehensive guidance on lakehouse governance, Iceberg access control, and data platform architecture, pick up &lt;a href=&quot;https://www.amazon.com/dp/B0GQNY21TD&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI: A Hands-On Practitioner&apos;s Guide to Modern Data Architecture, Open Table Formats, and Agentic AI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Browse Alex&apos;s other data engineering and analytics books at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Dremio provides fine-grained column and row-level governance across your Iceberg lakehouse with native RBAC and Catalog-level policies. Try it free at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Real-Time Lakehouse Patterns with Apache Flink and Iceberg</title><link>https://iceberglakehouse.com/posts/2026-05-24-real-time-lakehouse-flink/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-24-real-time-lakehouse-flink/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-05-real-time-lakehouse-flink/).

Mo...</description><pubDate>Sun, 24 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-05-real-time-lakehouse-flink/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Most streaming pipelines solve the wrong problem. Teams spend months building infrastructure to move data fast, then discover their downstream lakehouse tables are a mess: thousands of tiny files per partition, schemas that drift silently across topics, and compaction jobs fighting live writes at 3 a.m. The ingestion is fast, but the data is barely usable.&lt;/p&gt;
&lt;p&gt;Apache Flink 2.1, released in July 2025, explicitly frames itself as a unified real-time Data and AI platform. Paired with the Dynamic Iceberg Sink (which reached production-ready status with Apache Iceberg 1.10.0 support), you now have a concrete path to an architecture where Kafka topics land cleanly in Iceberg tables, schema changes never require a job restart, and a single Flink job can serve hundreds of tables simultaneously.&lt;/p&gt;
&lt;p&gt;This post walks through how to actually build that architecture, including the configuration details teams usually skip and the failure modes nobody documents until something breaks in production.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Why the Traditional Kafka-to-Lakehouse Pattern Breaks Down&lt;/h2&gt;
&lt;p&gt;The classic Kafka-to-lake pipeline works like this: a Kafka consumer reads from a topic, transforms the events, and writes Parquet files to S3. Simple, effective, and full of hidden costs.&lt;/p&gt;
&lt;p&gt;The first problem is schema drift. Kafka producers add fields, rename values, and change types. If your consumer expects a fixed schema, it either crashes or silently drops new data. In the best case, your pipeline goes down. In the more common case, you get quiet data loss that doesn&apos;t surface until a report is wrong.&lt;/p&gt;
&lt;p&gt;The second problem is file proliferation. Every Flink checkpoint commits a set of files. If your checkpoint interval is 60 seconds, you produce one set of files per minute. At a 128 MB target file size (the standard recommendation), you can handle that load. But if your data volume drops overnight, you&apos;re writing sub-megabyte files every minute. A week of low-traffic hours can leave a partition with thousands of small files that kill query performance.&lt;/p&gt;
&lt;p&gt;The third problem is operational rigidity. A traditional Flink job defines its source topics and sink tables statically at job startup. Adding a new Kafka topic means modifying the job definition and restarting. For platforms with dozens of microservices publishing to Kafka, that constraint turns the streaming pipeline into a constant maintenance burden.&lt;/p&gt;
&lt;p&gt;The Dynamic Iceberg Sink addresses all three.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Dynamic Iceberg Sink: How It Works&lt;/h2&gt;
&lt;p&gt;The Dynamic Iceberg Sink, supported in Flink 1.20, 2.0, and 2.1 with Apache Iceberg 1.10.0 or newer, removes the static contract between a Flink job and its downstream tables.&lt;/p&gt;
&lt;p&gt;In a traditional Flink-to-Iceberg pipeline, the table schema is defined at job deployment time. If a new field appears in incoming Kafka events, the job has no mechanism to handle it. The field either gets dropped or the job fails on deserialization.&lt;/p&gt;
&lt;p&gt;The Dynamic Sink adds three capabilities that change this:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Automated schema evolution.&lt;/strong&gt; When the sink encounters a field in an incoming record that doesn&apos;t exist in the current Iceberg table schema, it calls the Iceberg catalog to add the new column. Because Iceberg stores its schema as metadata rather than embedded in the data files, this operation doesn&apos;t touch any existing Parquet files. The schema update is a metadata-only write to the catalog. Existing data continues to be readable; new files written after the update include the new field.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Multi-table fan-out.&lt;/strong&gt; A single Dynamic Sink instance can write to an unlimited number of Iceberg tables simultaneously. The routing logic is defined in the incoming event records themselves. Your Kafka event includes a routing key (typically a topic name or entity type) and the sink maps that key to the appropriate Iceberg table. If a table doesn&apos;t exist yet, the sink creates it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Automatic table creation.&lt;/strong&gt; When the sink encounters a routing key that maps to a table that doesn&apos;t exist in the catalog, it creates the table on the fly using the schema inferred from the current record. This means onboarding a new Kafka topic requires zero changes to the Flink job.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/real-time-lakehouse-flink/flink-realtime-lakehouse-architecture.png&quot; alt=&quot;Real-time lakehouse pipeline from Kafka topics through Apache Flink job and Dynamic Iceberg Sink to Iceberg tables, BI tools, and ML pipelines&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Setting Up the Pipeline: From Kafka to Iceberg&lt;/h2&gt;
&lt;p&gt;Here&apos;s a minimal working configuration using Flink SQL. This assumes Flink 2.1, the &lt;code&gt;flink-iceberg-runtime&lt;/code&gt; JAR on the classpath, and a Kafka cluster with the Confluent Schema Registry running.&lt;/p&gt;
&lt;h3&gt;Define the Kafka Source&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Create a Kafka source table in Flink SQL
CREATE TABLE kafka_events (
  `topic`    STRING METADATA FROM &apos;topic&apos; VIRTUAL,
  `payload`  STRING,
  `ts`       TIMESTAMP(3) METADATA FROM &apos;timestamp&apos;,
  WATERMARK FOR `ts` AS `ts` - INTERVAL &apos;5&apos; SECOND
) WITH (
  &apos;connector&apos;                  = &apos;kafka&apos;,
  &apos;topic-pattern&apos;              = &apos;events\\..*&apos;,
  &apos;properties.bootstrap.servers&apos; = &apos;kafka-broker:9092&apos;,
  &apos;properties.group.id&apos;        = &apos;flink-iceberg-ingestor&apos;,
  &apos;scan.startup.mode&apos;          = &apos;latest-offset&apos;,
  &apos;format&apos;                     = &apos;avro-confluent&apos;,
  &apos;avro-confluent.schema-registry.url&apos; = &apos;http://schema-registry:8081&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;topic-pattern&lt;/code&gt; parameter is key. This single source definition captures all topics matching the regular expression &lt;code&gt;events\..*&lt;/code&gt;, which means adding a new topic like &lt;code&gt;events.user_signups&lt;/code&gt; requires no changes to the job.&lt;/p&gt;
&lt;h3&gt;Configure Checkpointing for Exactly-Once Delivery&lt;/h3&gt;
&lt;p&gt;Exactly-once semantics in Flink are a function of checkpointing, not a toggle you set on the Iceberg connector. The Iceberg sink participates in Flink&apos;s checkpointing protocol: when a Flink checkpoint completes, the Iceberg sink commits the data files written during that interval as a new Iceberg snapshot. If the job fails before a checkpoint completes, the uncommitted files are orphaned and the offset position rolls back to the last successful checkpoint.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Enable exactly-once checkpointing
SET &apos;execution.checkpointing.interval&apos;            = &apos;5min&apos;;
SET &apos;execution.checkpointing.mode&apos;                = &apos;EXACTLY_ONCE&apos;;
SET &apos;execution.checkpointing.timeout&apos;             = &apos;10min&apos;;
SET &apos;state.backend&apos;                               = &apos;rocksdb&apos;;
SET &apos;state.backend.incremental&apos;                   = &apos;true&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Five minutes is a reasonable starting checkpoint interval for most production workloads. Shorter intervals produce more files per hour; longer intervals increase recovery time if the job fails. The tradeoff is latency versus operational stability.&lt;/p&gt;
&lt;h3&gt;Configure the Dynamic Iceberg Sink&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Write to Iceberg using the dynamic sink (DataStream API example)
FlinkSink.forRowData(inputStream)
    .tableLoader(TableLoader.fromCatalog(catalogLoader, TableIdentifier.of(&amp;quot;default&amp;quot;, &amp;quot;events_raw&amp;quot;)))
    .upsertMode(false)
    .writeParallelism(8)
    .set(&amp;quot;write.target-file-size-bytes&amp;quot;, String.valueOf(128 * 1024 * 1024)) // 128 MB
    .set(&amp;quot;write.distribution-mode&amp;quot;, &amp;quot;hash&amp;quot;)
    .append();
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the Dynamic Sink variant that auto-routes to multiple tables based on a routing field, the configuration is handled through &lt;code&gt;DynamicRecordWriter&lt;/code&gt; in the Iceberg 1.10 DataStream API. The routing key must be present in each record and map to a valid Iceberg table identifier in your catalog.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/real-time-lakehouse-flink/flink-schema-evolution-sequence.png&quot; alt=&quot;Sequence diagram showing schema evolution flow from Kafka Producer through Flink Job and Dynamic Iceberg Sink to Iceberg Catalog and S3 Storage&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Schema Evolution Without Restarts&lt;/h2&gt;
&lt;p&gt;The sequence illustrated above shows what happens when a Kafka producer adds a new field, &lt;code&gt;region&lt;/code&gt;, to an event that previously only had &lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;event_type&lt;/code&gt;, and &lt;code&gt;ts&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;In a static Flink pipeline, this silently drops the field or crashes the job depending on how the schema validation is configured. In a Dynamic Sink pipeline:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The Flink job receives the event and detects the new field during deserialization.&lt;/li&gt;
&lt;li&gt;The Dynamic Sink calls the Iceberg catalog&apos;s &lt;code&gt;updateSchema()&lt;/code&gt; API to add the &lt;code&gt;region&lt;/code&gt; column as a nullable string.&lt;/li&gt;
&lt;li&gt;Because Iceberg schema evolution is a metadata-only operation, no existing data files are rewritten. The catalog records the new column in the table metadata and associates it with a new schema ID.&lt;/li&gt;
&lt;li&gt;The sink writes the current record, including the &lt;code&gt;region&lt;/code&gt; field, to a new Parquet file using the updated schema.&lt;/li&gt;
&lt;li&gt;All downstream readers (Spark, Trino, Dremio) that query the table see the new column for records where it exists. Historical records return NULL for the &lt;code&gt;region&lt;/code&gt; column, which is correct behavior.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;One important constraint: Iceberg only supports widening schema evolution, not narrowing. You can add columns, rename columns (with full compatibility tracking), and widen numeric types (e.g., &lt;code&gt;int&lt;/code&gt; to &lt;code&gt;long&lt;/code&gt;). You cannot drop columns via the Dynamic Sink&apos;s schema evolution path. Dropping a column requires an explicit catalog operation outside the streaming job.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Operational Patterns: Static Sink vs. Dynamic Iceberg Sink&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/real-time-lakehouse-flink/flink-static-vs-dynamic-sink-comparison.png&quot; alt=&quot;Comparison table showing differences between Static Sink and Dynamic Iceberg Sink across schema changes, multi-table support, new topic handling, Schema Registry integration, and operational overhead&quot;&gt;&lt;/p&gt;
&lt;p&gt;The operational differences become most pronounced at scale. A platform with 50 microservices publishing to 50 Kafka topics, where each topic&apos;s schema evolves independently, requires 50 static Flink jobs under the traditional model. Adding a field to one schema means deploying a code change and restarting a job. With the Dynamic Sink, one Flink job handles all 50 topics, and schema evolution happens without any operator intervention.&lt;/p&gt;
&lt;p&gt;The tradeoff is schema control. With static sinks, your Flink job&apos;s schema definition acts as a contract enforcement layer. Any event that doesn&apos;t match the expected schema fails fast and loud. The Dynamic Sink&apos;s auto-evolution makes this boundary more permissive. You trade strict contract enforcement for operational flexibility.&lt;/p&gt;
&lt;p&gt;For most production teams, the right answer is to combine the Dynamic Sink with Confluent Schema Registry compatibility rules. Set the Schema Registry to &lt;code&gt;FULL_TRANSITIVE&lt;/code&gt; compatibility on your topics, which ensures producers can only make backward-compatible schema changes. The Dynamic Sink then handles the Iceberg-side evolution automatically, while the Schema Registry enforces that producers don&apos;t break downstream consumers.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Managing Small Files from Streaming Writes&lt;/h2&gt;
&lt;p&gt;Every Flink checkpoint produces at least one data file per active partition. With a 5-minute checkpoint interval and data spread across 20 partitions, you produce at least 20 files every 5 minutes. Over 24 hours, that&apos;s 5,760 small files per day before any other workload pressure.&lt;/p&gt;
&lt;p&gt;The files don&apos;t need to be large to cause problems. Query planners read manifest files to build execution plans, and each manifest entry is a file reference. Scanning thousands of manifest entries before reading a single data row degrades planning performance, even when the data itself is small.&lt;/p&gt;
&lt;p&gt;There are two approaches to controlling this, and you need both.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At write time: tune file sizing and writer parallelism.&lt;/strong&gt; Set &lt;code&gt;write.target-file-size-bytes&lt;/code&gt; to 128 MB or higher. Use &lt;code&gt;write.distribution-mode = hash&lt;/code&gt; to route records within each Flink task by partition key before writing, which ensures each task fills larger files rather than writing many small, scattered ones.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;After landing: use Flink native table maintenance.&lt;/strong&gt; Starting with Iceberg 1.7, the &lt;code&gt;flink-iceberg-runtime&lt;/code&gt; JAR includes built-in table maintenance actions that run inside a Flink job. This eliminates the dependency on a separate Spark cluster for compaction.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;// Run Iceberg compaction natively inside a Flink job
TableLoader loader = TableLoader.fromCatalog(catalogLoader,
    TableIdentifier.of(&amp;quot;default&amp;quot;, &amp;quot;events_raw&amp;quot;));

RewriteDataFilesSparkAction rewrite = SparkActions
    .get()
    .rewriteDataFiles(table)
    .option(&amp;quot;target-file-size-bytes&amp;quot;, Long.toString(128L * 1024 * 1024))
    .filter(Expressions.lessThan(&amp;quot;ts&amp;quot;, currentHourMinus2()));

RewriteDataFilesSparkAction.Result result = rewrite.execute();
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A critical operational rule: never compact the hot partition currently receiving streaming writes. The compaction job reads a set of files, rewrites them into larger files, and commits a new snapshot that removes the original files. If your streaming job is concurrently writing to that same partition, the commit can conflict. Restrict compaction to cold partitions, those at least one or two intervals behind the current streaming boundary.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;When Flink Is the Right Choice (and When It Isn&apos;t)&lt;/h2&gt;
&lt;p&gt;Flink is the right tool when you need stateful stream processing, not just event delivery. Joining a stream of user events to a slowly changing dimension table, computing windowed aggregations, or running real-time model inference in &lt;code&gt;ML_PREDICT&lt;/code&gt; table-valued functions; these require Flink&apos;s managed state, event-time handling, and exactly-once guarantees.&lt;/p&gt;
&lt;p&gt;If your use case is straightforward topic-to-table ingestion with no joins or transformations, Flink may be more infrastructure than you need. Kafka Connect with the Iceberg Sink connector handles simple ingestion with less operational overhead, though it lacks Flink&apos;s transformation capabilities and the full Dynamic Sink feature set.&lt;/p&gt;
&lt;p&gt;Where Flink&apos;s real-time lakehouse pattern becomes clearly superior is in multi-source, multi-table scenarios with evolving schemas. If you&apos;re ingesting 50 Kafka topics, performing lightweight enrichment from reference tables, and landing data into 50 Iceberg tables where new fields appear regularly, that&apos;s Flink&apos;s strongest use case and where the Dynamic Sink&apos;s automation provides direct operational savings.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The real-time lakehouse is not a marketing concept. It&apos;s a specific set of architectural decisions: Flink for stateful stream processing, the Dynamic Iceberg Sink for zero-downtime schema evolution and multi-table ingestion, exactly-once checkpointing for delivery guarantees, and native Flink table maintenance for compaction on cold partitions.&lt;/p&gt;
&lt;p&gt;Start with a checkpoint interval of 5 minutes, set your target file size to 128 MB, configure Schema Registry with full transitive compatibility, and don&apos;t compact the hot partition. Those four decisions alone will prevent most of the operational problems that make streaming lakehouses painful to run.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Build on the Lakehouse&lt;/h3&gt;
&lt;p&gt;To go deeper on lakehouse architecture patterns, open table formats, and real-time data platforms, pick up &lt;a href=&quot;https://www.amazon.com/dp/B0GQNY21TD&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI: A Hands-On Practitioner&apos;s Guide to Modern Data Architecture, Open Table Formats, and Agentic AI&lt;/a&gt; for a comprehensive, hands-on treatment of these patterns in production environments.&lt;/p&gt;
&lt;p&gt;Browse Alex&apos;s other data engineering and analytics books at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you want to query your Iceberg tables with sub-second performance after they land from Flink, try Dremio Cloud free for 30 days at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Why Semantic Layers Make Enterprise Text-to-SQL Safer</title><link>https://iceberglakehouse.com/posts/2026-05-24-semantic-layers-text-to-sql/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-24-semantic-layers-text-to-sql/</guid><description>
Text-to-SQL generated serious excitement when early demonstrations showed AI assistants turning plain English into working SQL. It also generated ser...</description><pubDate>Sun, 24 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Text-to-SQL generated serious excitement when early demonstrations showed AI assistants turning plain English into working SQL. It also generated serious skepticism from the analytics engineers who knew what those SQL queries were actually running against: messy schemas with inconsistent column naming, duplicate business logic spread across dozens of views, and metric definitions that varied by team.&lt;/p&gt;
&lt;p&gt;Raw text-to-SQL, meaning a large language model receiving a database schema and a question and generating SQL directly, produces accurate results on toy datasets and embarrassing results on enterprise schemas. Accuracy rates around 40% on real-world enterprise schemas have been reported across several industry evaluations. That&apos;s below the threshold where any responsible team deploys it to business users.&lt;/p&gt;
&lt;p&gt;The semantic layer changes this calculation. When the AI generates SQL against a well-maintained semantic model (where metrics like revenue and churn rate are precisely defined, dimensions are mapped, and synonyms are registered), accuracy climbs to 85–95% in multiple enterprise deployments. The difference isn&apos;t a better LLM. It&apos;s better context.&lt;/p&gt;
&lt;p&gt;This post covers how four different approaches to semantic layers enable reliable enterprise text-to-SQL: Dremio&apos;s natively integrated virtual dataset and reflections architecture, Snowflake Cortex Analyst with Semantic Views, the dbt Semantic Layer powered by MetricFlow, and how to choose between them.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Why Raw Text-to-SQL Fails in Enterprise Environments&lt;/h2&gt;
&lt;p&gt;Enterprise data warehouses are built by humans over years. Column names like &lt;code&gt;ord_amt&lt;/code&gt;, &lt;code&gt;revenue_adj&lt;/code&gt;, and &lt;code&gt;net_rev_usd&lt;/code&gt; might all represent variations of the same underlying concept in different tables, each adjusted for a different business rule. An LLM given raw schema DDL has no way to know which one to use for &amp;quot;total revenue by region last quarter&amp;quot; without additional context.&lt;/p&gt;
&lt;p&gt;Business logic is usually embedded in SQL transformations, not schema definitions. &lt;code&gt;daily_active_users&lt;/code&gt; might require a specific session window, a deduplication step, and an exclusion of internal traffic. None of that is visible from &lt;code&gt;SELECT * FROM users LIMIT 100&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;There&apos;s also the terminology problem. A sales team&apos;s &amp;quot;customer&amp;quot; might join against &lt;code&gt;accounts&lt;/code&gt; in Salesforce sync data, while a support team&apos;s &amp;quot;customer&amp;quot; joins against &lt;code&gt;users&lt;/code&gt; in the product database. An LLM generating SQL against ambiguous schema names has no reliable way to distinguish these without documentation it doesn&apos;t have.&lt;/p&gt;
&lt;p&gt;Finally, most large enterprise schemas contain hundreds or thousands of tables. An LLM prompted with an entire schema doesn&apos;t have a useful understanding of which tables matter for which business questions, it&apos;s working with a phone book when it needs a guided directory.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Semantic Layer: Structured Context for AI&lt;/h2&gt;
&lt;p&gt;A semantic layer provides the translation layer between business concepts and database tables. At minimum it defines:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Metrics&lt;/strong&gt;: Precisely computed business measures with their SQL definitions, filters, and aggregation logic&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dimensions&lt;/strong&gt;: The attributes business users slice and filter by, with human-readable labels&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Joins&lt;/strong&gt;: How tables relate to each other for cross-entity queries&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Synonyms&lt;/strong&gt;: Alternative names business users might say for the same concept&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Business descriptions&lt;/strong&gt;: Documentation that explains what each metric measures and how it&apos;s calculated&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When text-to-SQL is routed through a semantic layer, the AI doesn&apos;t generate SQL against raw schema; it generates SQL against a governed vocabulary of pre-defined metrics and dimensions. The generated SQL is guaranteed to use the correct table joins, the correct filters, and the correct aggregation logic because those definitions exist in the semantic model, not in the AI&apos;s general knowledge.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/semantic-layers-text-to-sql/semantic-layer-text-sql-routing.png&quot; alt=&quot;Semantic layer text-to-SQL routing architecture showing user question flowing through semantic router to either semantic model for trusted deterministic SQL or falling back to raw LLM text-to-SQL with human review&quot;&gt;&lt;/p&gt;
&lt;p&gt;The routing architecture works like this: a user submits a natural language question. A semantic router classifies the intent and determines whether the question can be answered using a defined metric or dimension from the semantic model. If yes, the question is routed to the semantic layer, which generates deterministic SQL using the metric definition. If no (the question is outside the semantic model&apos;s coverage), the system either falls back to raw text-to-SQL with human review gates, or returns a message asking the user to rephrase.&lt;/p&gt;
&lt;p&gt;This routing discipline is what makes the accuracy improvement so dramatic. Questions within the semantic model&apos;s coverage are answered deterministically, the SQL is generated from governed metric definitions, not LLM inference. Questions outside coverage either have a human review checkpoint or are declined gracefully. The system never silently generates plausible-but-wrong SQL from raw schema and serves it as a trusted answer.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Dremio: Semantic Layer Natively Integrated with the Query Engine&lt;/h2&gt;
&lt;p&gt;Dremio takes a different architectural approach from standalone semantic layer tools. Instead of a separate service that sits between BI tools and a warehouse, Dremio&apos;s semantic layer is natively integrated into the query engine and catalog. This integration enables capabilities that are difficult to achieve with add-on semantic layers.&lt;/p&gt;
&lt;p&gt;The core of Dremio&apos;s semantic modeling is a three-tier virtual dataset architecture:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Preparation Layer:&lt;/strong&gt; A 1-to-1 mapping to source tables. These views handle cleansing, type casting, column renaming, and normalization. No business logic lives here, just the transformations needed to make raw data consistent and usable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Business Layer:&lt;/strong&gt; Where business logic and metric definitions live. Joins between entities, calculated metrics, and approved business definitions are encoded here. This is the layer an LLM or BI tool should be reasoning about when answering business questions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Application Layer:&lt;/strong&gt; Tailored views optimized for specific consumers, a BI dashboard, an AI agent, a data science notebook. These views are narrow, purpose-built, and carry the precise definitions their consumers need.&lt;/p&gt;
&lt;p&gt;This layering creates a stable semantic surface that AI tools, BI dashboards, and data science notebooks all consume from the same governed definitions. A metric defined in the business layer propagates to all consumers automatically.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/semantic-layers-text-to-sql/dremio-semantic-layer-three-tier.png&quot; alt=&quot;Dremio three-tier semantic layer architecture showing Preparation Layer at bottom connecting to Apache Iceberg tables, Business Layer in the middle with metric definitions and KPIs, Application Layer at top serving BI dashboards, AI text-to-SQL agents, and data science notebooks, with Dremio Reflections on the right accelerating the Business Layer for sub-second query performance&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Reflections: Performance Without Data Movement&lt;/h3&gt;
&lt;p&gt;Dremio&apos;s Reflections feature adds an acceleration dimension that most semantic layers can&apos;t match. A Reflection is a materialized, optimized view of a dataset or aggregation that Dremio maintains automatically. When a query hits a dataset covered by a Reflection, Dremio transparently rewrites the query to use the optimized materialization instead of re-running the raw join and aggregation logic.&lt;/p&gt;
&lt;p&gt;For text-to-SQL use cases, this means the semantic layer isn&apos;t just providing correct context, it&apos;s also providing fast results. When an AI assistant routes a natural language question through Dremio&apos;s semantic model, the resulting SQL benefits from Reflection-based acceleration without requiring the AI to know anything about the underlying physical optimization.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Create an aggregation reflection for revenue analytics
-- This materializes the join and aggregation, accelerating downstream text-to-SQL
ALTER DATASET &amp;quot;business_layer&amp;quot;.&amp;quot;revenue_analytics&amp;quot;
CREATE AGGREGATE REFLECTION &amp;quot;revenue_daily_agg&amp;quot;
USING DISPLAY (region, product_category)
DIMENSIONS (region, product_category, order_date)
MEASURES (total_revenue BY SUM, order_count BY COUNT);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Dremio analyzes query patterns and recommends Reflections automatically. The system then rewrites incoming queries to use the appropriate Reflection, delivering sub-second responses on queries that would otherwise require expensive multi-table joins.&lt;/p&gt;
&lt;h3&gt;Generative AI Integration: Automatic Metadata Generation&lt;/h3&gt;
&lt;p&gt;Dremio has integrated generative AI directly into its semantic layer metadata management. The system can automatically generate wikis, labels, and descriptions for tables and virtual datasets, making data more discoverable without requiring manual documentation efforts.&lt;/p&gt;
&lt;p&gt;For text-to-SQL accuracy, this automatic metadata generation directly improves the context available to AI models. When column descriptions, business definitions, and usage notes are automatically maintained and up-to-date, the AI has richer, more accurate context to draw from when generating SQL.&lt;/p&gt;
&lt;p&gt;Natural language discovery (finding datasets by describing what you&apos;re looking for in plain English rather than knowing specific table names), further extends the semantic layer&apos;s value. A business analyst who doesn&apos;t know that revenue data lives in &lt;code&gt;fct_orders&lt;/code&gt; can describe &amp;quot;I need revenue by customer segment for Q1&amp;quot; and Dremio&apos;s catalog surfaces the appropriate dataset automatically.&lt;/p&gt;
&lt;h3&gt;Governed Access Through the Semantic Layer&lt;/h3&gt;
&lt;p&gt;Dremio&apos;s semantic layer includes built-in fine-grained access control. Row-level security and column masking policies apply through the virtual dataset layer, which means business users querying a &amp;quot;Revenue by Region&amp;quot; dataset automatically see only the regions they&apos;re authorized for, without requiring the AI or the BI tool to implement access filtering.&lt;/p&gt;
&lt;p&gt;This is architecturally significant for AI use cases. When an LLM generates SQL against a Dremio virtual dataset that has row-level security configured, the row filter is enforced at execution time by the query engine. The AI doesn&apos;t need to know about access policies, they&apos;re invisible to the query generation layer but always enforced.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Snowflake Cortex Analyst&lt;/h2&gt;
&lt;p&gt;Snowflake Cortex Analyst is Snowflake&apos;s native managed text-to-SQL service. It&apos;s designed to work with Snowflake Semantic Views, objects defined in Snowflake&apos;s metadata layer that describe metrics, measures, and dimension relationships.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Define a Snowflake Semantic View for revenue analytics
CREATE OR REPLACE SEMANTIC VIEW revenue_analytics AS
    SELECT
        o.order_date,
        c.region,
        SUM(o.amount) AS total_revenue,
        COUNT(DISTINCT o.customer_id) AS unique_customers
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    WHERE o.status = &apos;completed&apos;
    GROUP BY 1, 2;

-- Annotate with semantic metadata
COMMENT ON SEMANTIC VIEW revenue_analytics IS
    &apos;Daily revenue by region for completed orders&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Cortex Analyst uses the semantic view definitions to constrain its SQL generation. A user asking &amp;quot;what was revenue in the west region last week?&amp;quot; generates a SQL query against the pre-defined &lt;code&gt;total_revenue&lt;/code&gt; metric with the &lt;code&gt;region&lt;/code&gt; and &lt;code&gt;order_date&lt;/code&gt; filters applied correctly, not an ad-hoc query that might join the wrong tables.&lt;/p&gt;
&lt;p&gt;The Cortex Analyst API returns both the SQL it generated and the underlying semantic view it used, providing full transparency about the query generation process. This auditability matters for enterprise deployments where understanding why a query was generated a certain way is as important as the result.&lt;/p&gt;
&lt;p&gt;Cortex Analyst is Snowflake-specific. The accuracy advantages it provides apply within Snowflake environments, and the semantic views cannot be ported to Databricks, BigQuery, or other engines.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;dbt Semantic Layer and MetricFlow&lt;/h2&gt;
&lt;p&gt;The dbt Semantic Layer, powered by MetricFlow, takes a different architectural approach. Metrics are defined in YAML files in a dbt project, versioned in Git alongside the SQL models that provide the underlying data.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;# metrics/revenue.yml
semantic_models:
  - name: orders
    defaults:
      agg_time_dimension: order_date
    model: ref(&apos;fct_orders&apos;)
    entities:
      - name: order
        type: primary
        expr: order_id
      - name: customer
        type: foreign
        expr: customer_id
    measures:
      - name: total_revenue
        agg: sum
        expr: amount
        filter: &amp;quot;status = &apos;completed&apos;&amp;quot;
    dimensions:
      - name: region
        type: categorical
        expr: region
      - name: order_date
        type: time
        type_params:
          time_granularity: day

metrics:
  - name: revenue
    type: simple
    type_params:
      measure: total_revenue
    label: &amp;quot;Total Revenue&amp;quot;
    description: &amp;quot;Sum of completed order amounts&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the metric definitions are code in a Git repository, they go through the same review processes as SQL models. Changes to metric definitions are auditable. Teams can see the history of how a metric definition evolved and who approved each change.&lt;/p&gt;
&lt;p&gt;The dbt Semantic Layer operates as a service that sits between BI tools and the data warehouse. Tableau, Power BI, Hex, and other supported BI tools query the semantic layer using the MetricFlow API, which translates the metric requests into warehouse-native SQL. AI tools that integrate with dbt&apos;s API can use the same metric definitions for text-to-SQL generation.&lt;/p&gt;
&lt;p&gt;The key advantage of the dbt approach is portability. The same metric YAML definitions work against Snowflake, BigQuery, Databricks, Redshift, and other supported warehouses. Organizations that want to run multi-warehouse experiments or migrate between warehouses carry their metric definitions with them in the same Git repository.&lt;/p&gt;
&lt;p&gt;The limitation is coupling to the dbt ecosystem. Teams that don&apos;t use dbt for transformation logic face a significant setup cost to build out dbt models as the foundation for semantic model definitions.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Building the Synonym and Description Library&lt;/h2&gt;
&lt;p&gt;Regardless of which semantic layer tool you use, the investment in synonym and description management is what separates good text-to-SQL implementations from great ones.&lt;/p&gt;
&lt;p&gt;Business users don&apos;t ask questions in schema language. They ask:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;quot;How many customers&amp;quot; (not &amp;quot;COUNT DISTINCT customer_id&amp;quot;)&lt;/li&gt;
&lt;li&gt;&amp;quot;Revenue&amp;quot; (not &amp;quot;SUM(amount) WHERE status = &apos;completed&apos;&amp;quot;)&lt;/li&gt;
&lt;li&gt;&amp;quot;Last month&amp;quot; (not &amp;quot;WHERE order_date &amp;gt;= DATE_TRUNC(&apos;month&apos;, CURRENT_DATE - INTERVAL 1 MONTH)&amp;quot;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A semantic layer that only maps technical terms to metric definitions still requires users to know the technical vocabulary. One with a rich synonym library handles the natural language variation users actually produce.&lt;/p&gt;
&lt;p&gt;In Dremio, synonyms and business descriptions are maintained alongside virtual dataset definitions in the catalog. In dbt, descriptions are YAML metadata fields. In Snowflake, semantic view annotations and Cortex-specific metadata files provide the synonym mapping.&lt;/p&gt;
&lt;p&gt;Practical synonym management requires a feedback loop: when text-to-SQL questions fail to route correctly, the routing failures are logged, reviewed, and used to add new synonyms. Teams that treat synonym management as a one-time setup task see accuracy plateau. Teams that maintain a feedback loop see accuracy improve over time as the semantic model&apos;s vocabulary coverage expands.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Choosing Between Dremio, Snowflake Cortex Analyst, and dbt&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/semantic-layers-text-to-sql/cortex-vs-dbt-semantic-comparison.png&quot; alt=&quot;Comparison table showing Dremio, Snowflake Cortex Analyst, and dbt Semantic Layer across architecture, best use case, portability, acceleration, governance, and setup complexity dimensions&quot;&gt;&lt;/p&gt;
&lt;p&gt;The choice between semantic layer approaches comes down to four factors:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Platform coupling:&lt;/strong&gt; If your analytics platform is Snowflake-native, Cortex Analyst provides the lowest-friction path with no external service dependencies. If you&apos;re multi-cloud or plan to stay engine-agnostic, dbt Semantic Layer or Dremio&apos;s virtual dataset approach provide more portability.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse architecture:&lt;/strong&gt; If you&apos;re building on Apache Iceberg in a cloud object store and want cross-engine query access with built-in acceleration, Dremio&apos;s integrated semantic layer is purpose-built for this. The Reflections system provides a materialization strategy that serves both BI and AI query workloads without requiring a separate caching layer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Governance requirements:&lt;/strong&gt; Dremio&apos;s native integration with its query engine means access control policies apply at the semantic layer and propagate to all query paths, SQL, BI, and AI-generated. This reduces the surface area where access policies can be bypassed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Team skills:&lt;/strong&gt; dbt Semantic Layer requires analytics engineering investment in YAML metric definitions and model maintenance. Snowflake Cortex Analyst requires SQL DDL for semantic views. Dremio&apos;s virtual dataset approach requires SQL-based view building but benefits from a guided UI and AI-assisted metadata generation.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The AI Reliability Improvement in Practice&lt;/h2&gt;
&lt;p&gt;The jump from 40% to 85–95% accuracy on enterprise text-to-SQL questions doesn&apos;t come uniformly. Accuracy improvements are sharpest on:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Questions about standard business metrics (revenue, churn, DAU) that are fully defined in the semantic model&lt;/li&gt;
&lt;li&gt;Questions involving time period filters (&amp;quot;last quarter&amp;quot;, &amp;quot;year to date&amp;quot;) that the semantic layer maps to correct date expressions&lt;/li&gt;
&lt;li&gt;Questions that join entities that the semantic model has pre-defined relationships for&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Accuracy improvements are smallest on:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Questions requiring logic not captured in the semantic model&lt;/li&gt;
&lt;li&gt;Questions that span multiple domains without pre-defined cross-domain joins&lt;/li&gt;
&lt;li&gt;Complex analytical questions requiring window functions or advanced SQL that the semantic layer doesn&apos;t expose&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is why semantic model coverage expansion is an ongoing practice, not a one-time project. Each category of unanswered questions represents an opportunity to extend the semantic model&apos;s coverage and push accuracy higher.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Text-to-SQL without a semantic layer is an interesting demo. Text-to-SQL grounded in a well-maintained semantic model is an enterprise capability. The jump from 40% to 85–95% accuracy isn&apos;t free; it requires investment in defining metrics, maintaining synonyms, and extending semantic model coverage as the business evolves. But that investment is far lower than the alternative: building and maintaining approval workflows for every AI-generated SQL query that analytics users need reviewed before acting on.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s integrated approach (native semantic layer, automatic Reflection acceleration, AI-assisted metadata generation, and cross-engine governance), offers a particularly compelling path for organizations building on Iceberg lakehouses. For Snowflake-native shops, Cortex Analyst provides managed text-to-SQL without infrastructure overhead. For multi-warehouse environments with analytics engineering teams, dbt Semantic Layer provides the best portability and code-first governance.&lt;/p&gt;
&lt;p&gt;The semantic layer turns AI-generated analytics from a liability into a controlled surface. That&apos;s the version enterprise teams can actually trust.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Go Deeper on Data Reliability&lt;/h3&gt;
&lt;p&gt;For comprehensive guidance on building reliable, governed data architectures, pick up &lt;a href=&quot;https://www.amazon.com/dp/B0GQNY21TD&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI: A Hands-On Practitioner&apos;s Guide to Modern Data Architecture, Open Table Formats, and Agentic AI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Browse Alex&apos;s other data engineering and analytics books at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s natively integrated semantic layer, with AI-assisted metadata generation, Reflections acceleration, and fine-grained access control, makes it the ideal foundation for trusted enterprise text-to-SQL and AI analytics. Try Dremio Cloud free at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Choosing Vector Stores for Retrieval Workloads</title><link>https://iceberglakehouse.com/posts/2026-05-24-vector-stores-retrieval/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-24-vector-stores-retrieval/</guid><description>
Vector retrieval has become a standard component in data platform architectures, not just an ML research topic. RAG pipelines use it to retrieve docu...</description><pubDate>Sun, 24 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Vector retrieval has become a standard component in data platform architectures, not just an ML research topic. RAG pipelines use it to retrieve document context before generation. Recommendation systems use it to find similar items. Search applications use it to retrieve semantically relevant results that keyword search misses.&lt;/p&gt;
&lt;p&gt;The vector store market has matured rapidly. pgvector brings approximate nearest neighbor (ANN) search to PostgreSQL. Milvus provides a purpose-built distributed vector database designed for billions of vectors. Weaviate integrates hybrid dense and sparse search with a multi-modal retrieval model. LanceDB uses the Lance columnar format for disk-native vector retrieval optimized for ML workflows.&lt;/p&gt;
&lt;p&gt;Each of these tools makes different tradeoffs that matter in practice. This guide is about those tradeoffs, not which tool markets itself best, but which tool fits specific workload and operational requirements.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Index Types: HNSW vs IVFFlat vs DiskANN&lt;/h2&gt;
&lt;p&gt;The index algorithm determines the recall-latency tradeoff for approximate nearest neighbor search.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;HNSW (Hierarchical Navigable Small World)&lt;/strong&gt; builds a multi-layer proximity graph. Queries traverse the graph from coarse to fine layers, achieving high recall with low latency. HNSW is the dominant choice for high-performance vector retrieval. Its limitation is memory: the graph must fit in RAM, making it unsuitable for datasets larger than available memory on a single node.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;IVFFlat (Inverted File with Flat Quantization)&lt;/strong&gt; partitions vectors into clusters and searches only the nearest clusters during query. It uses less memory than HNSW at the cost of lower recall for the same search effort. IVFFlat is appropriate when memory is constrained and some recall degradation is acceptable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DiskANN&lt;/strong&gt; (used by Milvus) is designed for billion-scale datasets where HNSW&apos;s memory requirements are prohibitive. It stores the index partially on disk and partially in RAM, trading some latency for dramatically better scale. For datasets in the tens of billions of vectors, DiskANN is often the only practical option.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;IVF-PQ (used by LanceDB)&lt;/strong&gt; combines inverted file indexing with Product Quantization, which compresses vectors before indexing. This enables disk-native storage of very large vector datasets without requiring the full vector in memory during search.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;pgvector: Vector Search Inside PostgreSQL&lt;/h2&gt;
&lt;p&gt;pgvector extends PostgreSQL with vector data types, indexes, and similarity search operations. If you&apos;re already using PostgreSQL, adding vector search is an extension install and schema change.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Enable pgvector extension
CREATE EXTENSION vector;

-- Create a table with a vector column
CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536),  -- OpenAI text-embedding-3-small dimensions
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Create an HNSW index for fast approximate nearest neighbor search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Semantic similarity search
SELECT id, content,
       1 - (embedding &amp;lt;=&amp;gt; $1::vector) AS similarity
FROM documents
WHERE created_at &amp;gt; NOW() - INTERVAL &apos;30 days&apos;
ORDER BY embedding &amp;lt;=&amp;gt; $1::vector
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;pgvector&apos;s operational advantage is zero new infrastructure. Your existing PostgreSQL setup, backup procedures, replication topology, and tooling all apply to vector columns without modification. The limitation is scale: HNSW indexes must fit in RAM, which practically limits pgvector to datasets of millions of vectors on typical server configurations.&lt;/p&gt;
&lt;p&gt;For hybrid search (combining dense vector similarity with keyword (BM25) relevance), pgvector uses PostgreSQL&apos;s native &lt;code&gt;tsvector&lt;/code&gt; full-text search in combination with vector search, joined by RRF (Reciprocal Rank Fusion) or similar fusion scoring. This requires more manual implementation than Milvus or Weaviate&apos;s native hybrid search capabilities.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Milvus: Purpose-Built for Scale&lt;/h2&gt;
&lt;p&gt;Milvus is a purpose-built vector database designed for production workloads at billions of vectors. It supports multiple index types (HNSW, IVF, DiskANN), hardware-accelerated search (SIMD, GPU), and a native hybrid search pipeline that combines dense and sparse retrieval in a single query.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from pymilvus import connections, Collection, DataType, FieldSchema, CollectionSchema

# Connect to Milvus
connections.connect(&amp;quot;default&amp;quot;, host=&amp;quot;localhost&amp;quot;, port=&amp;quot;19530&amp;quot;)

# Define a collection schema
schema = CollectionSchema(fields=[
    FieldSchema(&amp;quot;id&amp;quot;, DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(&amp;quot;text&amp;quot;, DataType.VARCHAR, max_length=65535),
    FieldSchema(&amp;quot;dense_embedding&amp;quot;, DataType.FLOAT_VECTOR, dim=1536),
    FieldSchema(&amp;quot;sparse_embedding&amp;quot;, DataType.SPARSE_FLOAT_VECTOR)
])

collection = Collection(&amp;quot;documents&amp;quot;, schema)

# Hybrid search: combine dense and sparse retrieval
from pymilvus import AnnSearchRequest, WeightedRanker

dense_req = AnnSearchRequest(
    data=[query_dense_embedding],
    anns_field=&amp;quot;dense_embedding&amp;quot;,
    param={&amp;quot;nprobe&amp;quot;: 20},
    limit=50
)

sparse_req = AnnSearchRequest(
    data=[query_sparse_embedding],
    anns_field=&amp;quot;sparse_embedding&amp;quot;,
    param={},
    limit=50
)

results = collection.hybrid_search(
    reqs=[dense_req, sparse_req],
    rerank=WeightedRanker(0.8, 0.2),  # 80% dense, 20% sparse
    limit=10
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Milvus&apos;s hybrid search combines BGEM3 sparse embeddings (or BM25-style representations) with dense embeddings in a single query with configurable weighting. This is particularly valuable for domain-specific retrieval where exact term matching (sparse) and semantic similarity (dense) both contribute signal.&lt;/p&gt;
&lt;p&gt;The operational cost of Milvus is high. It requires running etcd (for metadata), MinIO or S3 (for persistence), and multiple service components (proxy, query nodes, data nodes, index nodes). For teams without dedicated infrastructure, Zilliz Cloud provides a managed Milvus service.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Weaviate: Hybrid Search and Multi-Modal Retrieval&lt;/h2&gt;
&lt;p&gt;Weaviate implements hybrid search by combining HNSW-based dense vector search with BM25 sparse search, with the fusion handled natively:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import weaviate
from weaviate.classes.query import HybridFusion

client = weaviate.connect_to_local()
collection = client.collections.get(&amp;quot;Documents&amp;quot;)

# Hybrid search with BM25 + dense vector
results = collection.query.hybrid(
    query=&amp;quot;machine learning model deployment&amp;quot;,  # Used for both BM25 and embedding
    fusion_type=HybridFusion.RELATIVE_SCORE,
    alpha=0.75,  # 0=pure BM25, 1=pure vector, 0.75=mostly vector
    limit=10,
    return_metadata=weaviate.classes.query.MetadataQuery(score=True)
)

for obj in results.objects:
    print(f&amp;quot;Score: {obj.metadata.score}, Content: {obj.properties[&apos;content&apos;][:100]}&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Weaviate&apos;s &lt;code&gt;alpha&lt;/code&gt; parameter controls the blend between sparse and dense retrieval. For domain-specific technical content where terminology matters, lower alpha (more BM25 weight) often improves precision. For general semantic retrieval, higher alpha (more vector weight) captures meaning better.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;LanceDB: Disk-Native for ML Workflows&lt;/h2&gt;
&lt;p&gt;LanceDB uses the Lance columnar format, a format designed for efficient random access alongside columnar scan performance. Unlike HNSW-based stores that require indexes in RAM, LanceDB&apos;s IVF-PQ index is disk-native, making it practical for large-scale ML datasets that don&apos;t fit in memory.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import lancedb
import numpy as np

db = lancedb.connect(&amp;quot;./my-lance-db&amp;quot;)

# Create a table
table = db.create_table(&amp;quot;embeddings&amp;quot;, data=[
    {&amp;quot;id&amp;quot;: 1, &amp;quot;text&amp;quot;: &amp;quot;example document&amp;quot;, &amp;quot;vector&amp;quot;: np.random.rand(1536).tolist()},
])

# Query for nearest neighbors
results = table.search(query_vector) \
    .metric(&amp;quot;cosine&amp;quot;) \
    .limit(10) \
    .to_pandas()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;LanceDB integrates with DuckDB for SQL-based analytics on the same dataset; you can run aggregation queries and vector similarity searches against the same Lance table without data movement. This is particularly useful for ML workflows where you need both analytical queries (row counts by label, feature statistics) and retrieval queries (find similar training examples).&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Selection Guide&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/vector-stores-retrieval/vector-store-selection-guide.png&quot; alt=&quot;Vector store comparison table showing pgvector, Milvus, Weaviate, and LanceDB across index type, hybrid search support, operational complexity, data scale, SQL interface, cloud managed option, and best use case&quot;&gt;&lt;/p&gt;
&lt;p&gt;The primary decision factors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Existing PostgreSQL investment + millions of vectors:&lt;/strong&gt; pgvector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Billions of vectors + production scale-out:&lt;/strong&gt; Milvus / Zilliz Cloud&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-modal hybrid search + moderate scale:&lt;/strong&gt; Weaviate&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ML/AI workflows + disk-native large datasets:&lt;/strong&gt; LanceDB&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All four options support cloud-managed deployments, reducing the operational burden of running infrastructure.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Embedding Model Choice and Dimension Tradeoffs&lt;/h2&gt;
&lt;p&gt;The vector store choice is only half of the retrieval architecture decision. The embedding model determines vector dimensionality, quality of semantic similarity, and encoding latency, all of which affect the operational characteristics of the vector store.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;OpenAI text-embedding-3-small:&lt;/strong&gt; 1536 dimensions. Good general-purpose text retrieval quality with low cost. Works well with all four vector stores.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;OpenAI text-embedding-3-large:&lt;/strong&gt; 3072 dimensions. Higher quality at higher cost and storage. Memory requirements for HNSW indexes scale with dimensionality, moving from 1536 to 3072 dimensions approximately doubles the HNSW index size.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cohere embed-v3:&lt;/strong&gt; 1024 dimensions with strong multilingual performance. Lower dimensionality reduces storage and memory costs while maintaining competitive retrieval quality for multilingual content.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;BGE-M3 (BAAI):&lt;/strong&gt; Produces both dense embeddings and sparse representations in a single model pass. This is the foundation for Milvus hybrid search, dense and sparse representations from the same model, enabling highly effective hybrid retrieval without running separate embedding and BM25 pipelines.&lt;/p&gt;
&lt;p&gt;Dimensionality matters operationally because HNSW indexes must fit in RAM. For a 1-million-vector dataset with 1536 dimensions using float32 encoding:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;1,000,000 vectors × 1,536 dimensions × 4 bytes = 6.1 GB (vectors alone)
HNSW graph overhead ≈ 30-40 additional bytes per vector × 1,000,000 = 30-40 MB
Total index memory ≈ 6.1 GB + graph overhead
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For a server with 16 GB RAM, this is feasible. For 10 million vectors at 3072 dimensions, the math exceeds 120 GB; requiring IVFFlat (which can use less RAM at the cost of recall), DiskANN, or LanceDB&apos;s IVF-PQ disk-native approach.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Hybrid Search: Dense + Sparse in Practice&lt;/h2&gt;
&lt;p&gt;Hybrid search combines dense vector retrieval (semantic similarity) with sparse keyword retrieval (BM25/TF-IDF term matching). The intuition is that dense retrieval handles paraphrase and synonym matches well but can miss precise technical terms; sparse retrieval handles exact term matching well but misses semantic equivalents.&lt;/p&gt;
&lt;p&gt;For domain-specific retrieval (internal documentation, legal texts, medical literature), hybrid search typically outperforms either approach alone by 10-20% on NDCG@10 benchmarks.&lt;/p&gt;
&lt;p&gt;The fusion strategy combines results from both retrievers. Reciprocal Rank Fusion (RRF) is the most common:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def reciprocal_rank_fusion(dense_results, sparse_results, k=60):
    &amp;quot;&amp;quot;&amp;quot;
    Combine dense and sparse retrieval results using RRF.
    k=60 is the standard constant (empirically good across many benchmarks).
    &amp;quot;&amp;quot;&amp;quot;
    scores = {}

    for rank, doc in enumerate(dense_results):
        doc_id = doc[&amp;quot;id&amp;quot;]
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)

    for rank, doc in enumerate(sparse_results):
        doc_id = doc[&amp;quot;id&amp;quot;]
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)

    # Sort by combined score
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Milvus and Weaviate implement RRF and weighted score fusion natively. For pgvector, RRF requires implementing the fusion logic in application code or PostgreSQL SQL.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Index Tuning for Production Performance&lt;/h2&gt;
&lt;p&gt;Each index type has tunable parameters that trade recall for speed and memory:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;HNSW tuning (pgvector, Weaviate, Milvus):&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- pgvector HNSW with tuned parameters
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (
    m = 16,              -- Max connections per layer (higher = better recall, more memory)
    ef_construction = 64  -- Build-time candidate set size (higher = better quality, slower build)
);

-- At query time, increase ef_search for higher recall (at latency cost)
SET hnsw.ef_search = 200;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;IVFFlat tuning (pgvector fallback):&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);  -- More lists = lower recall per probe, fewer lists = more memory scanned

-- Query-time: increase probes for higher recall
SET ivfflat.probes = 50;  -- Query 50 out of 1000 lists
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The recall-latency tradeoff is real and measurable. For production deployments, establish a minimum recall threshold (often 95% recall at top-10) and tune ef_search or probes to meet that threshold with the lowest latency at expected QPS.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Production Monitoring for Vector Retrieval&lt;/h2&gt;
&lt;p&gt;Vector stores require different monitoring than traditional databases. Beyond standard infrastructure metrics (CPU, memory, disk), retrieval quality monitoring is essential:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Recall monitoring.&lt;/strong&gt; Periodically evaluate retrieval recall against a labeled ground-truth test set. If recall drops (because index tuning drifted from optimal or the data distribution changed), it&apos;s often invisible in latency metrics but visible in downstream task performance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Embedding freshness.&lt;/strong&gt; If documents are updated without re-embedding and re-indexing, retrieval returns stale results for updated content. Monitor the gap between document update timestamps and their embedding update timestamps.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Distribution drift.&lt;/strong&gt; When the query embedding distribution drifts significantly from the indexed document distribution (because a new use case is generating different query types), retrieval quality degrades. Monitoring average cosine similarity of returned results provides a signal.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ANN algorithm performance.&lt;/strong&gt; For HNSW, monitor index build time and memory usage as the dataset grows. For LanceDB IVF-PQ, monitor the number of partitions scanned per query.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Security Considerations for Vector Stores&lt;/h2&gt;
&lt;p&gt;Enterprise vector stores present security challenges that are different from traditional databases. The primary concern is what has been called &amp;quot;embedding inversion&amp;quot;, the theoretical and practical ability to reconstruct the original text from an embedding vector.&lt;/p&gt;
&lt;p&gt;For most practical enterprise deployments, embedding inversion is not a realistic attack vector against the embedding vectors themselves. Current embedding models produce high-dimensional representations where reconstruction is computationally impractical. However, the threat model matters: if vectors are stored in a system accessible to untrusted parties, the combination of vector similarity search and inference can reveal membership information (whether a specific document is in the database), even if the document text itself isn&apos;t returned.&lt;/p&gt;
&lt;p&gt;The practical security controls for enterprise vector stores:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Access control at the retrieval layer.&lt;/strong&gt; Vector search results should go through the same access control checks as any other data retrieval. For multi-tenant deployments where different user groups should only retrieve documents from their namespace, enforce namespace isolation in the query filter, never retrieve across tenants and filter after the fact.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Encryption at rest.&lt;/strong&gt; All four vector stores discussed in this post support encrypted storage. For compliance-sensitive environments (HIPAA, SOC 2 Type II), verify that encryption applies to both the vector data and the associated metadata fields.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Audit logging for similarity search queries.&lt;/strong&gt; Unlike SQL queries, similarity search queries don&apos;t have a natural human-readable representation in audit logs. Log the query embedding (or a hash of it), the number of results returned, the user identity, and the timestamp. The embedding itself can be stored encrypted for forensic purposes without being readable in normal audit review.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tokenization and content filtering.&lt;/strong&gt; For RAG applications serving external users, the content retrieved by vector search should pass through a content filter before inclusion in the LLM prompt. Adversarial documents in the corpus can attempt to manipulate the LLM&apos;s behavior through retrieval, a technique called &amp;quot;indirect prompt injection.&amp;quot; Filtering retrieved content against a predefined allowlist of acceptable content patterns reduces this risk.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Evaluating Retrieval Quality: Beyond &amp;quot;Does It Return Results&amp;quot;&lt;/h2&gt;
&lt;p&gt;One of the most underinvested areas in production RAG and retrieval systems is systematic evaluation of retrieval quality. Teams often measure whether the system returns results, but rarely measure whether it returns the right results, and whether retrieval quality is stable over time as the document corpus and query distribution evolve.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Recall@K&lt;/strong&gt; is the primary retrieval quality metric: given a query for which the ground-truth relevant documents are known, what fraction of those relevant documents appear in the top K results? A Recall@10 of 0.85 means the system returns 8-9 of 10 relevant documents in its first page of results.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mean Reciprocal Rank (MRR)&lt;/strong&gt; measures where the first relevant result appears. If the first relevant result is always in position 1, MRR = 1.0. If it typically appears at position 3 or 4, MRR ≈ 0.3. For RAG applications where the LLM uses the top 3-5 retrieved documents, high MRR is more important than high Recall at large K values.&lt;/p&gt;
&lt;p&gt;Building a retrieval evaluation suite requires a labeled dataset of query-document relevance pairs. For internal enterprise deployments, this can be bootstrapped from historical query logs combined with analyst feedback on result quality. Even a small evaluation set of 200-300 labeled queries provides enough signal to detect retrieval regressions when index parameters or embedding models change.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The vector store landscape in 2026 has matured from a niche ML research tool to production infrastructure for enterprise search and AI applications. pgvector, Milvus, Weaviate, and LanceDB each address different points in the scale-complexity-capability tradeoff.&lt;/p&gt;
&lt;p&gt;The optimal architecture choice depends on current data scale, operational team capacity, existing infrastructure investments, and the specific retrieval quality requirements of the application. For teams starting fresh in a mid-scale environment (1-100 million vectors), Weaviate provides the best balance of hybrid search capability, manageable operations, and cloud-managed deployment options. Teams already running PostgreSQL should evaluate pgvector as a zero-new-infrastructure option before investing in a purpose-built vector database.&lt;/p&gt;
&lt;p&gt;At billion-vector scale, the choice narrows to Milvus (in-memory with DiskANN for large indexes) or LanceDB (fully disk-native). The operational overhead of Milvus is significant but manageable for teams with dedicated infrastructure capacity.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Explore Further&lt;/h3&gt;
&lt;p&gt;For comprehensive coverage of AI-native data architectures, vector retrieval patterns, and lakehouse integration, pick up &lt;a href=&quot;https://www.amazon.com/dp/B0GQNY21TD&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI: A Hands-On Practitioner&apos;s Guide to Modern Data Architecture, Open Table Formats, and Agentic AI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Browse Alex&apos;s other data engineering and analytics books at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For federated analytics across your data products including vector-enriched datasets, try Dremio Cloud free at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Single-Node Data Engineering: DuckDB, DataFusion, Polars, and LakeSail</title><link>https://iceberglakehouse.com/posts/2026-05-23-single-node-data-engineering-duckdb-datafusion-polars-lakesail/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-23-single-node-data-engineering-duckdb-datafusion-polars-lakesail/</guid><description>
For the past decade, data engineering was synonymous with distributed clusters. If your dataset exceeded a few gigabytes, standard practice dictated ...</description><pubDate>Sat, 23 May 2026 14:00:00 GMT</pubDate><content:encoded>&lt;p&gt;For the past decade, data engineering was synonymous with distributed clusters. If your dataset exceeded a few gigabytes, standard practice dictated spinning up an Apache Spark cluster on AWS EMR or Databricks. This distributed paradigm introduced massive operational complexity: managing JVM configurations, allocating executors, tuning shuffle partitions, and paying a substantial &amp;quot;serialization tax&amp;quot; to move data across network sockets and language runtimes.&lt;/p&gt;
&lt;p&gt;Recently, the data engineering landscape has experienced a single-node renaissance. Rather than scaling out to distributed clusters, teams are scaling up on single machines. Modern laptops ship with 12 or more CPU cores, fast NVMe SSDs capable of multi-gigabyte-per-second read throughput, and up to 128 GB of RAM. Cloud providers offer single virtual machines with hundreds of cores and terabytes of memory for a fraction of the cost of a Kubernetes or Spark cluster.&lt;/p&gt;
&lt;p&gt;This physical hardware evolution is only half the story. The true catalyst is a new generation of data technologies built on Apache Arrow, vectorized execution, and out-of-core memory management. Tools like DuckDB, Apache Arrow DataFusion, Polars, and LakeSail enable a single laptop or VM to process hundreds of gigabytes: and even terabytes, of data. You can now execute complex analytical pipelines locally or on a single node without the overhead of a distributed JVM runtime.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/single-node-data-engineering/single-node-ecosystem.png&quot; alt=&quot;Architecture diagram showing the single-node data engineering ecosystem from local laptops to single-node engines querying S3&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Core Foundations: Columnar Memory and Apache Arrow&lt;/h2&gt;
&lt;p&gt;To understand how single-node data engineering can process datasets that previously required hundreds of cluster nodes, you must look at how data is structured in memory.&lt;/p&gt;
&lt;p&gt;Traditional databases and processing runtimes designed for transactional workloads (OLTP) use row-oriented layouts. They store all fields of a single record contiguously in memory: &lt;code&gt;[User_ID, Age, Name]&lt;/code&gt;, followed by the next record. When executing analytical queries (OLAP) that only target a subset of columns (such as calculating the average age of users), a row-oriented engine must scan the entire record structure from memory. This process loads irrelevant data (like names and IDs) into the CPU&apos;s L1/L2 caches, leading to cache pollution and wasted memory bandwidth.&lt;/p&gt;
&lt;p&gt;Columnar query engines solve this inefficiency by storing data contiguously by column: &lt;code&gt;[Age, Age, Age]&lt;/code&gt; in one buffer, and &lt;code&gt;[Name, Name, Name]&lt;/code&gt; in another. The CPU only reads the specific columns required by the query.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Row-Oriented Layout (OLTP):
┌──────────────────────────────┬──────────────────────────────┐
│ ID 1 │ Age 1 │ Name 1        │ ID 2 │ Age 2 │ Name 2        │
└──────────────────────────────┴──────────────────────────────┘

Columnar Layout (Arrow/OLAP):
┌──────────┬──────────┐ ┌──────────┬──────────┐ ┌──────────┬──────────┐
│ ID 1     │ ID 2     │ │ Age 1    │ Age 2    │ │ Name 1   │ Name 2   │
└──────────┴──────────┘ └──────────┴──────────┘ └──────────┴──────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Apache Arrow standardizes this columnar memory layout. It defines an open-source, language-independent specification for in-memory columnar data. By establishing a shared memory format, Arrow eliminates the serialization tax that historically slowed down data pipelines.&lt;/p&gt;
&lt;p&gt;In traditional architectures, passing data between a Python script and a Java or C++ engine required serializing the data into a byte stream (like JSON or Protobuf) and deserializing it on the other side. This serialization tax frequently consumed up to 80% of the total query execution time.&lt;/p&gt;
&lt;p&gt;Arrow enables zero-copy Inter-Process Communication (IPC). Because Arrow represents data in memory exactly the same way across Python, Rust, and C++, different processes can memory-map (mmap) the same physical memory buffers. An engine can pass a dataset to Python for machine learning or visualization by exchanging memory pointers. No bytes are copied, and no serialization occurs.&lt;/p&gt;
&lt;p&gt;Furthermore, Arrow&apos;s contiguous memory alignment matches the layout of modern CPU cache lines, making it straightforward to utilize Single Instruction, Multiple Data (SIMD) instruction sets (such as AVX-512 on Intel/AMD or Neon on ARM). SIMD allows the CPU to apply a single instruction (such as a filter comparison or an arithmetic addition) to a vector of data points in a single clock cycle. This hardware-level parallelism turns data processing from a memory-bound or CPU-bound bottleneck into an efficient operation running directly on the processor.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/single-node-data-engineering/arrow-in-memory-format.png&quot; alt=&quot;Comparison diagram showing the row-based layout versus Apache Arrow&apos;s columnar in-memory format and zero-serialization pointer exchange&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;In-Process SQL Powerhouse: DuckDB Architecture &amp;amp; Features&lt;/h2&gt;
&lt;p&gt;DuckDB has become the standard database engine for single-node SQL analytics. Designed as an in-process analytical database, DuckDB runs directly inside the host process (such as a Python interpreter or a CLI binary) rather than as a separate server daemon. This eliminates the network socket latency and IPC overhead of client-server databases like PostgreSQL or Snowflake.&lt;/p&gt;
&lt;p&gt;DuckDB&apos;s execution engine utilizes a vectorized query execution model. Rather than processing data one row at a time (the Volcano iterator model) or processing entire columns at once (which overflows L1/L2 caches for large tables), DuckDB processes data in small, cache-friendly vectors. These vectors typically contain 2048 elements.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Volcano Model:       [Row 1] ──► [Operator] ──► [Row 2] ──► [Operator]
Column-at-a-time:    [Entire Column (10M rows)] ──► [Operator] (Overflows Cache)
Vectorized Model:    [Vector of 2048 rows] ──► [L1/L2 CPU Cache] ──► [Operator]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By keeping these vectors small enough to fit inside the CPU&apos;s L1/L2 cache, DuckDB minimizes memory bandwidth bottlenecks. The CPU executes operations on the vectors using SIMD instructions, keeping the execution pipelines saturated with data.&lt;/p&gt;
&lt;p&gt;To handle datasets that exceed physical RAM, DuckDB implements out-of-core execution. When memory consumption reaches a user-defined limit, DuckDB&apos;s buffer manager automatically spills intermediate query states (such as hash join tables, sorting buffers, or aggregation states) to temporary disk files. This spilling mechanism uses a block-based buffer pool that page-faults data to disk, allowing you to run queries on datasets that are multiple times larger than your system&apos;s RAM.&lt;/p&gt;
&lt;p&gt;In the latest v1.5.3 release (May 2026), DuckDB has introduced several updates that expand its single-node utility:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Quack Remote Protocol:&lt;/strong&gt; DuckDB now ships with a core extension implementing the Quack protocol. This protocol allows users to run DuckDB in a client-server configuration when needed, facilitating remote attachments and remote query orchestration without losing the simplicity of the engine.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ecosystem and Format Updates:&lt;/strong&gt; The Iceberg extension has been upgraded to support &lt;code&gt;MERGE INTO&lt;/code&gt; operations, making it possible to execute complex delta updates on Iceberg tables directly from a local DuckDB session.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AWS Security and IRSA:&lt;/strong&gt; Native support for IAM Roles for Service Accounts (IRSA) has been added, simplifying secure S3 access when running DuckDB inside containerized single-node pipelines.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Static Linking:&lt;/strong&gt; The distribution now statically links &lt;code&gt;jemalloc&lt;/code&gt; on Linux platforms, improving memory allocation speed and reducing fragmentation during heavy out-of-core spilling.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The following Python script illustrates how to configure DuckDB&apos;s memory limits, register an S3 credential using the new AWS extension features, and run a query that spills to disk:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import duckdb

# Initialize DuckDB connection
con = duckdb.connect(database=&apos;local_cache.db&apos;)

# Set memory limit to force out-of-core spilling on smaller datasets
con.execute(&amp;quot;SET max_memory=&apos;8GB&apos;;&amp;quot;)
con.execute(&amp;quot;SET temp_directory=&apos;./duckdb_temp&apos;;&amp;quot;)

# Load S3 and AWS extensions (built-in in v1.5.3)
con.execute(&amp;quot;INSTALL aws;&amp;quot;)
con.execute(&amp;quot;LOAD aws;&amp;quot;)

# Autodetect AWS credentials from environment (supports IRSA)
con.execute(&amp;quot;CALL load_aws_credentials();&amp;quot;)

# Query a large Parquet dataset directly on S3 with predicate pushdown
# DuckDB only downloads the columns and row groups that match the filter
query = &amp;quot;&amp;quot;&amp;quot;
    SELECT
        user_id,
        COUNT(event_id) as event_count,
        AVG(session_duration) as avg_duration
    FROM read_parquet(&apos;s3://my-lakehouse/bronze/events/**/*.parquet&apos;)
    WHERE event_date &amp;gt;= &apos;2026-01-01&apos;
    GROUP BY user_id
    HAVING event_count &amp;gt; 1000
    ORDER BY avg_duration DESC
&amp;quot;&amp;quot;&amp;quot;

# Execute and stream results
result = con.execute(query).fetchdf()
print(result.head())
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;DuckDB&apos;s combination of SQL support, vectorized performance, and out-of-core stability makes it a core tool for local analytical workloads.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/single-node-data-engineering/duckdb-vectorized-architecture.png&quot; alt=&quot;DuckDB vectorized execution architecture showing chunked vector pipelines inside CPU cache and out-of-core spilling to SSD temp files&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Extensible Rust Processing: Apache Arrow DataFusion&lt;/h2&gt;
&lt;p&gt;While DuckDB is packaged as an analytical database, Apache Arrow DataFusion is designed as an extensible query engine framework. Written in Rust and utilizing Apache Arrow as its native memory format, DataFusion is widely used to build other databases, query engines, and custom data platforms (including Bauplan, Spice.ai, and LakeSail).&lt;/p&gt;
&lt;p&gt;DataFusion&apos;s design is modular. It decouples the query planning, optimization, and execution stages. If you are building a custom data tool, you can register custom catalogs, write user-defined logical optimization rules (like custom predicate pushdowns), or plug in custom physical execution nodes.&lt;/p&gt;
&lt;p&gt;For thread-level parallelism, DataFusion utilizes Rust&apos;s asynchronous Tokio runtime. Rather than pinning execution to a fixed number of threads, DataFusion distributes physical plan fragments (represented as asynchronous streams of Arrow &lt;code&gt;RecordBatch&lt;/code&gt; objects) across a Tokio worker thread pool. This allows the engine to adapt to multi-core architectures and avoid thread contention under heavy I/O loads.&lt;/p&gt;
&lt;p&gt;In the recent v53.x and v54.x releases (early-to-mid 2026), the DataFusion community has introduced several optimizations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Datetime Predicate Preimages:&lt;/strong&gt; DataFusion now optimizes queries containing datetime functions (like &lt;code&gt;date_trunc&lt;/code&gt; and &lt;code&gt;date_part&lt;/code&gt;) by evaluating their mathematical &amp;quot;preimages.&amp;quot; Instead of executing the datetime function on every row, the optimizer rewrites the filter predicate against the raw partition bounds, enabling partition pruning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sort Pushdown Phase 2:&lt;/strong&gt; The engine now sorts file groups by physical statistics before executing sort operators. If a set of Parquet files contains non-overlapping sorted ranges, DataFusion skips the global sort merge step, reducing planning and CPU execution times.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Null-Aware Anti-Joins:&lt;/strong&gt; Support has been optimized for null-aware anti-joins, which frequently occur in SQL queries containing &lt;code&gt;NOT IN&lt;/code&gt; clauses.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Variant Type Integration:&lt;/strong&gt; The planner has introduced initial support for the binary &lt;code&gt;VARIANT&lt;/code&gt; format, laying the groundwork for format-agnostic semi-structured data querying.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The following Rust code snippet demonstrates how to initialize a DataFusion context, register an in-memory Arrow table, and execute a query programmatically:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-rust&quot;&gt;use datafusion::prelude::*;
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::arrow::schema::{DataType, Field, Schema};
use datafusion::arrow::array::{Int32Array, StringArray};
use std::sync::Arc;

#[tokio::main]
async fn main() -&amp;gt; datafusion::error::Result&amp;lt;()&amp;gt; {
    // Create a local execution context
    let ctx = SessionContext::new();

    // Define a simple schema
    let schema = Arc::new(Schema::new(vec![
        Field::new(&amp;quot;id&amp;quot;, DataType::Int32, false),
        Field::new(&amp;quot;name&amp;quot;, DataType::Utf8, false),
    ]));

    // Create Arrow arrays
    let id_array = Int32Array::from(vec![1, 2, 3, 4, 5]);
    let name_array = StringArray::from(vec![&amp;quot;Alice&amp;quot;, &amp;quot;Bob&amp;quot;, &amp;quot;Charlie&amp;quot;, &amp;quot;David&amp;quot;, &amp;quot;Eve&amp;quot;]);

    // Build the record batch
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(id_array), Arc::new(name_array)],
    )?;

    // Register the record batch as an in-memory table
    ctx.register_batch(&amp;quot;users&amp;quot;, batch)?;

    // Execute SQL query
    let df = ctx.sql(&amp;quot;SELECT name FROM users WHERE id &amp;gt; 2&amp;quot;).await?;

    // Print the physical execution plan
    df.show().await?;

    Ok(())
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This library-first model makes DataFusion the preferred choice for teams building specialized, high-performance data systems.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/single-node-data-engineering/datafusion-rust-architecture.png&quot; alt=&quot;DataFusion extensible Rust architecture showing SQL/DataFrame inputs compiled into physical plans running on Arrow memory, with pluggable catalogs and custom execution nodes&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Vectorized DataFrames: Polars Eager &amp;amp; Lazy Pipelines&lt;/h2&gt;
&lt;p&gt;For developers working in Python, Rust, or JavaScript, DataFrames are the preferred API for data manipulation. While Pandas has been the standard in Python for a decade, it is single-threaded, has a high memory footprint (often requiring 5–10x the dataset size in RAM), and does not support query optimization.&lt;/p&gt;
&lt;p&gt;Polars is a Rust-native, Arrow-backed DataFrame library designed to replace Pandas. It is optimized for multi-core execution, utilizing a custom work-stealing CPU scheduler that distributes execution chunks across available cores.&lt;/p&gt;
&lt;p&gt;Polars offers two execution modes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Eager API:&lt;/strong&gt; Executes operations immediately, step-by-step, mimicking Pandas&apos; behavior. This mode is useful for interactive debugging in Jupyter Notebooks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lazy API:&lt;/strong&gt; Builds a logical Directed Acyclic Graph (DAG) representing the pipeline. When you call &lt;code&gt;.collect()&lt;/code&gt;, Polars passes the DAG through a query optimizer. The optimizer applies several rules:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Projection Pushdown:&lt;/strong&gt; Only reads the columns explicitly referenced in the query.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Predicate Pushdown:&lt;/strong&gt; Moves filter operations as close to the storage layer as possible (pushing them down into the Parquet reader).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Common Subexpression Elimination:&lt;/strong&gt; Identifies duplicate calculations and executes them once.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;Eager: Load File (All Columns) ──► Filter Rows ──► Select Columns
Lazy:  Query Planner ──► Push Filter &amp;amp; Select Into File Reader ──► Load File (Filtered &amp;amp; Pruned)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In 2026, the Polars team officially stabilized its streaming execution engine. This engine allows out-of-core DataFrame execution on datasets that exceed physical memory limits. The streaming engine now supports:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Streaming Merge and AsOf Joins:&lt;/strong&gt; Useful for temporal alignments (such as joining financial tick data or IoT sensor metrics).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streaming Aggregations:&lt;/strong&gt; Complex statistical calculations (including skew, kurtosis, and entropy) can now run in streaming mode.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Direct Cloud Sinks:&lt;/strong&gt; Polars can stream data directly back to storage formats like Delta Lake (&lt;code&gt;sink_delta&lt;/code&gt;) and Apache Iceberg (&lt;code&gt;sink_iceberg&lt;/code&gt;) without materializing the intermediate tables.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To enable the streaming engine, developers configure Polars to use the streaming execution path:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import polars as pl

# Enable streaming engine affinity globally
pl.Config.set_engine_affinity(&amp;quot;streaming&amp;quot;)

# Define a Lazy pipeline querying a folder of compressed CSVs
lazy_query = (
    pl.scan_csv(&amp;quot;./data/raw_metrics/*.csv&amp;quot;)
    .filter(pl.col(&amp;quot;metric_type&amp;quot;) == &amp;quot;cpu_utilization&amp;quot;)
    .with_columns(
        (pl.col(&amp;quot;metric_value&amp;quot;) * 100).alias(&amp;quot;percentage&amp;quot;)
    )
    .group_by([&amp;quot;host_id&amp;quot;, &amp;quot;timestamp&amp;quot;])
    .agg([
        pl.col(&amp;quot;percentage&amp;quot;).mean().alias(&amp;quot;mean_cpu&amp;quot;),
        pl.col(&amp;quot;percentage&amp;quot;).skew().alias(&amp;quot;skew_cpu&amp;quot;) # Uses new streaming aggregations
    ])
    .sort(&amp;quot;mean_cpu&amp;quot;, descending=True)
)

# Execute the query out-of-core using the streaming engine
# This will process files in batches, avoiding Out-Of-Memory (OOM) crashes
result_df = lazy_query.collect(streaming=True)
print(result_df.head())
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Polars&apos; combination of an expressive DataFrame API, lazy query optimization, and stabilized streaming makes it a powerful engine for Python and Rust developers.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/single-node-data-engineering/polars-lazy-evaluation.png&quot; alt=&quot;Polars query planning diagram showing Eager sequential execution vs Lazy DAG optimization pathways with projection and predicate pushdowns&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Comparative Analysis: Evaluating Single-Node Engines&lt;/h2&gt;
&lt;p&gt;Choosing the right tool requires evaluating their architectural differences and primary API surfaces:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;DuckDB&lt;/th&gt;
&lt;th&gt;Apache Arrow DataFusion&lt;/th&gt;
&lt;th&gt;Polars&lt;/th&gt;
&lt;th&gt;LakeSail (Sail)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Language&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;C++&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API Types&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SQL, Python, R, Node.js, C++&lt;/td&gt;
&lt;td&gt;SQL, DataFrame (Rust/Python)&lt;/td&gt;
&lt;td&gt;DataFrame (Python/Rust/JS)&lt;/td&gt;
&lt;td&gt;PySpark, Spark Connect SQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Native Memory Format&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom Vector / Arrow IPC&lt;/td&gt;
&lt;td&gt;Apache Arrow&lt;/td&gt;
&lt;td&gt;Apache Arrow&lt;/td&gt;
&lt;td&gt;Apache Arrow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vectorization Pattern&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fixed-size Vectors (2048 rows)&lt;/td&gt;
&lt;td&gt;Arrow RecordBatches&lt;/td&gt;
&lt;td&gt;Contiguous Arrow arrays&lt;/td&gt;
&lt;td&gt;Arrow RecordBatches&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Out-of-Core Method&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Buffer Pool Disk Spilling&lt;/td&gt;
&lt;td&gt;Streaming RecordBatch execution&lt;/td&gt;
&lt;td&gt;Streaming Engine (Lazy API)&lt;/td&gt;
&lt;td&gt;DataFusion-backed streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Use Case&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SQL-first analytical queries&lt;/td&gt;
&lt;td&gt;Query engine library / framework&lt;/td&gt;
&lt;td&gt;Dataframe transformations&lt;/td&gt;
&lt;td&gt;JVM-free PySpark execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup Complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Very Low (single binary/import)&lt;/td&gt;
&lt;td&gt;Moderate (library setup)&lt;/td&gt;
&lt;td&gt;Low (&lt;code&gt;pip install polars&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Low (&lt;code&gt;pip install pysail&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Key Tradeoffs to Consider&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;API Choice:&lt;/strong&gt; If your team writes standard SQL, DuckDB is the logical starting point. If you write procedural code, Polars&apos; expression language is more expressive and easier to parallelize than SQL.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Extensibility vs. Out-of-the-Box Utility:&lt;/strong&gt; DuckDB and Polars are complete user-facing applications. DataFusion is an engine framework. You use DataFusion if you are building a custom database or need to modify how the physical query execution layer functions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory Footprint:&lt;/strong&gt; DataFusion and Polars generally maintain a lower memory footprint than DuckDB for in-memory operations due to Rust&apos;s memory management model and direct mapping to Arrow structures. However, DuckDB&apos;s buffer manager is more mature for highly complex queries that require massive disk spilling.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;Zero-JVM Spark: High-Performance Pipelines with LakeSail&lt;/h2&gt;
&lt;p&gt;For teams with existing data codebases, the primary blocker to adopting single-node tools is the legacy API footprint. Many organizations have thousands of lines of Apache Spark code written in PySpark. Rewriting these pipelines to DuckDB SQL or Polars DataFrames is expensive and introduces validation risks.&lt;/p&gt;
&lt;p&gt;Historically, running PySpark locally required spinning up a local Spark cluster. This cluster runs on the Java Virtual Machine (JVM), which introduces significant configuration complexity and memory overhead. A default local Spark session can easily consume 4 GB of RAM just to start, even when processing a 10 MB CSV file.&lt;/p&gt;
&lt;p&gt;Furthermore, PySpark operates via a Py4J gateway bridge. When your PySpark code calls a Python User-Defined Function (UDF), the data must be serialized, sent from the JVM to a Python worker process, processed, serialized again, and sent back to the JVM. This cross-process serialization tax makes Python UDF execution in Spark slow.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Traditional PySpark UDF Path:
[JVM Executor] ──(Serialize via Py4J)──► [Python Worker] ──► [Run UDF] ──(Serialize)──► [JVM Executor]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;LakeSail&lt;/strong&gt; (specifically the open-source &lt;strong&gt;Sail&lt;/strong&gt; engine) solves this constraint. Sail is a Rust-native, JVM-free compute engine designed as a drop-in replacement for Apache Spark. It implements the Spark Connect protocol, allowing existing PySpark and Spark SQL applications to run unmodified by connecting to a Sail server over gRPC.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;LakeSail PySpark Connect Path:
[PySpark Session] ──(Spark Connect gRPC Logical Plan)──► [LakeSail Rust Server] ──► [DataFusion Physical Execution]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Under the hood, Sail replaces Spark&apos;s JVM-based Catalyst optimizer and Tungsten execution engine with Apache DataFusion and Apache Arrow. This architecture provides several advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Zero JVM Overhead:&lt;/strong&gt; Sail starts in milliseconds and has a negligible idle memory footprint. You can run Spark code on small single-core VMs or local laptops.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Zero-Copy Python UDF Execution:&lt;/strong&gt; Sail embeds a Python interpreter directly into its Rust binary using PyO3. When executing a Python UDF, Sail passes pointers to the Arrow memory buffers directly to the Python interpreter. The UDF executes in-process without serialization, eliminating the cross-process Py4J bottleneck.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Native Open Formats:&lt;/strong&gt; Sail includes native Rust-based support for Delta Lake, Apache Iceberg, and Parquet, integrating directly with AWS Glue, Unity Catalog, and Polaris REST catalogs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To run your PySpark pipelines against a local Sail session, you install the packages and point the session builder to the local Sail gRPC port:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Install pysail and PySpark client supporting Spark Connect
pip install pysail pyspark
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Start the Sail server from your terminal:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Start local Sail gRPC server on port 50051
sail spark server --port 50051
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In your Python code, connect the &lt;code&gt;SparkSession&lt;/code&gt; to the local Sail server using the standard remote connection string:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

# Connect to the local Sail Rust-native server over Spark Connect protocol
spark = SparkSession.builder \
    .remote(&amp;quot;sc://localhost:50051&amp;quot;) \
    .getOrCreate()

# Load a local Parquet dataset using standard Spark DataFrame API
df = spark.read.parquet(&amp;quot;./data/raw_orders&amp;quot;)

# Define a standard Python UDF
@udf(returnType=IntegerType())
def calculate_tax(amount):
    # This runs in-process via Sail&apos;s PyO3 integration
    # Zero serialization tax is paid between Rust and Python
    return int(amount * 0.08)

# Execute transformations and show results
processed_df = df.filter(col(&amp;quot;status&amp;quot;) == &amp;quot;COMPLETED&amp;quot;) \
                 .withColumn(&amp;quot;tax&amp;quot;, calculate_tax(col(&amp;quot;total_amount&amp;quot;)))

processed_df.show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By keeping the Spark API surface while replacing the execution engine, LakeSail allows teams to modernize their legacy PySpark pipelines and run them on single nodes without the overhead of a JVM.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/single-node-data-engineering/lakesail-spark-connect.png&quot; alt=&quot;LakeSail Spark Connect architecture showing PySpark client communicating over gRPC to a Rust-native Spark Connect server with DataFusion and PyO3 embedded UDF zero-copy memory buffers&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Threshold of Scale: When Does Single-Node Break?&lt;/h2&gt;
&lt;p&gt;While single-node data engineering has expanded the scale of data that can be processed on a single machine, it is not a silver bullet. At a certain point, physical resource constraints make single-node architectures impractical.&lt;/p&gt;
&lt;p&gt;The primary bottleneck is I/O. During out-of-core execution, spilling data to disk shifts the bottleneck from memory capacity to disk read/write bandwidth. Even on fast NVMe SSDs, writing and reading hundreds of gigabytes of intermediate join tables or sorting buffers introduces latency. If a query spends more time reading and writing temporary blocks to disk than it does executing CPU cycles, the system is I/O-bound.&lt;/p&gt;
&lt;p&gt;The second bottleneck is query planning and CPU execution scaling. If your query must scan multiple terabytes of data, even a vectorized engine running on 64 cores will take minutes to complete the scan. If your business SLAs require sub-second or low-second query latencies, you need to distribute the scanning and processing work across multiple machines in parallel.&lt;/p&gt;
&lt;p&gt;The third bottleneck is organizational concurrency. If a single VM hosts your analytical database, and hundreds of analysts or BI dashboards query it simultaneously, the CPU cores will experience thread starvation, and lock contention will slow execution times for all users.&lt;/p&gt;
&lt;p&gt;To guide your architectural transitions, use the following operational decision framework:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Single-Node Range&lt;/th&gt;
&lt;th&gt;MPP Transition Trigger&lt;/th&gt;
&lt;th&gt;Distributed MPP Target&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compressed Data Volume&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 100 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;gt; 500 GB – 1 TB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-TB to Petabytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Target Query Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Minutes (OK for batch/ad-hoc)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt; 3 – 5 Seconds&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sub-second interactive BI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Concurrent Users / Queries&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 5–10 concurrent sessions&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;gt; 20+ concurrent queries&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hundreds of concurrent dashboards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Topology&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Local files or single S3 bucket&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Federated across multiple sources&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lakehouses, warehouses, transactional DBs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/single-node-data-engineering/scale-threshold-matrix.png&quot; alt=&quot;Performance-cost threshold graph showing single-node vs MPP execution efficiency zones based on data scale&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The MPP Landscape: Scaling to Spark, Dremio, Bauplan, SpiceAI, and MotherDuck&lt;/h2&gt;
&lt;p&gt;When your data scale, latency requirements, or concurrency needs exceed single-node limits, you must transition to a distributed MPP (Massively Parallel Processing) architecture. The modern MPP landscape offers several pathways, depending on your workflow patterns.&lt;/p&gt;
&lt;h3&gt;MotherDuck (Dual Execution)&lt;/h3&gt;
&lt;p&gt;For teams who want to scale their DuckDB workloads to the cloud without managing infrastructure, MotherDuck provides a serverless platform built on DuckDB.&lt;/p&gt;
&lt;p&gt;MotherDuck&apos;s core architectural pattern is &lt;strong&gt;Dual Execution&lt;/strong&gt; (formerly hybrid execution). When you submit a query, MotherDuck&apos;s query planner evaluates the locations of the datasets. It splits the query plan: executing parts of the query locally on your laptop CPU using local cached data, and executing other parts on MotherDuck&apos;s cloud compute nodes (for cloud-hosted Parquet or Iceberg tables). The engine joins these streams dynamically using specialized &amp;quot;bridge&amp;quot; operators.&lt;/p&gt;
&lt;p&gt;In early 2026, MotherDuck added a native &lt;strong&gt;PostgreSQL wire protocol endpoint&lt;/strong&gt;. This allows BI tools and legacy applications to connect directly to MotherDuck using standard PostgreSQL drivers, eliminating the need to install the DuckDB runtime on the client machine. Additionally, MotherDuck features &lt;strong&gt;Pulse (serverless)&lt;/strong&gt; billing with one-second increments and &lt;strong&gt;DuckLake&lt;/strong&gt; integration for scaling storage to the petabyte range.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/single-node-data-engineering/motherduck-hybrid-execution.png&quot; alt=&quot;MotherDuck Dual Execution model showing how queries are split by a Hybrid Planner between local laptop CPUs and MotherDuck Cloud Engines&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Bauplan (Serverless Python Pipelines)&lt;/h3&gt;
&lt;p&gt;For data engineers building pipeline workflows on Apache Iceberg, Bauplan provides a serverless, &amp;quot;zero-infrastructure&amp;quot; execution engine.&lt;/p&gt;
&lt;p&gt;Instead of managing Spark or Kubernetes clusters to run scheduled data transformations, you define your pipeline steps as standard Python or SQL functions. Bauplan spins up stateless, ephemeral compute on-demand to execute the code and shuts down immediately after, utilizing a pay-per-invocation model.&lt;/p&gt;
&lt;p&gt;Bauplan integrates Apache Iceberg with Project Nessie, providing a &amp;quot;Git-for-data&amp;quot; experience. Developers and AI agents can create isolated branches of the lakehouse, run experimental Python pipelines to verify changes, and merge the updates atomically back into production without risking data corruption or paying for idle staging compute.&lt;/p&gt;
&lt;h3&gt;Spice.ai (Federated Query Acceleration)&lt;/h3&gt;
&lt;p&gt;Spice.ai (SpiceAI) targets the data access layer for high-performance applications and AI agents. It functions as a federated query runtime that accelerates slow data queries by materializing &amp;quot;hot&amp;quot; data sets locally.&lt;/p&gt;
&lt;p&gt;Spice.ai implements a tiered caching model. It caches query results in-memory and caches active working sets of data in high-performance local engines like DuckDB or Cayenne (a native columnar engine).&lt;/p&gt;
&lt;p&gt;In its recent v2.0 updates, Spice.ai introduced a &lt;strong&gt;prefix-aware list-files cache&lt;/strong&gt; that speeds up data lake scans, a &lt;strong&gt;statistics cache&lt;/strong&gt; for file metadata, and native Change Data Capture (CDC) syncing that streams updates from databases (like PostgreSQL WAL streams) directly into the local acceleration cache. This keeps the local cached tables updated in real-time without requiring complex Kafka or Debezium setups.&lt;/p&gt;
&lt;h3&gt;Dremio (Distributed MPP Lakehouse Platform)&lt;/h3&gt;
&lt;p&gt;For enterprise-scale BI, multi-source data federation, and semantic layer management, Dremio serves as the central engine of the lakehouse.&lt;/p&gt;
&lt;p&gt;Dremio is built from the ground up on Apache Arrow, eliminating the serialization tax entirely. When Dremio queries data, the physical execution plan processes memory structures natively in Arrow columnar format and streams results to clients (such as Python scripts or BI tools) using Arrow Flight.&lt;/p&gt;
&lt;p&gt;Dremio achieves sub-second performance on massive cloud data lakes through three architectural layers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Columnar Cloud Cache (C3):&lt;/strong&gt; Automatically caches data blocks from object storage (like AWS S3 or Azure ADLS) onto local NVMe drives at execution nodes, turning remote cloud I/O into local disk read speeds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reflections:&lt;/strong&gt; Dremio&apos;s query planner automatically and transparently substitutes physically optimized, pre-computed Iceberg materializations to accelerate user queries. As of Dremio v26, Reflections store data exclusively in Iceberg format, deprecating legacy formats to streamline the storage path. Dremio&apos;s &lt;strong&gt;Autonomous Reflections&lt;/strong&gt; use AI to observe query patterns over a rolling 7-day window, automatically creating, updating, and dropping Reflections to maintain optimal dashboard performance without manual administration.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open Catalog (Powered by Apache Polaris):&lt;/strong&gt; Dremio&apos;s built-in catalog is built on Apache Polaris, which graduated to a top-level Apache project in 2026. The Open Catalog implements the Apache Iceberg REST specification, allowing other engines (like Spark or Flink) to query the same tables securely. It provides Fine-Grained Access Control (FGAC) including column-masking and row-level filtering.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dremio&apos;s &lt;strong&gt;AI Semantic Layer&lt;/strong&gt; allows teams to define virtual datasets (views) once and reuse them across all BI and AI applications. This layer embeds descriptions, wikis, and tags directly onto columns and datasets. The semantic layer teaches AI models the business context of your data, allowing AI agents to generate correct, governed SQL queries rather than hallucinating generic code. Dremio also embeds generative AI features to auto-generate wiki descriptions and suggest tags based on schema patterns.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/single-node-data-engineering/dremio-mpp-acceleration.png&quot; alt=&quot;Dremio MPP query engine architecture showing Columnar Cloud Cache on NVMe, Iceberg-based Autonomous Reflections, Open Catalog powered by Polaris, and Arrow Flight client streaming&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Architectural Selection Framework and Conclusion&lt;/h2&gt;
&lt;p&gt;Modern data engineering is no longer about choosing between a local script and a massive cluster. It is about matching your toolchain to your data volume, latency SLAs, and organizational needs.&lt;/p&gt;
&lt;p&gt;To guide your selection, follow this decision tree:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Is your workload running locally or on a single node?&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;If you prefer writing SQL for analytical queries:&lt;/em&gt; Use &lt;strong&gt;DuckDB&lt;/strong&gt;. It requires zero configuration and handles larger-than-memory data via out-of-core spilling.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;If you are writing procedural Python or Rust DataFrame pipelines:&lt;/em&gt; Use &lt;strong&gt;Polars&lt;/strong&gt;. Its lazy optimizer and stabilized streaming engine provide rapid execution.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;If you have legacy PySpark or Spark SQL code but want to avoid JVM overhead:&lt;/em&gt; Use &lt;strong&gt;LakeSail&lt;/strong&gt;. It executes Spark Connect gRPC logical plans natively in Rust.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;If you are building a custom query engine or analytical tool:&lt;/em&gt; Use &lt;strong&gt;Apache Arrow DataFusion&lt;/strong&gt; as your modular compiler framework.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Does your workload exceed single-node capabilities (multi-TB scale, high concurrency, or cross-source BI)?&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;If you want a serverless, hybrid extension of your DuckDB SQL code:&lt;/em&gt; Use &lt;strong&gt;MotherDuck&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;If you need to build serverless Python pipelines directly on Iceberg with Git-like version control:&lt;/em&gt; Use &lt;strong&gt;Bauplan&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;If you need to cache and accelerate federated data for local AI/RAG applications:&lt;/em&gt; Use &lt;strong&gt;Spice.ai&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;If you need enterprise-scale BI, semantic governance, multi-source federation, and sub-second SQL queries on Iceberg:&lt;/em&gt; Use &lt;strong&gt;Dremio&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/single-node-data-engineering/architectural-decision-tree.png&quot; alt=&quot;Flowchart decision tree helping engineers select the correct analytical engine based on workload and scale&quot;&gt;&lt;/p&gt;
&lt;p&gt;Single-node data technologies have shifted the boundary of what is possible on a single machine. By utilizing Apache Arrow for zero-copy memory layouts, compilers like DataFusion, and vectorized execution engines, you can process workloads that previously required a complex distributed cluster.&lt;/p&gt;
&lt;p&gt;As you design your next data platform, start by evaluating if your workload can run on a single node. Modern columnar engines let you build, test, and run pipelines with minimal infrastructure complexity. When your data scale or organizational concurrency requires a distributed architecture, transition incrementally using open standards like Apache Iceberg and Apache Arrow.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Accelerate Your Lakehouse Skills&lt;/h3&gt;
&lt;p&gt;To deepen your understanding of modern data architectures, consider the following next steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Read Lakehouse Reference Materials:&lt;/strong&gt; Explore &lt;strong&gt;&amp;quot;Architecting an Apache Iceberg Lakehouse&amp;quot;&lt;/strong&gt; and other technical publications that cover partition tuning, catalog design, and query optimization at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Build Your Own Local Pipeline:&lt;/strong&gt; Start by downloading &lt;code&gt;pysail&lt;/code&gt; or &lt;code&gt;polars&lt;/code&gt; and testing them against a local Parquet dataset. Compare the query planning time and CPU memory footprint against your existing frameworks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evaluate Dremio Cloud:&lt;/strong&gt; If your local query engines are hitting limits or you need to federate data across multiple sources, deploy Dremio directly on your S3 data lake. Try Dremio Cloud free for 30 days at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>An In-Depth Overview of the Apache Iceberg 1.11.0 Release</title><link>https://iceberglakehouse.com/posts/2026-05-23-apache-iceberg-1-11-0-deep-dive/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-23-apache-iceberg-1-11-0-deep-dive/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-05-apache-iceberg-1-11-0-deep-dive/...</description><pubDate>Sat, 23 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-05-apache-iceberg-1-11-0-deep-dive/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Apache Iceberg 1.11.0 was officially released on May 19, 2026, marking a major milestone in the evolution of open data lakehouse architectures. While minor point releases often focus on small bug fixes and dependency bumps, this release introduces fundamental structural changes. The community has completed major initiatives to improve security, extend file format capabilities, and optimize query planning overhead.&lt;/p&gt;
&lt;p&gt;This release represents a convergence of two development focuses. First, it introduces structural changes to the core metadata specification to support advanced security features and lay the groundwork for future format revisions. Second, it stabilizes several feature sets in the Iceberg format specification, moving them from experimental status to fully stable defaults.&lt;/p&gt;
&lt;p&gt;To understand the context of this release, it helps to review the history of the Apache Iceberg specification. The V1 specification focused on the core foundations of the data lake: defining metadata schemas, enabling basic schema evolution, and introducing hidden partitioning to eliminate directory-based partition layouts. The V2 specification, which has been the production standard for several years, introduced row-level delete support through copy-on-write (COW) and merge-on-read (MOR) operations. The V3 specification, which reaches production maturity in the 1.11.0 release, focuses on optimizing read paths, securing metadata, and standardizing complex data types like semi-structured records and spatial coordinates.&lt;/p&gt;
&lt;p&gt;This post analyzes the most critical improvements in the Apache Iceberg 1.11.0 release. We will examine the specific GitHub pull requests, explain the underlying mechanics of each feature, and review what these changes mean for data engineers and platform architects.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/iceberg-1-11-0/release-pillar-overview.png&quot; alt=&quot;Apache Iceberg 1.11.0 release overview diagram showing Security, Catalog, Storage, and Engine pillars&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Manifest List Encryption (PR #7770, #15813)&lt;/h2&gt;
&lt;p&gt;Security in open data lakehouses has historically focused on encrypting the actual data files stored in object storage. While file-level encryption prevents unauthorized users from reading raw Parquet or ORC data, the table metadata has remained exposed. In a default setup, anyone with read access to the storage bucket could inspect the metadata JSON, manifest lists, and manifest files.&lt;/p&gt;
&lt;p&gt;These metadata files contain sensitive structural details. An attacker scanning an unencrypted manifest list can extract file paths, column names, partitions, partition bounds, and exact null value counts. In highly regulated industries such as healthcare or financial services, this structural exposure constitutes a major data leak.&lt;/p&gt;
&lt;p&gt;To resolve this vulnerability, PR #7770, introduced by @ggershinsky, adds native encryption for manifest lists. This change works alongside follow-up improvements in PR #15813. Manifest lists can now be encrypted using the Galois/Counter Mode (GCM) stream cipher.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Metadata JSON (Contains encryption state references)
       │
       ▼
Manifest List (Encrypted via GCM Stream Cipher) ◄── Decrypted in-memory during planning
       │
       ▼
Manifest Files (Point to encrypted Parquet data files)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The table encryption configuration can be defined during table creation or updated via table properties:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;encryption.kms.impl&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;None&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;The fully qualified class name of the Key Management Service client.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;encryption.kms.key-id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;None&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;The master key identifier used to encrypt data encryption keys (DEKs).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;encryption.gcm.key-length&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;256&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The length of the encryption key in bits (128, 192, or 256).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;When a query engine plans a scan against an encrypted table, it performs the following sequence:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The client queries the catalog to fetch the table metadata.&lt;/li&gt;
&lt;li&gt;The catalog returns the metadata location along with the required decryption keys.&lt;/li&gt;
&lt;li&gt;The query engine reads the encrypted manifest list from object storage.&lt;/li&gt;
&lt;li&gt;Using the catalog keys, the engine decrypts the manifest list in-memory.&lt;/li&gt;
&lt;li&gt;The engine processes the decrypted partitions and statistics to prune manifest files.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The choice of the GCM cipher is technically significant. In traditional Block Cipher Chaining (CBC) modes, decryption must occur sequentially from the beginning of the file, which adds latency. In contrast, GCM allows parallelized, seek-aware random-access decryption. This capability is critical for query engines during planning: the engine can read and decrypt only the specific blocks of the manifest list it needs to plan the query, avoiding the overhead of decrypting the entire file.&lt;/p&gt;
&lt;p&gt;This approach implements a model of envelope encryption: each metadata file is encrypted with a unique data encryption key (DEK), and these DEKs are encrypted using the table&apos;s master key managed by the Key Management Service (KMS). Even if an attacker gains raw access to the storage bucket, they find only encrypted bytes, protecting both the table contents and its structural metadata.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/iceberg-1-11-0/manifest-list-encryption-flow.png&quot; alt=&quot;Manifest list encryption sequence showing key exchange and decryption query planning&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Pluggable File Format API and V4 Spec Foundations (PR #15049)&lt;/h2&gt;
&lt;p&gt;Historically, Apache Iceberg hardcoded its support for data file formats. The core library contained format-specific code paths for Parquet, ORC, and Avro. If you wanted to query or write a table, the engine executed internal code blocks tailored to those exact structures.&lt;/p&gt;
&lt;p&gt;This hardcoded design created a major bottleneck for format innovation. If a team wanted to test a next-generation format, they had to modify the core engine codebase, extending complex switch statements and format-dependent utilities.&lt;/p&gt;
&lt;p&gt;PR #15049, introduced by @anoopj, restructures this architecture. It introduces a pluggable File Format API that decouples Iceberg core metadata management from physical storage layouts.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;┌────────────────────────────────────────────────────────┐
│                  Iceberg Core Engine                   │
└───────────────────────────┬────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│            File Format API Interface Layer             │
└──────┬────────────┬─────────────┬─────────────┬────────┘
       │            │             │             │
       ▼            ▼             ▼             ▼
  ┌─────────┐  ┌─────────┐   ┌─────────┐   ┌─────────┐
  │ Parquet │  │   ORC   │   │ Vortex  │   │  Lance  │
  └─────────┘  └─────────┘   └─────────┘   └─────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The File Format API provides a clean plugin interface. A file format is defined as a plugin that implements standard reader and writer interfaces. Iceberg core negotiates table transactions, schemas, and partition specs, while delegating the physical file access to the registered plugin.&lt;/p&gt;
&lt;p&gt;This decoupling makes it practical to support next-generation formats:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Vortex:&lt;/strong&gt; A general-purpose, modular format designed as a successor to Parquet. It is optimized for high-performance analytics, utilizing fixed-width columns with bitmap masks for nulls. This enables Single Instruction Multiple Data (SIMD) filtering directly on memory-mapped files without CPU decompression cycles. The community is actively using the new API to build a Vortex-backed Iceberg plugin.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lance:&lt;/strong&gt; A layout built for machine learning and AI workloads. It is optimized for high-dimensional vector search and random access to nested embeddings, implementing index structures such as Inverted File with Product Quantization (IVF-PQ) directly in the file format to enable fast query planning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Nimble:&lt;/strong&gt; A format optimized for wide tables containing thousands of feature columns. Nimble prioritizes fast decoding over high compression ratios, opting for lightweight run-length and bit-packing compression schemes. This reduces the CPU overhead of ML training loops that consume millions of rows per second.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Additionally, PR #15049 introduces the foundational Java interfaces and types for the upcoming V4 manifest specification. These changes prepare Iceberg for format-agnostic manifest storage, ensuring the metadata layer can scale to tables with millions of files without hitting Java memory overhead limits.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/iceberg-1-11-0/v4-manifest-foundations.png&quot; alt=&quot;Pluggable File Format API architecture decoupling Iceberg core from format plugins&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;REST Client Protocols and Extended Headers (PR #12194)&lt;/h2&gt;
&lt;p&gt;The REST Catalog protocol has become the standard interface for managing Iceberg tables across multiple processing engines. It isolates clients from catalog catalog details and provides a unified API for schema management, snapshot commits, and credential vending.&lt;/p&gt;
&lt;p&gt;However, as deployments scale inside large enterprises, catalogs need to process custom client context. For example, a platform team might want to track which business unit submitted a query, pass custom security tokens, or inject correlation IDs for distributed tracing. In previous versions, the standard &lt;code&gt;RESTClient&lt;/code&gt; did not allow clients to send custom HTTP headers.&lt;/p&gt;
&lt;p&gt;PR #12194, written by @gaborkaszab, solves this constraint by extending header support inside the &lt;code&gt;RESTClient&lt;/code&gt; implementations.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;┌────────────────────────────────┐
│      Iceberg REST Client       │
│  (Spark, Flink, Trino, etc.)   │
└───────────────┬────────────────┘
                │
                │  POST /v1/namespaces/db/tables/events
                │  Custom-Headers:
                │    - X-Trace-Id: trace-98421
                │    - X-Tenant-Id: finance-billing
                │
                ▼
┌────────────────────────────────┐
│      REST Catalog Server       │
│  (Parses headers for auditing)  │
└────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this update, client engines can configure and inject custom headers into every REST call. The client-server handshake follows this sequence:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The client initializes the REST catalog using the properties map.&lt;/li&gt;
&lt;li&gt;The client specifies static custom headers using the prefix &lt;code&gt;header.custom.&lt;/code&gt;:&lt;pre&gt;&lt;code class=&quot;language-properties&quot;&gt;header.custom.X-Tenant-Id=finance-billing
header.custom.X-Trace-Id=system-trace-99
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;During request execution, the &lt;code&gt;RESTClient&lt;/code&gt; intercepts the HTTP call and injects these custom headers.&lt;/li&gt;
&lt;li&gt;The REST catalog server processes the headers to apply dynamic authorization, audit logging, or request routing.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This change enables the following capabilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Auditing and Governance:&lt;/strong&gt; Engines can pass tenant identifiers or user profiles in the HTTP headers, allowing the REST catalog server to log catalog operations with full user context.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Distributed Tracing:&lt;/strong&gt; Tracing headers such as W3C Trace Context can propagate from client engines through the catalog server, providing end-to-end trace visibility for query planning operations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dynamic Authorization:&lt;/strong&gt; Clients can send custom authorization tokens that the REST catalog server evaluates dynamically to enforce fine-grained access control.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The properties are configured during catalog initialization using the standard configuration map, making it simple to roll out headers across existing query platforms.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/iceberg-1-11-0/rest-client-headers.png&quot; alt=&quot;Extended header propagation between Iceberg client and REST Catalog server&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Overwrite-Aware Table Registration (PR #15525)&lt;/h2&gt;
&lt;p&gt;In multi-tenant data platforms, multiple engines frequently access and modify the same table metadata. When registering a new table or importing an existing table state into the catalog, concurrency conflicts can occur.&lt;/p&gt;
&lt;p&gt;If two separate processes attempt to register or overwrite a table reference at the same location simultaneously, a naive catalog might register the second request, silently overwriting the first. This creates data inconsistencies where the catalog points to an outdated or incorrect &lt;code&gt;metadata.json&lt;/code&gt; file.&lt;/p&gt;
&lt;p&gt;PR #15525, written by @sririshindra, adds overwrite-aware table registration to the catalog API.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Writer 1: Commits events_v1 ────► [Catalog Table Pointer] ◄──── Writer 2: Commits events_v2
                                            │
                                            ├────────► If conflict: Catalog rejects Writer 2
                                            └────────► Prevents silent metadata overwrites
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This implementation leverages Optimistic Concurrency Control (OCC) at the catalog level. The conflict resolution sequence proceeds as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Writer A and Writer B both read the current table state pointing to snapshot v1.&lt;/li&gt;
&lt;li&gt;Writer A writes new data files, generating metadata version &lt;code&gt;metadata_v2.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Writer B writes new data files in parallel, generating metadata version &lt;code&gt;metadata_v3.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Writer A calls the catalog&apos;s &lt;code&gt;/v1/namespaces/db/tables/events/register&lt;/code&gt; endpoint, stating that the expected base location is &lt;code&gt;metadata_v1.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The catalog verifies the base matches, registers the new pointer to &lt;code&gt;metadata_v2.json&lt;/code&gt;, and updates the table version.&lt;/li&gt;
&lt;li&gt;Writer B attempts to register its state, listing &lt;code&gt;metadata_v1.json&lt;/code&gt; as its expected base.&lt;/li&gt;
&lt;li&gt;The catalog detects that the current pointer is now &lt;code&gt;metadata_v2.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The catalog rejects Writer B&apos;s request, returning a HTTP 409 Conflict. Writer B must re-read the updated table state, resolve any overlapping partition commits, and retry the registration.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This validation ensures that table registration is safe and prevents silent metadata overwrites in highly active environments.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/iceberg-1-11-0/overwrite-aware-registration.png&quot; alt=&quot;Flowchart of table registration verifying catalog overwrite state and rejecting transaction on conflicts&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Deletion Vector Pruning in Snapshot Validation (PR #15653)&lt;/h2&gt;
&lt;p&gt;One of the major highlights of the V3 format specification is the stabilization of deletion vectors. Deletion vectors improve row-level delete performance by replacing positional delete files with Roaring bitmaps. Instead of writing a new delete file for every minor update, the engine updates a binary bitmap linked directly to the data file.&lt;/p&gt;
&lt;p&gt;These deletion bitmaps are stored in the Puffin file format. You can inspect active deletion vector locations using metadata system tables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT file_path, pos, row_position, deletion_vector
FROM TABLE(table_files(&apos;my_catalog.schema.events&apos;));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;However, as tables grow to hold millions of data files, validating these deletion vectors during query planning can introduce latency. During scan planning, the query engine must ensure that the deletion vectors linked in the metadata are valid and match the corresponding data files.&lt;/p&gt;
&lt;p&gt;In earlier versions, this validation was executed across the entire table snapshot during plan initialization. If you had a 50 TB table and queried a single day, the planner still spent time validating deletion vectors for the entire table.&lt;/p&gt;
&lt;p&gt;PR #15653, introduced by @anoopj, optimizes this process. It adds manifest partition pruning to deletion vector validation inside the &lt;code&gt;MergingSnapshotProducer&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Query Filter: WHERE event_date = &apos;2026-05-23&apos;
       │
       ▼
Partition Pruning Step
       │
       ├─► Skip Partition &apos;2026-05-22&apos; ──► Skip Deletion Vector Validation
       │
       └─► Read Partition &apos;2026-05-23&apos;  ──► Run Deletion Vector Validation
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this change, the query planner matches the query filter predicates against partition bounds before executing deletion vector checks. If a partition is pruned out, the engine skips validating the deletion vectors for the files in that partition. This change reduces planning CPU cycles and improves scan startup times for partitioned tables.&lt;/p&gt;
&lt;p&gt;For a detailed look at how hidden partitioning helps the query engine perform partition pruning and reduce metadata scan sizes, refer to the &lt;a href=&quot;/posts/2024-5-partitioning-with-apache-iceberg-deep-dive/&quot;&gt;Apache Iceberg Partitioning Deep Dive&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/iceberg-1-11-0/deletion-vector-pruning.png&quot; alt=&quot;Diagram showing deletion vector validation pruning skipping skipped partitions during planning&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Scheduled Credential Lifecycle Refresh (PR #15678, #15732, #15696)&lt;/h2&gt;
&lt;p&gt;To security-harden data lakehouses, platforms avoid using long-lived storage credentials. Instead, query engines authenticate using temporary tokens vended by the REST catalog or cloud identity providers. These credentials typically have short lifespans, often expiring after one hour.&lt;/p&gt;
&lt;p&gt;This security model creates issues for long-running operations. If a massive query runs for 90 minutes, or a streaming Flink sink runs continuously, the temporary credentials expire mid-job. When the client attempts to write new files or fetch manifests after the expiration window, the storage client throws an authentication exception, failing the job.&lt;/p&gt;
&lt;p&gt;The 1.11.0 release resolves this lifecycle problem. PR #15678 (by @danielcweeks) and PR #15732 (by @nastra) add scheduled refresh threads to the &lt;code&gt;S3FileIO&lt;/code&gt; client. A parallel change in PR #15696 (by @nastra) implements the same capability for &lt;code&gt;GCSFileIO&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Query Thread (Reads/Writes Data)
       │
       ├───────► Token Expiration Approaching (e.g. at 50 minutes)
       │
Background Refresh Thread
       │
       ├───────► Send Request to Catalog ──► Fetch New Credentials
       │
       └───────► Update S3FileIO/GCSFileIO Credentials In-Memory
       │
Query Thread (Continues without interruption)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The credential refresh system runs a background daemon thread that tracks token expiration times. The lifecycle is controlled by the following properties:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;s3.credentials-refresh-interval&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;None&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;The interval at which the S3FileIO refresh thread checks and requests new credentials.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gcs.oauth2.token-expires-in&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;3600&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The lifespan in seconds of the GCS OAuth token before the refresh thread requests a new one.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Before the active credential expires, the background thread automatically polls the catalog&apos;s &lt;code&gt;/v1/tokens&lt;/code&gt; endpoint for refreshed tokens and updates the file system client in-memory. The main query and write threads continue to run without interruption, eliminating query failures caused by expired credentials.&lt;/p&gt;
&lt;p&gt;This scheduled refresh is particularly important in enterprise Kubernetes environments. In these setups, pod identities are linked to IAM roles with short-lived session durations. By handling this rotation transparently within the &lt;code&gt;FileIO&lt;/code&gt; layer, Iceberg removes the need for engines to restart or implement external wrapper scripts to manage token state.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/iceberg-1-11-0/credential-refresh-thread.png&quot; alt=&quot;Sequence flow showing background thread updating AWS/GCS storage client credentials before expiration&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Spark Streaming Triggers and Z-Ordering (PR #13824, #15706)&lt;/h2&gt;
&lt;p&gt;Apache Spark remains the primary engine for heavy write workloads and batch compaction in Iceberg tables. Version 1.11.0 includes several updates to improve Spark streaming and layout optimization.&lt;/p&gt;
&lt;h3&gt;Trigger.AvailableNow Support (PR #13824, #14026)&lt;/h3&gt;
&lt;p&gt;PR #13824, introduced by @alexprosak, adds support for the &lt;code&gt;AvailableNow&lt;/code&gt; trigger in Spark Structured Streaming. This change was also backported to Spark 4.0, 3.5, and 3.4 in PR #14026.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Continuous Trigger:
[Read Batch 1] -&amp;gt; [Write] -&amp;gt; [Wait] -&amp;gt; [Read Batch 2] -&amp;gt; [Write] -&amp;gt; (Runs indefinitely)

AvailableNow Trigger:
[Scan All Available Data] -&amp;gt; [Process Batch 1] -&amp;gt; [Process Batch 2] -&amp;gt; [Write All] -&amp;gt; [Graceful Shutdown]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In Spark streaming, the default trigger runs continuously in the background, consuming resources even when no new files are arriving. The alternative &lt;code&gt;Once&lt;/code&gt; trigger processes only a single batch and shuts down, which can leave data unprocessed if a large backlog has accumulated.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;AvailableNow&lt;/code&gt; trigger combines the benefits of both approaches. It scans the source for all available data, splits the workload into consecutive micro-batches, processes them all in a single run, and then shuts down the streaming context. This is configured in PySpark as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Configure Trigger.AvailableNow with Iceberg source and sink
query = spark.readStream \
    .format(&amp;quot;iceberg&amp;quot;) \
    .load(&amp;quot;prod_catalog.db.events&amp;quot;) \
    .writeStream \
    .format(&amp;quot;iceberg&amp;quot;) \
    .trigger(availableNow=True) \
    .option(&amp;quot;checkpointLocation&amp;quot;, &amp;quot;/mnt/checkpoints/events&amp;quot;) \
    .toTable(&amp;quot;prod_catalog.db.events_compacted&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This trigger configuration allows data platforms to run streaming ingestion pipelines as scheduled cron jobs, reducing cluster idle time.&lt;/p&gt;
&lt;h3&gt;Z-Order Column Collision Validation (PR #15706)&lt;/h3&gt;
&lt;p&gt;PR #15706, introduced by @YanivZalach, addresses a failure mode during Z-order layout optimization. Spark uses the internal column name &lt;code&gt;ICEZVALUE&lt;/code&gt; during Z-order sorting. If a user table already contained a column named &lt;code&gt;ICEZVALUE&lt;/code&gt;, the compaction process failed or generated incorrect sort orders.&lt;/p&gt;
&lt;p&gt;The update adds strict schema validation that checks for column name collisions before running Z-order compactions, preventing data corruption.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/iceberg-1-11-0/spark-streaming-available-now.png&quot; alt=&quot;Comparison of continuous micro-batch streaming vs Spark AvailableNow trigger batches&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Flink Post-Commit Maintenance and Branch Compaction (PR #15566, #15672, #14148)&lt;/h2&gt;
&lt;p&gt;Apache Flink is the standard engine for real-time streaming ingestion into Iceberg tables. Streaming ingestion has different write characteristics than batch ingestion, often writing many small files at high frequency. Iceberg 1.11.0 adds features to manage these files directly within Flink pipelines.&lt;/p&gt;
&lt;h3&gt;Flink Post-Commit Maintenance (PR #15566, #15667)&lt;/h3&gt;
&lt;p&gt;PR #15566, written by @mxm, adds support for arbitrary post-commit maintenance tasks inside the Flink &lt;code&gt;IcebergSink&lt;/code&gt; builder. This is also backported to active Flink branches in PR #15667.&lt;/p&gt;
&lt;p&gt;During streaming ingestion, Flink commits data to the Iceberg table at every checkpoint. These frequent commits generate a large number of small manifest files. With the new post-commit interface, you can attach background maintenance tasks directly to the sink:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;// Configure Flink sink with post-commit compaction
IcebergSink.forRowData(dataStream, tableLoader)
    .table(icebergTable)
    .tableLoader(tableLoader)
    .writeParallelism(4)
    .distributionMode(DistributionMode.HASH)
    .postCommitMaintenance(
        PostCommitMaintenance.builder()
            .optimizeDataFiles(true)
            .rewriteManifests(true)
            .build()
    )
    .append();
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After a commit succeeds, Flink runs compaction and manifest cleaning tasks in the background, keeping the table structure optimized without requiring external scheduler jobs.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Flink Stream Ingestion
       │
       ▼
[Commit Data File (Checkpoint)]
       │
       ├───────► Post-Commit Trigger
       │
       ▼
[Background Maintenance Action (RewriteDataFiles / Compaction)]
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Flink Branch Compaction Support (PR #15672, #15690)&lt;/h3&gt;
&lt;p&gt;PR #15672, also written by @mxm, adds branch support to the Flink &lt;code&gt;RewriteDataFiles&lt;/code&gt; action.&lt;/p&gt;
&lt;p&gt;Historically, Flink&apos;s background compaction actions could only run on the table&apos;s main branch. In modern architectures, engines often ingest experimental data or staging runs into separate table branches. Flink can now run file compaction directly on these non-main branches, keeping staging and experiment branches organized before they are merged back.&lt;/p&gt;
&lt;h3&gt;Flink Metadata Columns (PR #14148)&lt;/h3&gt;
&lt;p&gt;PR #14148, introduced by @Guosmilesmile, exposes metadata columns to Flink readers.&lt;/p&gt;
&lt;p&gt;Flink applications can now read the &lt;code&gt;_row_id&lt;/code&gt; and &lt;code&gt;_last_updated_sequence_number&lt;/code&gt; system columns. This is useful for CDC (Change Data Capture) reconciliation pipelines that need to track the exact ingestion sequence of rows.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/iceberg-1-11-0/flink-maintenance-branch-support.png&quot; alt=&quot;Flink data sink writing data and executing post-commit branch compaction on experimental branch&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;JSON to Variant Mapping and Spec Cleanups (PR #13137, #14045)&lt;/h2&gt;
&lt;p&gt;The Variant type is a key part of the Iceberg V3 specification, designed to store semi-structured data using a binary representation that supports predicate pushdown. Iceberg 1.11.0 refines this integration across multiple engines.&lt;/p&gt;
&lt;h3&gt;Variant Type Validation (PR #13137, #14081)&lt;/h3&gt;
&lt;p&gt;PR #13137 (by @manirajv06) and PR #14081 (by @geruh) add schema validation and filtering rules for the Variant type in Parquet metrics.&lt;/p&gt;
&lt;p&gt;These updates ensure that Parquet file readers can extract column-level statistics from nested variant structures. This allows the query engine to prune files based on nested variant fields, improving query performance.&lt;/p&gt;
&lt;h3&gt;Trino Variant Type Mapping&lt;/h3&gt;
&lt;p&gt;In parallel, query engine connectors are adopting these changes. Trino now maps its native &lt;code&gt;JSON&lt;/code&gt; type to Iceberg&apos;s Variant type in V3 tables. This means you can write JSON data from Trino and query it with predicate pushdown, avoiding the performance penalties of plain string JSON.&lt;/p&gt;
&lt;h3&gt;Positional Deletes with Row Data Deprecated (PR #14045)&lt;/h3&gt;
&lt;p&gt;PR #14045, written by @pvary, deprecates positional delete files that embed row data.&lt;/p&gt;
&lt;p&gt;In Iceberg V2, positional delete files could store the actual deleted row data alongside the file path and row offset. While this design saved a join step during reads, it duplicated data in the delete files, increasing storage costs and metadata complexity.&lt;/p&gt;
&lt;p&gt;The community has deprecated this option in favor of Deletion Vectors, simplifying the V3 read path.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Table Upgrade Path and Connector Compatibility&lt;/h2&gt;
&lt;p&gt;All V3 features : manifest list encryption, deletion vectors, Variant types, geospatial types, and nanosecond timestamps , require upgrading your tables to format version 3.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Existing V2 Table
       │
       ├───────► Run: ALTER TABLE events SET TBLPROPERTIES (&apos;format-version&apos; = &apos;3&apos;)
       │
Upgraded V3 Table
       │
       ├───────► New writes use Deletion Vectors and Variant type
       └───────► Existing data files are left untouched (no rewrite required)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The upgrade is a metadata-only operation executed using SQL:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Upgrade an existing table to Iceberg V3 format version
ALTER TABLE my_catalog.schema.events
SET TBLPROPERTIES (&apos;format-version&apos; = &apos;3&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This operation updates the &lt;code&gt;format-version&lt;/code&gt; pointer in the table&apos;s metadata JSON. It does not rewrite your existing data files, which remain in place and continue to be readable.&lt;/p&gt;
&lt;p&gt;New writes to the table will adopt V3 features automatically. For example, subsequent update or delete statements will write deletion vectors instead of positional delete files.&lt;/p&gt;
&lt;h3&gt;Lifecycle Status Updates&lt;/h3&gt;
&lt;p&gt;Before planning your migration to V3, review the engine compatibility changes in Iceberg 1.11.0:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Java 11 Support Dropped:&lt;/strong&gt; Iceberg 1.11.0 drops support for Java 11. Core libraries and engine connectors now require &lt;strong&gt;Java 17&lt;/strong&gt; or &lt;strong&gt;Java 21&lt;/strong&gt;. Migrating to Java 17 was a critical decision for the community, allowing the codebase to utilize modern JVM language features (such as Java records, pattern matching, and enhanced switch expressions) to improve metadata parsing efficiency and reduce CPU utilization.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spark 3.4 Support Deprecated:&lt;/strong&gt; Support for Spark 3.4 is deprecated. Teams should migrate to Spark 3.5 or Spark 4.0+.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flink 1.19 Support Removed:&lt;/strong&gt; Flink 1.19 is no longer supported. The release adds support for &lt;strong&gt;Flink 2.1.0&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Make sure all query engines and toolchains in your lakehouse deployment support Iceberg V3 and Java 17 before upgrading production tables.&lt;/p&gt;
&lt;p&gt;For more on managing table maintenance and compaction strategies for your Iceberg tables, refer to the &lt;a href=&quot;/posts/2026-05-22-apache-iceberg-maintenance-compaction/&quot;&gt;Apache Iceberg Maintenance and Compaction Guide&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/iceberg-1-11-0/table-upgrade-path.png&quot; alt=&quot;Table upgrade timeline showing migration SQL and deprecated connector support list&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Apache Iceberg 1.11.0 is a significant release for the project. It moves beyond incremental enhancements to deliver major architectural updates.&lt;/p&gt;
&lt;p&gt;The unified File Format API restructures how Iceberg interacts with physical storage formats. This change makes it easier to integrate next-generation codecs designed for AI and high-performance workloads.&lt;/p&gt;
&lt;p&gt;At the same time, the stabilization of V3 features provides a production-ready path for deletion vectors, Variant data, geospatial types, and nanosecond precision. These features help organizations optimize query performance and reduce operational overhead.&lt;/p&gt;
&lt;p&gt;If you are running Iceberg V2 tables in production, evaluate your workloads to identify tables that will benefit from a V3 upgrade. In particular, tables with active update patterns or large JSON columns will see immediate performance gains.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Build Your Data Lakehouse Expertise&lt;/h3&gt;
&lt;p&gt;If you are designing, building, or managing modern data platforms, staying ahead of formatting specifications is critical. To deepen your understanding of these technologies, consider reading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&amp;quot;Architecting an Apache Iceberg Lakehouse&amp;quot;&lt;/strong&gt;: An architectural guide to designing open lakehouse platforms, managing catalog architectures, partition tuning, and optimizing table layouts for high-performance query execution engines.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Other Data Lakehouse Publications&lt;/strong&gt;: Practical books and reference materials covering hidden partitioning, metadata structure, schema evolution, and query acceleration engines in enterprise data systems.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Find these books and other lakehouse learning resources at &lt;a href=&quot;https://books.alexmerced.com&quot;&gt;books.alexmerced.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To query your newly upgraded Iceberg V3 tables with automatic file layout optimization, background partition-level compaction, reflection acceleration, and zero infrastructure management, start a free trial of Dremio Cloud at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Open Table Format Benchmarks: Why They Require Critical Evaluation</title><link>https://iceberglakehouse.com/posts/2026-05-22-open-table-format-benchmarks-guide/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-22-open-table-format-benchmarks-guide/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-04-29-apache-iceberg-masterclass-01...</description><pubDate>Fri, 22 May 2026 12:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-04-29-apache-iceberg-masterclass-01-table-formats/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The transition from traditional, closed data warehouses to open lakehouse architectures is one of the most significant shifts in modern data engineering. By decoupling storage formats from query processing engines, organizations can store their data in public cloud object storage while executing queries using specialized, high-performance engines. At the center of this transition are open table formats: Apache Iceberg, Delta Lake, and Apache Hudi.&lt;/p&gt;
&lt;p&gt;As organizations evaluate these formats, performance is frequently cited as a primary decision criterion. This focus has led to a flood of performance benchmarks published by vendors, cloud providers, and independent technology groups. These benchmarks, often utilizing standard industry test suites like TPC-H or TPC-DS, make bold claims about which format is the fastest, the most cost-effective, or the most scalable.&lt;/p&gt;
&lt;p&gt;However, for data architects and engineers, these benchmarks can be difficult to interpret. A benchmark published by one vendor may show that Delta Lake is multiple times faster than Apache Iceberg, while a study published by another show that Iceberg outperforms Delta Lake on identical hardware. This divergence arises because open table format performance is not a static property of the table layout itself. Instead, performance is the result of a complex interaction between the physical data layout, the query engine, the client library versions, and the underlying cloud infrastructure.&lt;/p&gt;
&lt;p&gt;This guide provides an in-depth analysis of the open table format benchmark landscape. We examine the methodology behind these benchmarks and explain why they must be taken with a grain of salt. We analyze the technical variables that influence performance results, outline a workload-centric evaluation framework, and provide a guide for selecting a format based on ecosystem alignment. Finally, we discuss how query engines like the Dremio engine accelerate performance across all formats.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;1. The Landscape of Table Format Benchmarks&lt;/h2&gt;
&lt;p&gt;To understand why table format benchmarks often yield conflicting results, we must first look at the landscape of published studies and the methodologies they use.&lt;/p&gt;
&lt;h3&gt;Vendor-Sponsored Benchmarks&lt;/h3&gt;
&lt;p&gt;Most benchmarks available online are sponsored by companies that have a financial interest in the adoption of a specific table format. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Databricks&lt;/strong&gt; frequently publishes benchmarks demonstrating the performance of Delta Lake, often highlighting its integration with the proprietary Photon execution engine.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Onehouse&lt;/strong&gt;, founded by the creators of Apache Hudi, publishes studies showing Hudi&apos;s strength in handling real-time ingestion, incremental processing, and mutable workloads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tabular&lt;/strong&gt; (subsequently acquired by Snowflake), founded by the creators of Apache Iceberg, published analyses detailing Iceberg&apos;s efficiency in query planning, metadata pruning, and cross-engine operations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These studies typically rely on standard benchmarks like TPC-DS, which simulates a retail product supplier with complex query patterns, or TPC-H, which focuses on ad-hoc decision support queries. While these benchmarks are designed to be objective, the configurations used to run them can be adjusted to favor one format or engine over another.&lt;/p&gt;
&lt;h3&gt;Independent and Community Studies&lt;/h3&gt;
&lt;p&gt;In addition to vendor publications, independent consulting groups, open-source communities, and data engineering teams at large enterprises have published their own evaluations. These studies often focus on practical engineering concerns, such as how easily a format integrates with multiple engines (such as Spark, Trino, Flink, and Dremio), the difficulty of setting up write transactions, and how performance degrades over time as data is modified.&lt;/p&gt;
&lt;p&gt;These independent reports often paint a more balanced picture. They show that while one format might have a slight edge in write speed, another might offer better read performance for specific query types, while a third might provide superior integration with legacy catalog systems.&lt;/p&gt;
&lt;h3&gt;The Problem of Static Benchmarks&lt;/h3&gt;
&lt;p&gt;The primary limitation of any published benchmark is that it represents a static snapshot of a specific system configuration at a single point in time. In the fast-moving world of open-source data engineering, table formats and query engines are updated constantly. An optimization introduced in a new release of Apache Iceberg or Delta Lake can render previous benchmark results obsolete. Therefore, relying on external benchmarks to make long-term architectural decisions is a risky approach.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;2. Why Benchmarks Require Critical Evaluation&lt;/h2&gt;
&lt;p&gt;When evaluating table format benchmarks, data teams must look beyond the headline numbers and scrutinize the underlying methodologies. There are four critical technical variables that influence performance results and can easily bias findings if not controlled for.&lt;/p&gt;
&lt;h3&gt;1. Compute Architecture and Sizing&lt;/h3&gt;
&lt;p&gt;The hardware and compute resources used to execute a benchmark have a significant impact on the results. In cloud environments, compute performance is determined by virtual machine instance types, CPU generation, memory allocation, and disk configurations.&lt;/p&gt;
&lt;p&gt;For instance, some query engines rely heavily on local NVMe SSDs to cache data blocks, while others read data directly from cloud object storage. If a benchmark is executed on instances with fast local storage, an engine that caches data aggressively will show a massive performance advantage. However, this advantage may not translate to a production environment if the data team deploys the engine on standard instances without local SSDs to reduce costs.&lt;/p&gt;
&lt;p&gt;Furthermore, cluster size plays a key role. A benchmark run on a small 4-node cluster may highlight metadata parsing overhead as a major bottleneck, whereas the same test run on a 100-node cluster might be limited by network bandwidth or object storage rate limits. When reviewing a benchmark, you must verify that the compute sizing aligns with what your organization can realistically afford and manage in production.&lt;/p&gt;
&lt;h3&gt;2. Engine Selection, Version, and Format Optimizations&lt;/h3&gt;
&lt;p&gt;A table format does not execute queries; a query engine does. Therefore, a table format benchmark is always a test of a specific query engine running on top of that format.&lt;/p&gt;
&lt;p&gt;This distinction is critical because query engines are optimized for different table formats in varying degrees. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Databricks has spent years optimizing its runtime and the Photon engine to work with Delta Lake. Running a benchmark on Databricks comparing Delta Lake and Apache Iceberg may show that Delta Lake is faster, but this is a reflection of Databricks&apos; engine optimizations rather than an inherent limitation of the Iceberg format.&lt;/li&gt;
&lt;li&gt;Similarly, engines like Snowflake, Trino, and Dremio have built native connectors for Apache Iceberg that optimize partition pruning, statistical calculations, and metadata reads.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Furthermore, the version of the engine used in the test can skew results. An engine version that lacks support for Iceberg metadata caching or Delta Lake log pruning will perform poorly compared to a newer version where those features are enabled. If a benchmark compares Format A using a highly optimized engine version and Format B using a basic or legacy connector, the results are misleading.&lt;/p&gt;
&lt;h3&gt;3. Version of the Table Format Libraries&lt;/h3&gt;
&lt;p&gt;Like query engines, the table format libraries themselves evolve rapidly. Each minor and major release introduces performance improvements, write optimizations, and metadata fixes.&lt;/p&gt;
&lt;p&gt;For example, early versions of Apache Iceberg relied primarily on Copy-on-Write for updates, which introduced write latency. The introduction of Merge-on-Read, positional delete files, and optimized delete file writers in later versions of the Iceberg library significantly reduced update overhead.&lt;/p&gt;
&lt;p&gt;If a benchmark compares Delta Lake using its latest library version against an older version of Apache Iceberg, it will fail to capture these improvements. When analyzing benchmark results, data teams must verify the exact library versions used and confirm that all formats were configured with equivalent performance-enhancing features.&lt;/p&gt;
&lt;h3&gt;4. Object Storage Latency and Network Throughput&lt;/h3&gt;
&lt;p&gt;Because open lakehouses store data in cloud object storage (such as Amazon S3, Google Cloud Storage, or Azure Blob Storage), storage latency and network throughput are major performance variables.&lt;/p&gt;
&lt;p&gt;Cloud storage is not a local disk; it is a distributed network service. API requests to retrieve data (GET requests) and metadata (LIST requests) are subject to network latency, throttling, and request limits. If an engine must make hundreds of metadata requests to resolve a query, object storage latency will dominate the execution time.&lt;/p&gt;
&lt;p&gt;To mitigate this, engines use metadata caching and file bundling. If a benchmark does not account for object storage fluctuations or fails to configure connection pooling and caching properly, the results will vary from run to run. A format that appears fast in one test may appear slow in another due to transient network congestion or S3 throttling during the run.&lt;/p&gt;
&lt;h3&gt;5. Architectural Differences in Statistical Metadata Storage&lt;/h3&gt;
&lt;p&gt;Another factor that biases query planning benchmarks is how each format stores and structures statistical metadata for query planning. Query engines rely on column-level statistics (such as minimum/maximum values, null counts, and value counts) to prune files and determine join orders.&lt;/p&gt;
&lt;p&gt;Apache Iceberg stores these statistics directly inside its manifest files, partitioned hierarchically. This allows the query coordinator to prune irrelevant files during the metadata scanning phase without reading the data files themselves. In contrast, Delta Lake stores statistics within its transaction log JSON files and periodically bundles them into Parquet checkpoints. If a query engine does not natively optimize the parsing of these transaction logs, or if the checkpoint files become large, the metadata scanning phase will experience significant delays. Apache Hudi utilizes a dedicated Metadata Table with internal index structures (like bloom filters and column statistics) to accelerate query planning.&lt;/p&gt;
&lt;p&gt;When a benchmark is run using an engine that has deep integration with one format&apos;s metadata structures but lacks equivalent optimization for another, query planning times will be artificially skewed. An engine might plan an Iceberg query in milliseconds but take seconds to plan a Delta Lake query simply because its log parser is inefficient.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;3. The Pitfall of Synthetic Benchmarks vs. Real-World Workloads&lt;/h2&gt;
&lt;p&gt;Standard benchmarks like TPC-DS are valuable for testing engine capabilities, but they do not replicate the query patterns, ingestion frequencies, or data structures of a real-world enterprise. Relying solely on synthetic benchmarks can lead data teams to select a format that is ill-suited for their actual workloads.&lt;/p&gt;
&lt;h3&gt;TPC-DS vs. Enterprise Query Patterns&lt;/h3&gt;
&lt;p&gt;TPC-DS queries are complex, multi-way joins designed to simulate business intelligence reporting on a clean, relational schema. In contrast, real-world data lakehouse queries often target denormalized tables, json fields, or flat files.&lt;/p&gt;
&lt;p&gt;Moreover, TPC-DS datasets are static. The data is loaded once, and a series of read-only queries are executed. In a production lakehouse, data is constantly updated. Ingestion pipelines write micro-batches, updates are applied via CDC, and maintenance tasks compact files in the background. A table format that excels in a read-only TPC-DS test may perform poorly under the pressure of concurrent reads and writes.&lt;/p&gt;
&lt;p&gt;Let us illustrate this with a practical example. Suppose our target workload consists of joining customer profiles and order transactions. We will use our standard schema names: &lt;code&gt;analytics.customers&lt;/code&gt; and &lt;code&gt;analytics.orders&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;In a synthetic benchmark, the query might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT c.state, COUNT(o.order_id) AS total_orders, SUM(o.amount) AS total_amount
  FROM rest_catalog.analytics.customers c
  JOIN rest_catalog.analytics.orders o
    ON c.customer_id = o.customer_id
 WHERE o.order_date &amp;gt;= DATE &apos;2026-01-01&apos;
 GROUP BY c.state;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To optimize this query in a synthetic benchmark, a developer might manually partition the tables by &lt;code&gt;state&lt;/code&gt; and &lt;code&gt;order_date&lt;/code&gt;, and run compaction immediately before executing the read test.&lt;/p&gt;
&lt;p&gt;In a real-world production environment, however:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;analytics.orders&lt;/code&gt; table receives continuous writes, creating small files.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;analytics.customers&lt;/code&gt; table undergoes SCD Type 2 updates, producing delete files.&lt;/li&gt;
&lt;li&gt;The query is executed concurrently by dozens of business analysts.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Under these conditions, a static benchmark cannot predict which format will perform best. The performance will be determined by how efficiently the format handles concurrent updates, how quickly the engine processes delete files on the fly, and how well background compaction jobs run without locking the tables.&lt;/p&gt;
&lt;h3&gt;Concurrency Control and Write Conflicts&lt;/h3&gt;
&lt;p&gt;Synthetic read benchmarks ignore the impact of write conflicts and commit locking patterns. In production environments, tables must handle concurrent operations.&lt;/p&gt;
&lt;p&gt;All three formats employ Optimistic Concurrency Control (OCC) to handle simultaneous writes. OCC assumes that conflicts are rare; when a transaction begins, it reads the current table state and prepares its updates. At commit time, it checks if another writer has modified the table. If a conflict is detected, the transaction must retry.&lt;/p&gt;
&lt;p&gt;The implementation details of these commits differ by format and catalog:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Apache Iceberg&lt;/strong&gt; relies on catalog-specific locking mechanisms. When using an AWS Glue catalog, it uses Glue&apos;s native locking; when using Nessie, it relies on git-like commit operations; when using a REST catalog, the locking is handled by the REST server. This provides fine-grained control and scalability.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta Lake&lt;/strong&gt; historically relied on file system drivers (such as S3 multi-part upload or Azure storage leases) or Databricks-managed control planes to coordinate ACID commits on object storage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Hudi&lt;/strong&gt; supports multi-writer configurations using lock providers like ZooKeeper, Hive Metastore, or Amazon DynamoDB.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If a concurrent ingestion benchmark is executed on a catalog that has poor locking performance on cloud object storage, commit times will spike, and transactions will fail due to write collisions. Read-only synthetic benchmarks completely ignore these operational constraints, hiding the performance degradation that occurs under heavy multi-writer pressure.&lt;/p&gt;
&lt;h3&gt;Workload-Centric Evaluation&lt;/h3&gt;
&lt;p&gt;Rather than relying on vendor-published TPC-DS numbers, data teams should implement a workload-centric evaluation framework. This involves:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Defining Your Data Profile&lt;/strong&gt;: Identify your ingestion patterns (such as batch, micro-batch, or streaming), update frequencies, delete rates, and data volume.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Identifying Core Queries&lt;/strong&gt;: Select a representative set of queries from your actual workloads, including BI dashboards, ad-hoc reports, and ML training pipelines.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Building a Test Environment&lt;/strong&gt;: Deploy your chosen query engines (such as Spark, Trino, or Dremio) on hardware configurations that match your production budget.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Simulating Production Operations&lt;/strong&gt;: Run ingestion jobs, apply updates, execute queries, and run compaction tasks simultaneously. Measure query latencies, write speeds, and storage footprints under this realistic load.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By executing a workload-centric test, you will obtain performance metrics that directly reflect how the formats will behave in your environment, allowing you to make an informed architectural decision.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;4. Ecosystem Alignment: How to Choose a Format&lt;/h2&gt;
&lt;p&gt;While performance is an important consideration, the long-term success of a lakehouse initiative depends heavily on ecosystem alignment and tooling support. A table format is only as useful as the tools that can read and write it.&lt;/p&gt;
&lt;p&gt;Let us outline the key factors to consider when choosing a table format, and establish clear guidelines for when to select each option.&lt;/p&gt;
&lt;h3&gt;Why Apache Iceberg is the Standard Default&lt;/h3&gt;
&lt;p&gt;For most organization-wide lakehouse initiatives, Apache Iceberg should be the default choice. This recommendation is based on Iceberg&apos;s design, governance model, and broad ecosystem support.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Engine-Neutral Specification&lt;/strong&gt;: Iceberg was designed from the beginning to be independent of any single processing engine. It was developed at Netflix to solve scalability issues with Hive, and is governed by the Apache Software Foundation. This ensures that no single vendor controls the roadmap or restricts features to proprietary platforms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Broad Engine Support&lt;/strong&gt;: Because it is engine-neutral, virtually every major data tool has built native integration for Apache Iceberg. This includes open-source engines (Spark, Flink, Trino, Presto), cloud query engines (AWS Athena, Google BigQuery, Snowflake), and modern acceleration layers like the Dremio engine. This multi-engine compatibility prevents vendor lock-in, allowing data teams to use Spark for ingestion, Trino for ad-hoc queries, Snowflake for BI, and Dremio for sub-second acceleration, all querying the same physical Iceberg files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Advanced Features&lt;/strong&gt;: Iceberg offers robust implementations of hidden partitioning, schema evolution, partition evolution, snapshot isolation, and time travel. These features make it highly stable and easy to manage at scale.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security and Credential Vending&lt;/strong&gt;: The Iceberg REST Catalog specification introduces a standardized protocol for credential vending. When a query engine connects to the REST catalog, the catalog server authenticates the client and dynamically generates short-lived, scoped access tokens (such as temporary S3 credentials or SAS tokens) for the specific files the client needs to read or write. This removes the need to distribute broad, permanent storage-level IAM credentials directly to every query engine or client application. This standardized security protocol distinguishes Iceberg from Delta Lake, which has historically relied on direct filesystem-level authentication configurations or platform-specific access layers.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;When to Use Delta Lake&lt;/h3&gt;
&lt;p&gt;Delta Lake, originally created by Databricks, is a high-performance table format with a large user base. It should be considered under the following conditions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;All-in on the Databricks Stack&lt;/strong&gt;: If your organization&apos;s data platform is built entirely on Databricks, Delta Lake is the logical choice. Databricks provides native optimization features for Delta Lake that may not be available for other formats within their environment.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accepting Vendor Lock-In&lt;/strong&gt;: While Delta Lake is open-source, its roadmap and primary optimizations are heavily driven by Databricks. If you choose Delta Lake, you must accept that the best performance and newest features may require running Databricks runtimes, and that integrating Delta Lake with non-Spark engines (like Snowflake or BigQuery) may introduce additional configurations.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;When to Use DuckLake&lt;/h3&gt;
&lt;p&gt;DuckLake is an emerging pattern tailored for lightweight or embedded data analytics workflows. It should be considered under these specific constraints:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;All-in on DuckDB&lt;/strong&gt;: If your analytical pipelines are designed around DuckDB for local, in-memory, or single-node processing, DuckLake offers an efficient mechanism to manage tables without the overhead of a full Hadoop or Spark cluster.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Small-Scale Analytics&lt;/strong&gt;: DuckLake is ideal for edge computing, local development, or small-scale BI dashboards where deploying a distributed catalog like Glue or Nessie is unnecessary.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;When to Explore Hudi, Paimon, or Fluss&lt;/h3&gt;
&lt;p&gt;For specialized architectures, formats like Apache Hudi, Apache Paimon, or Apache Fluss may be appropriate:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;High-Frequency Streaming Upserts&lt;/strong&gt;: If your primary workload is real-time streaming ingestion with high rates of row-level updates and deletes (such as a financial trading log or real-time inventory system), Apache Hudi should be evaluated. Hudi was designed specifically for incremental processing and features advanced indexing and merge strategies that optimize streaming writes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Paimon and Fluss&lt;/strong&gt;: These formats are designed to integrate tightly with real-time stream processing engines like Apache Flink. If your architecture is built around continuous streaming queries, real-time materialized views, and low-latency stream analytics, Paimon and Fluss provide optimized storage layers that match Flink&apos;s processing model.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Format Comparison Matrix&lt;/h3&gt;
&lt;p&gt;To help guide the decision-making process, let us summarize the key differences in a structured format:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&quot;text-align:left&quot;&gt;Capability / Feature&lt;/th&gt;
&lt;th style=&quot;text-align:left&quot;&gt;Apache Iceberg&lt;/th&gt;
&lt;th style=&quot;text-align:left&quot;&gt;Delta Lake&lt;/th&gt;
&lt;th style=&quot;text-align:left&quot;&gt;Apache Hudi&lt;/th&gt;
&lt;th style=&quot;text-align:left&quot;&gt;DuckLake&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&quot;text-align:left&quot;&gt;&lt;strong&gt;Governance&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Apache Foundation&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Linux Foundation&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Apache Foundation&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Open Source (Community)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;text-align:left&quot;&gt;&lt;strong&gt;Primary Sponsor&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Multi-vendor (Snowflake, AWS, Cloudera)&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Databricks&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Onehouse&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;DuckDB Community&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;text-align:left&quot;&gt;&lt;strong&gt;Ecosystem Neutrality&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;High (Excellent cross-engine support)&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Medium (Optimized for Databricks)&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Medium (Optimized for Spark/Flink)&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Low (Focused on DuckDB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;text-align:left&quot;&gt;&lt;strong&gt;Streaming Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Good (Merge-on-Read)&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Good (Buffered writes)&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Excellent (Advanced indexing)&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;N/A (Batch/Local)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;text-align:left&quot;&gt;&lt;strong&gt;SCD Type 2 / CDC&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Excellent (SQL MERGE / MoR support)&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Excellent (SQL MERGE support)&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Excellent (Incremental log)&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Basic (Manual writes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;text-align:left&quot;&gt;&lt;strong&gt;Best Engine Fit&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Spark, Trino, Dremio, Athena, Snowflake&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Databricks Spark, Photon&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Spark Streaming, Flink&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;DuckDB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2&gt;5. Query Acceleration with the Dremio Engine&lt;/h2&gt;
&lt;p&gt;Regardless of which open table format you choose, query performance is ultimately determined by the execution engine. High-performance engines like the Dremio engine are designed to accelerate queries across these formats, minimizing the latency difference between the layouts.&lt;/p&gt;
&lt;p&gt;Let us look at how the Dremio engine optimizes queries over Apache Iceberg, Delta Lake, and other open tables.&lt;/p&gt;
&lt;h3&gt;Vectorized Memory Layouts (Apache Arrow)&lt;/h3&gt;
&lt;p&gt;The Dremio engine uses Apache Arrow as its in-memory data representation. Arrow is a columnar format designed for fast analytical processing.&lt;/p&gt;
&lt;p&gt;When Dremio executes a query, it reads data from the underlying Parquet files (the physical storage format for Iceberg, Delta, and Hudi) and maps it directly into Arrow memory buffers. Because Arrow is structured column-by-column, the engine can execute calculations across arrays of values in a single CPU instruction using SIMD. This vectorized execution model reduces CPU cycles and speeds up aggregations, joins, and filters over large tables.&lt;/p&gt;
&lt;p&gt;Furthermore, the Dremio engine executes its query processing operations directly in off-heap memory using C++ memory allocations. This design prevents Java Virtual Machine (JVM) garbage collection overhead, which often limits the performance of Java-based execution engines under heavy analytical loads. The in-memory data structures are aligned with modern CPU cache architectures, maximizing memory locality and minimizing hardware cache misses.&lt;/p&gt;
&lt;p&gt;Additionally, Dremio integrates with Apache Arrow Flight, a high-performance framework for streaming large datasets over the network. Arrow Flight replaces legacy JDBC and ODBC serialization protocols with a stream-oriented gRPC interface. This allows client applications, such as Python pandas/Polars scripts or business intelligence tools, to stream query results from Dremio directly into client memory without the CPU-intensive serialization and deserialization steps required by traditional database drivers, delivering end-to-end data acceleration.&lt;/p&gt;
&lt;h3&gt;Metadata Caching&lt;/h3&gt;
&lt;p&gt;Query planning in an open lakehouse requires reading metadata files to locate the data files that match a query&apos;s filters. If the metadata files are stored in remote cloud object storage, the latency of listing and reading these files can slow down query planning.&lt;/p&gt;
&lt;p&gt;Dremio mitigates this latency by maintaining a local coordinator metadata cache. The Dremio coordinator automatically caches table metadata (such as Iceberg manifests or Delta Lake logs) on fast local storage. When a query is submitted, Dremio resolves the file paths from its local cache, reducing query planning time to milliseconds. This metadata caching allows Dremio to bypass the latency of S3 or ADLS API calls during query planning.&lt;/p&gt;
&lt;h3&gt;SQL Reflections&lt;/h3&gt;
&lt;p&gt;Dremio&apos;s Data Reflections provide a powerful optimization mechanism. A Reflection is an accelerated physical representation of a table&apos;s data, stored in Parquet format and managed automatically by Dremio.&lt;/p&gt;
&lt;p&gt;When a query is run, Dremio&apos;s optimizer (powered by Apache Calcite) checks if an active Reflection can satisfy the query. If a match is found, Dremio automatically rewrites the query plan to read from the Reflection instead of scanning the source table.&lt;/p&gt;
&lt;p&gt;This is highly beneficial for table format evaluations. For example, if a query on a raw Iceberg table is slow due to complex joins, we can build a Raw or Aggregation Reflection. The queries will be redirected to the Reflection, delivering sub-second responses without requiring us to change our SQL queries or migrate our table format.&lt;/p&gt;
&lt;h3&gt;Positional and Equality Delete Caching&lt;/h3&gt;
&lt;p&gt;As explored in previous guides, writing updates to Merge-on-Read tables generates positional or equality delete files. At read time, these delete files must be applied to the base data files to filter out modified rows, which is a major performance bottleneck for query engines.&lt;/p&gt;
&lt;p&gt;The Dremio engine optimizes this reconciliation by caching delete files in memory. When reading an Iceberg table, Dremio loads the delete information into memory. As the vectorized reader scans base data files, it filters out deleted rows in memory on the fly. This caching eliminates the need to repeatedly fetch delete files from cloud storage, minimizing the read penalty associated with Merge-on-Read datasets.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;6. Real-World Execution: Running a Comparative Workload Query&lt;/h2&gt;
&lt;p&gt;To show how the Dremio engine accelerates a typical analytical workload across these tables, let us write a benchmark query that aggregates sales performance from our standard tables: &lt;code&gt;analytics.orders&lt;/code&gt; and &lt;code&gt;analytics.customers&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Suppose we want to compute the total revenue and order counts for customers in California (&apos;CA&apos;) and New York (&apos;NY&apos;) for orders placed in the first half of 2026. The SQL query is structured as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT c.name, c.email, COUNT(o.order_id) AS order_count, SUM(o.amount) AS total_spent
  FROM rest_catalog.analytics.customers c
  JOIN rest_catalog.analytics.orders o
    ON c.customer_id = o.customer_id
 WHERE c.state IN (&apos;CA&apos;, &apos;NY&apos;)
   AND o.order_date BETWEEN DATE &apos;2026-01-01&apos; AND DATE &apos;2026-06-30&apos;
 GROUP BY c.name, c.email
 ORDER BY total_spent DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let us look at how the Dremio engine accelerates this join execution:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Metadata Pruning&lt;/strong&gt;: Dremio queries the coordinator metadata cache to resolve the active snapshots for both tables. It uses the filter &lt;code&gt;o.order_date BETWEEN &apos;2026-01-01&apos; AND &apos;2026-06-30&apos;&lt;/code&gt; to prune manifest files, identifying only the Parquet files that contain data for that date range.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Column Projection&lt;/strong&gt;: Dremio reads only the column chunks needed for the query (&lt;code&gt;customer_id&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;email&lt;/code&gt;, &lt;code&gt;state&lt;/code&gt; from customers; &lt;code&gt;order_id&lt;/code&gt;, &lt;code&gt;customer_id&lt;/code&gt;, &lt;code&gt;order_date&lt;/code&gt;, &lt;code&gt;amount&lt;/code&gt; from orders). It ignores all other columns, reducing network IO.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vectorized Hash Join&lt;/strong&gt;: Dremio loads the pruned data into Apache Arrow memory buffers. It builds a hash table on &lt;code&gt;c.customer_id&lt;/code&gt; using SIMD operations, and streams the order data through the hash table to perform the join.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reflection Acceleration&lt;/strong&gt;: If we have a Raw Reflection containing the joined tables, Dremio&apos;s Calcite optimizer rewrites the plan to read directly from the Reflection, bypassing the join operation entirely and returning results in milliseconds.&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h2&gt;7. Conclusion&lt;/h2&gt;
&lt;p&gt;When choosing an open table format, data teams should look beyond the numbers presented in vendor-sponsored benchmarks. Performance is not an intrinsic property of a table format; it is a dynamic outcome determined by cluster sizing, engine optimizations, library versions, and cloud network storage latency.&lt;/p&gt;
&lt;p&gt;Synthetic benchmarks like TPC-DS are useful for testing engine boundaries, but they do not reflect the complexity of real-world pipelines. A workload-centric evaluation using your own data profiles, ingestion rates, and query patterns is the only reliable way to evaluate performance.&lt;/p&gt;
&lt;p&gt;In terms of ecosystem alignment, Apache Iceberg is the recommended default choice for most enterprises due to its open governance and broad cross-engine support. Delta Lake is appropriate for Databricks-centric environments, DuckLake is ideal for small-scale DuckDB workflows, and specialized formats like Hudi or Paimon should be reserved for high-frequency streaming architectures.&lt;/p&gt;
&lt;p&gt;Finally, by deploying high-performance execution layers like the Dremio engine, organizations can accelerate queries across all formats. Through vectorized execution using Apache Arrow, metadata caching, and SQL Reflections, Dremio delivers the speed required for modern analytics, allowing data teams to focus on building value rather than worrying about formatting constraints.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Apache Iceberg SCD Type 2 and CDC Patterns: Building Historical Lakehouse Tables</title><link>https://iceberglakehouse.com/posts/2026-05-22-apache-iceberg-scd-type-2-cdc-patterns/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-22-apache-iceberg-scd-type-2-cdc-patterns/</guid><description>
In modern analytical systems, capturing and preserving historical changes in data is a critical requirement. Organizations need to understand not onl...</description><pubDate>Fri, 22 May 2026 11:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In modern analytical systems, capturing and preserving historical changes in data is a critical requirement. Organizations need to understand not only the current state of their business entities but also how those entities evolved over time. Traditionally, relational databases and proprietary data warehouses were the primary platforms for tracking these changes. However, with the rise of open data lakehouses, data engineers are now tasked with implementing these historical data patterns on top of cloud object storage.&lt;/p&gt;
&lt;p&gt;Two fundamental techniques for managing historical data in analytical repositories are Change Data Capture (CDC) and Slowly Changing Dimensions (SCD), particularly Slowly Changing Dimension Type 2 (SCD Type 2). Change Data Capture represents the process of identifying and capturing changes made to a source database and delivering those changes to a downstream target. Slowly Changing Dimension Type 2 is a modeling technique where historical records are preserved by creating new rows for each change, using start dates, end dates, and active flags to denote validity periods.&lt;/p&gt;
&lt;p&gt;Implementing CDC and SCD Type 2 on top of object storage was historically difficult because of the limitations of legacy file formats like Parquet, ORC, or JSON. Without transactional guarantees, concurrency controls, or row-level mutability, updating a data lake table meant overwriting large collections of files. This introduced significant operational overhead, risked data corruption, and limited the frequency of data updates.&lt;/p&gt;
&lt;p&gt;Apache Iceberg addresses these challenges by bringing transaction guarantees, metadata management, and row-level operations to the open lakehouse. By decoupled metadata from physical storage, Iceberg allows data engines to perform ACID transactions, run snapshot isolation queries, and manage historical table versions without modifying downstream readers.&lt;/p&gt;
&lt;p&gt;This comprehensive guide details the design patterns, architectural considerations, and implementation steps for building robust CDC and SCD Type 2 pipelines on Apache Iceberg. We will look at deduplication strategies, PySpark and Spark SQL merge patterns, time travel analysis, and how high-performance engines like the Dremio engine optimize query execution over these historical table structures.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;1. Changing Dimension Patterns and CDC Pipelines in Open Lakehouses&lt;/h2&gt;
&lt;p&gt;To build a reliable historical pipeline, we must first define the core patterns and the architecture that moves data from relational database management systems (RDBMS) to the open lakehouse. Change Data Capture and Slowly Changing Dimensions operate at different stages of the data integration lifecycle, but they work together to ensure that no state transitions are lost.&lt;/p&gt;
&lt;h3&gt;The Role of CDC Ingestion&lt;/h3&gt;
&lt;p&gt;Change Data Capture pipelines capture row-level modifications (inserts, updates, and deletes) from source systems in real time or near-real time. A typical CDC architecture relies on a log-based capture mechanism, such as Debezium, which monitors the transaction logs of databases like PostgreSQL, MySQL, or Oracle.&lt;/p&gt;
&lt;p&gt;Once a change is detected, the CDC engine publishes the event to a message broker like Apache Kafka or Apache Pulsar. These events contain both the old state of the row and the new state, along with metadata such as the change type (Insert, Update, Delete) and a source database transaction timestamp.&lt;/p&gt;
&lt;p&gt;The events are then ingested from the broker by a stream processing framework or micro-batch engine, such as Apache Spark Structured Streaming, Apache Flink, or a custom PySpark execution job. The ingestion process writes these change events into a landing table or raw storage zone in the lakehouse. This raw zone, often called the bronze layer, acts as an append-only log of all changes captured from the source systems.&lt;/p&gt;
&lt;h3&gt;Slowly Changing Dimension Type 2 (SCD Type 2) Mechanics&lt;/h3&gt;
&lt;p&gt;While the landing zone stores a flat history of changes, analytical users need a clean, structured representation of this history. This is where Slowly Changing Dimension Type 2 is applied.&lt;/p&gt;
&lt;p&gt;SCD Type 2 tracks historical updates by creating a new record for every change. Unlike SCD Type 1, which simply overwrites existing values, SCD Type 2 retains the old values and appends the new values as a separate row. To manage these historical rows, the table schema is enriched with specific tracking columns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Business Key / Natural Key&lt;/strong&gt;: The identifier that links the records back to the source system entity (such as &lt;code&gt;customer_id&lt;/code&gt; or &lt;code&gt;order_id&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Surrogate Key&lt;/strong&gt;: A unique identifier generated within the lakehouse to identify each version of a record.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Effective Start Timestamp&lt;/strong&gt;: The time when the record version became valid.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Effective End Timestamp&lt;/strong&gt;: The time when the record version ceased to be valid. If the record is active, this value is set to a distant future date (such as &lt;code&gt;9999-12-31 23:59:59&lt;/code&gt;) or left as null.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Is Current Flag&lt;/strong&gt;: A boolean indicator (true or false) or a status string indicating whether the record version represents the active state of the entity.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By storing data in this format, users can easily query the current state of any customer or order by filtering for rows where &lt;code&gt;is_current&lt;/code&gt; is true. At the same time, users can query the state of any entity at a specific point in time by writing filter clauses that match the validity window: &lt;code&gt;target_timestamp BETWEEN effective_start_timestamp AND effective_end_timestamp&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Architectural Challenges in the Lakehouse&lt;/h3&gt;
&lt;p&gt;Building SCD Type 2 tables in a data lake house introduces unique challenges compared to relational databases.&lt;/p&gt;
&lt;p&gt;In a traditional database, updates and inserts are processed using index lookups and row-level locks. In a cloud data lakehouse, physical data is stored in immutable Parquet files. To apply an update, the processing engine must read the existing files, identify the affected rows, modify their metadata or data content, and write new files.&lt;/p&gt;
&lt;p&gt;This process can lead to the small files problem. If CDC updates are applied too frequently, the table becomes cluttered with small Parquet files, leading to high metadata overhead and slow read performance. Additionally, handling concurrent transactions (such as concurrent write jobs and query engines accessing the table) requires strong isolation levels to prevent dirty reads or lost updates.&lt;/p&gt;
&lt;p&gt;Apache Iceberg solves these issues by using snapshot-based metadata. When Spark or another engine writes data, it creates a new snapshot that points to the new files while retaining pointers to the old files. Readers continue to query the older snapshot until the write transaction commits. This design enables concurrent reads and writes, snapshot isolation, and efficient file pruning, making Iceberg the ideal format for CDC and SCD Type 2 pipelines.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;2. Ingestion Pipelines with Change Data Capture&lt;/h2&gt;
&lt;p&gt;Before applying change logs to our target SCD Type 2 dimension tables, we must capture and land the source change stream. We will define our core schemas based on the canonical entities: &lt;code&gt;analytics.orders&lt;/code&gt; and &lt;code&gt;analytics.customers&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For our examples, we will model the source changes representing customer data. Let us examine the structures of the source data and target tables.&lt;/p&gt;
&lt;h3&gt;Source Schema and Target Schema Definitions&lt;/h3&gt;
&lt;p&gt;The source changes are captured from a relational table mapped to &lt;code&gt;analytics.customers&lt;/code&gt;. The source schema contains the following fields:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;customer_id&lt;/code&gt; (Integer): The primary key of the customer record.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;name&lt;/code&gt; (String): The name of the customer.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;email&lt;/code&gt; (String): The email address of the customer.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;state&lt;/code&gt; (String): The geographical state where the customer resides.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;signup_date&lt;/code&gt; (Date): The date the customer registered.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In our CDC stream, every event includes metadata fields indicating the operation type and order of operations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;op_type&lt;/code&gt; (String): The operation type, where &apos;I&apos; is insert, &apos;U&apos; is update, and &apos;D&apos; is delete.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;source_ts&lt;/code&gt; (Timestamp): The timestamp when the operation occurred in the source database. This timestamp is critical for ordering events.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The historical target table, which we will maintain in Apache Iceberg, requires additional metadata fields to track the SCD Type 2 states:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;customer_id&lt;/code&gt; (Integer): The customer identifier.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;name&lt;/code&gt; (String): The customer&apos;s name.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;email&lt;/code&gt; (String): The customer&apos;s email.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;state&lt;/code&gt; (String): The customer&apos;s state.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;signup_date&lt;/code&gt; (Date): The registration date.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;effective_start&lt;/code&gt; (Timestamp): The start timestamp of the record version.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;effective_end&lt;/code&gt; (Timestamp): The end timestamp of the record version.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;is_current&lt;/code&gt; (Boolean): A flag indicating if the row is the active version.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Handling Late-Arriving Records and Source Ordering&lt;/h3&gt;
&lt;p&gt;One of the most complex aspects of building a CDC pipeline is handling out-of-order data and late-arriving records. Network latencies, retry mechanisms, and distributed queuing systems can cause CDC events to arrive out of order.&lt;/p&gt;
&lt;p&gt;For example, an update event for &lt;code&gt;customer_id = 100&lt;/code&gt; might arrive before the insert event for the same customer. If the pipeline processes events strictly in the order they are received, it could overwrite a newer state with an older state, leading to data corruption.&lt;/p&gt;
&lt;p&gt;To prevent this, the pipeline must implement an ordering mechanism based on a source-provided sequence number or timestamp (&lt;code&gt;source_ts&lt;/code&gt;). Before writing a batch of updates to the Iceberg table, the ingestion engine must deduplicate the batch, retaining only the latest event for each business key.&lt;/p&gt;
&lt;p&gt;If multiple updates for the same business key exist within a single micro-batch, we must perform windowing logic to select the record with the maximum &lt;code&gt;source_ts&lt;/code&gt;. This ensures that only the latest state is merged into the historical table, while previous states are either discarded or written as intermediate historical records.&lt;/p&gt;
&lt;h3&gt;CDC Ingestion Pipeline Flow&lt;/h3&gt;
&lt;p&gt;Let us look at the visual representation of how CDC event streams flow from source databases to the target Apache Iceberg tables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-mermaid&quot;&gt;graph TD
    A[&amp;quot;Source RDBMS (Customers Table)&amp;quot;] --&amp;gt;|Transaction Log Capture| B[&amp;quot;Debezium / CDC Agent&amp;quot;]
    B --&amp;gt;|Publish JSON/Avro Events| C[&amp;quot;Kafka Topic (customer-cdc)&amp;quot;]
    C --&amp;gt;|Read Micro-Batches| D[&amp;quot;PySpark Ingestion Engine&amp;quot;]
    D --&amp;gt;|Deduplicate and Window| E[&amp;quot;PySpark Deduplicated Batch&amp;quot;]
    E --&amp;gt;|SCD Type 2 Merge Logic| F[&amp;quot;Apache Iceberg Target (analytics.customers)&amp;quot;]
    F --&amp;gt;|Query Execution| G[&amp;quot;Dremio Query Engine&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This architecture ensures that source database modifications are captured immediately, staged in a broker, cleaned of duplicates, and merged transactionally into the target Iceberg dataset.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;3. Implementing CDC Pipelines with PySpark&lt;/h2&gt;
&lt;p&gt;Now that we understand the ingestion architecture, we will build a PySpark execution script that reads CDC events, deduplicates them, and prepares them for the SCD Type 2 merge process.&lt;/p&gt;
&lt;h3&gt;PySpark Environment Setup&lt;/h3&gt;
&lt;p&gt;To build the pipeline, we must configure a Spark Session with the Apache Iceberg dependencies. The following code configures a local spark environment to write to an Iceberg REST catalog.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number, desc
from pyspark.sql.window import Window

/* Configure PySpark with the Apache Iceberg runtime jar and REST catalog */
spark = SparkSession.builder \
    .appName(&amp;quot;IcebergCDCPipeline&amp;quot;) \
    .config(&amp;quot;spark.jars.packages&amp;quot;, &amp;quot;org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0&amp;quot;) \
    .config(&amp;quot;spark.sql.extensions&amp;quot;, &amp;quot;org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.rest_catalog&amp;quot;, &amp;quot;org.apache.iceberg.spark.SparkCatalog&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.rest_catalog.type&amp;quot;, &amp;quot;rest&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.rest_catalog.uri&amp;quot;, &amp;quot;http://localhost:8181&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.rest_catalog.warehouse&amp;quot;, &amp;quot;s3a://lakehouse-warehouse/&amp;quot;) \
    .getOrCreate()
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Deduplication Logic for Micro-Batches&lt;/h3&gt;
&lt;p&gt;Before merging new CDC data into the target Iceberg table, we must handle scenarios where a single micro-batch contains multiple records for the same customer.&lt;/p&gt;
&lt;p&gt;The following python function receives a raw DataFrame of incoming CDC events, defines a window partitioned by &lt;code&gt;customer_id&lt;/code&gt; and ordered by the source timestamp &lt;code&gt;source_ts&lt;/code&gt; descending, and extracts only the latest change event for each customer.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def deduplicate_cdc_batch(raw_df):
    /*
    Deduplicate the incoming CDC DataFrame using window functions.
    This selects the latest state based on the source transaction timestamp.
    */
    window_spec = Window.partitionBy(&amp;quot;customer_id&amp;quot;).orderBy(desc(&amp;quot;source_ts&amp;quot;))

    deduplicated_df = raw_df \
        .withColumn(&amp;quot;row_num&amp;quot;, row_number().over(window_spec)) \
        .filter(col(&amp;quot;row_num&amp;quot;) == 1) \
        .drop(&amp;quot;row_num&amp;quot;)

    return deduplicated_df
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let us write a test script that simulates a raw micro-batch of customer events containing updates, inserts, and duplicate entries. We will execute the deduplication function and display the results.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Simulate raw CDC records, including duplicate keys for customer 1
raw_cdc_data = [
    (1, &amp;quot;Alice Smith&amp;quot;, &amp;quot;alice.smith@example.com&amp;quot;, &amp;quot;CA&amp;quot;, &amp;quot;2026-01-10&amp;quot;, &amp;quot;I&amp;quot;, &amp;quot;2026-05-22 10:00:00&amp;quot;),
    (1, &amp;quot;Alice Jones&amp;quot;, &amp;quot;alice.jones@example.com&amp;quot;, &amp;quot;NY&amp;quot;, &amp;quot;2026-01-10&amp;quot;, &amp;quot;U&amp;quot;, &amp;quot;2026-05-22 10:05:00&amp;quot;),
    (2, &amp;quot;Bob Miller&amp;quot;, &amp;quot;bob.miller@example.com&amp;quot;, &amp;quot;TX&amp;quot;, &amp;quot;2026-02-15&amp;quot;, &amp;quot;I&amp;quot;, &amp;quot;2026-05-22 10:01:00&amp;quot;),
    (3, &amp;quot;Charlie Davis&amp;quot;, &amp;quot;charlie@example.com&amp;quot;, &amp;quot;FL&amp;quot;, &amp;quot;2026-03-20&amp;quot;, &amp;quot;I&amp;quot;, &amp;quot;2026-05-22 10:02:00&amp;quot;),
    (2, &amp;quot;Bob Miller&amp;quot;, &amp;quot;bob.m@example.com&amp;quot;, &amp;quot;TX&amp;quot;, &amp;quot;2026-02-15&amp;quot;, &amp;quot;U&amp;quot;, &amp;quot;2026-05-22 10:08:00&amp;quot;)
]

# Define schema for the incoming CDC stream
cdc_columns = [&amp;quot;customer_id&amp;quot;, &amp;quot;name&amp;quot;, &amp;quot;email&amp;quot;, &amp;quot;state&amp;quot;, &amp;quot;signup_date&amp;quot;, &amp;quot;op_type&amp;quot;, &amp;quot;source_ts&amp;quot;]

# Create DataFrame
raw_cdc_df = spark.createDataFrame(raw_cdc_data, schema=cdc_columns)

# Cast source_ts to timestamp
raw_cdc_df = raw_cdc_df.withColumn(&amp;quot;source_ts&amp;quot;, col(&amp;quot;source_ts&amp;quot;).cast(&amp;quot;timestamp&amp;quot;))

# Deduplicate batch
cleaned_cdc_df = deduplicate_cdc_batch(raw_cdc_df)
cleaned_cdc_df.show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output of the deduplication script shows that the multiple records for Alice (customer 1) and Bob (customer 2) are resolved, leaving only the records corresponding to the latest timestamp.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;4. Merging CDC Streams using Spark SQL&lt;/h2&gt;
&lt;p&gt;Once the batch is deduplicated, we must merge it into the target Apache Iceberg SCD Type 2 table. This operation requires updating active rows that have changed (end-dating them) and inserting new rows (both for new entities and for the new versions of updated entities).&lt;/p&gt;
&lt;p&gt;We can achieve this using the Spark SQL &lt;code&gt;MERGE INTO&lt;/code&gt; statement. Let us examine the logic and compare how Copy-on-Write and Merge-on-Read table modes handle this operation.&lt;/p&gt;
&lt;h3&gt;Initializing the Target Table&lt;/h3&gt;
&lt;p&gt;First, we must create the target Iceberg table if it does not exist. We will include the SCD Type 2 columns in the definition.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE IF NOT EXISTS rest_catalog.analytics.customers (
    customer_id INT,
    name STRING,
    email STRING,
    state STRING,
    signup_date DATE,
    effective_start TIMESTAMP,
    effective_end TIMESTAMP,
    is_current BOOLEAN
)
USING iceberg
TBLPROPERTIES (
    &apos;write.format.default&apos; = &apos;parquet&apos;,
    &apos;write.merge.mode&apos; = &apos;merge-on-read&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;The SCD Type 2 Merge Pattern&lt;/h3&gt;
&lt;p&gt;A standard &lt;code&gt;MERGE INTO&lt;/code&gt; statement matches incoming records with target rows based on the business key. However, in an SCD Type 2 target, a single business key can have multiple rows representing historical states. We must ensure that we match only against the active row (where &lt;code&gt;is_current&lt;/code&gt; is true).&lt;/p&gt;
&lt;p&gt;Furthermore, when an update occurs, we must perform two actions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Update the existing active record in the target table by setting &lt;code&gt;is_current&lt;/code&gt; to false and &lt;code&gt;effective_end&lt;/code&gt; to the update timestamp.&lt;/li&gt;
&lt;li&gt;Insert the new version of the record with &lt;code&gt;is_current&lt;/code&gt; to true, &lt;code&gt;effective_start&lt;/code&gt; to the update timestamp, and &lt;code&gt;effective_end&lt;/code&gt; to the distant future date.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Because standard SQL &lt;code&gt;MERGE INTO&lt;/code&gt; executes a single action per matched row, we must structure our merge query to output multiple rows for each update. A common pattern is to write a query that joins the target table with the deduplicated changes, generates rows for the updates, and merges that combined set into the target table.&lt;/p&gt;
&lt;p&gt;Let us look at the SQL query that executes this SCD Type 2 merge pattern. We will run this query inside our PySpark script.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Register the deduplicated updates as a temporary view
cleaned_cdc_df.createOrReplaceTempView(&amp;quot;deduped_updates&amp;quot;)

# Execute the SCD Type 2 merge query
spark.sql(&amp;quot;&amp;quot;&amp;quot;
MERGE INTO rest_catalog.analytics.customers AS target
USING (
    /*
    This subquery prepares the staging data for the merge.
    It contains:
    - New records (inserts) that do not exist in the target table.
    - Updated records that must be inserted as new active versions.
    - An update marker record to update the end dates of existing active records.
    */
    SELECT
        NULL AS merge_key,
        u.customer_id,
        u.name,
        u.email,
        u.state,
        u.signup_date,
        u.source_ts AS effective_start,
        CAST(&apos;9999-12-31 23:59:59&apos; AS TIMESTAMP) AS effective_end,
        true AS is_current
    FROM deduped_updates u

    UNION ALL

    SELECT
        u.customer_id AS merge_key,
        u.customer_id,
        u.name,
        u.email,
        u.state,
        u.signup_date,
        u.source_ts AS effective_start,
        u.source_ts AS effective_end,
        false AS is_current
    FROM deduped_updates u
    JOIN rest_catalog.analytics.customers t
      ON u.customer_id = t.customer_id
     WHERE t.is_current = true
) AS source
ON target.customer_id = source.merge_key
   AND target.is_current = true
WHEN MATCHED THEN
    /* For existing active records matched by the update marker, end-date the record */
    UPDATE SET
        target.effective_end = source.effective_end,
        target.is_current = false
WHEN NOT MATCHED THEN
    /* For new inserts and new active versions of updated records, insert the row */
    INSERT (
        customer_id,
        name,
        email,
        state,
        signup_date,
        effective_start,
        effective_end,
        is_current
    )
    VALUES (
        source.customer_id,
        source.name,
        source.email,
        source.state,
        source.signup_date,
        source.effective_start,
        source.effective_end,
        source.is_current
    );
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Analysis of the Merge Query Logic&lt;/h3&gt;
&lt;p&gt;Let us break down the mechanics of this merge query:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;** Staging Subquery (&lt;code&gt;USING&lt;/code&gt; clause)**: The subquery performs a union operation to create a unified change feed. The first branch selects the incoming change records, setting &lt;code&gt;merge_key&lt;/code&gt; to null. Because &lt;code&gt;merge_key&lt;/code&gt; is null, these records will never match an existing row in the target table based on the &lt;code&gt;ON target.customer_id = source.merge_key&lt;/code&gt; clause. This forces the merge engine to execute the &lt;code&gt;WHEN NOT MATCHED&lt;/code&gt; action, inserting these rows as new active versions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Second Branch of the Union&lt;/strong&gt;: This branch selects the incoming changes that match an active row in the target table. It retains the &lt;code&gt;customer_id&lt;/code&gt; as the &lt;code&gt;merge_key&lt;/code&gt;. When joined in the &lt;code&gt;ON&lt;/code&gt; clause, this &lt;code&gt;merge_key&lt;/code&gt; matches the existing active target row. This triggers the &lt;code&gt;WHEN MATCHED&lt;/code&gt; action.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Matched Action&lt;/strong&gt;: The matched action updates the existing active target row. It sets &lt;code&gt;is_current&lt;/code&gt; to false and updates &lt;code&gt;effective_end&lt;/code&gt; to the source transaction timestamp. This effectively retires the old version of the record.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Not Matched Action&lt;/strong&gt;: The not matched action inserts the new rows. This includes both brand-new customer inserts and the new active versions of existing customers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This single-pass merge query guarantees transactional safety. Because Apache Iceberg supports atomic multi-file commits, either all modifications (end-dating old rows and writing new rows) succeed, or none do. This avoids partial updates that could leave the SCD Type 2 table in an inconsistent state.&lt;/p&gt;
&lt;h3&gt;Copy-on-Write vs. Merge-on-Read Mechanics&lt;/h3&gt;
&lt;p&gt;Apache Iceberg supports two modes for writing data modifications: Copy-on-Write (CoW) and Merge-on-Read (MoR). The choice of mode significantly impacts merge query performance and file layouts.&lt;/p&gt;
&lt;p&gt;In Copy-on-Write mode, any update or delete operation requires the engine to read the existing Parquet data file, apply the modification in memory, and write a new Parquet file containing the modified data. For SCD Type 2 operations, CoW means that updating the &lt;code&gt;effective_end&lt;/code&gt; date of an old active record forces Spark to rewrite the entire data file containing that row.&lt;/p&gt;
&lt;p&gt;This introduces high write latency and write amplification, especially for large tables with low-frequency updates. However, CoW tables are highly optimized for read performance, as there are no extra files to resolve at query time.&lt;/p&gt;
&lt;p&gt;In Merge-on-Read mode, update and delete operations do not modify existing data files. Instead, the engine writes the modifications to separate files called delete files. There are two types of delete files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Positional Deletes&lt;/strong&gt;: These files store the file path and row position of the deleted or updated rows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Equality Deletes&lt;/strong&gt;: These files store the value of the columns (such as &lt;code&gt;customer_id&lt;/code&gt;) that identify the deleted rows.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When an update is executed in MoR mode, Spark writes the new data rows to a new data file, and writes the positions of the updated target rows to a positional delete file. This eliminates write amplification and speeds up ingestion times.&lt;/p&gt;
&lt;p&gt;The tradeoff occurs during read execution. When a query engine reads an MoR table, it must read the data files, read the delete files, and apply the deletes in memory to filter out modified rows. This can degrade read performance if the table is not compacted regularly.&lt;/p&gt;
&lt;p&gt;For high-frequency CDC workloads, Merge-on-Read is the recommended configuration. We will explore how query acceleration engines like Dremio mitigate the read penalty of MoR tables in a subsequent section.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;5. Reconstructing Historical Table States&lt;/h2&gt;
&lt;p&gt;One of the greatest benefits of implementing SCD Type 2 tables in Apache Iceberg is the ability to reconstruct historical states and query data exactly as it existed at any point in time. This is achieved using standard SQL queries, Iceberg&apos;s metadata tables, and the native time travel features.&lt;/p&gt;
&lt;h3&gt;Querying the Current State&lt;/h3&gt;
&lt;p&gt;To retrieve the active state of all entities, users can write a straightforward query filtering for rows where &lt;code&gt;is_current&lt;/code&gt; is true.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT customer_id, name, email, state, signup_date
  FROM rest_catalog.analytics.customers
 WHERE is_current = true;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query returns the active profile for each customer, corresponding to the latest changes captured from the source database.&lt;/p&gt;
&lt;h3&gt;Querying State at a Specific Timestamp (Point-in-Time Queries)&lt;/h3&gt;
&lt;p&gt;To query the state of a customer or the entire dataset at a historical point in time, we write filters against the &lt;code&gt;effective_start&lt;/code&gt; and &lt;code&gt;effective_end&lt;/code&gt; columns. For example, to check the state of customer records as they existed on &lt;code&gt;2026-05-22 10:03:00&lt;/code&gt;, we run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT customer_id, name, email, state
  FROM rest_catalog.analytics.customers
 WHERE CAST(&apos;2026-05-22 10:03:00&apos; AS TIMESTAMP)
       BETWEEN effective_start AND effective_end;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query filters out any record versions that were not yet active or had already been end-dated by that timestamp, providing an accurate view of the database state at that precise moment.&lt;/p&gt;
&lt;h3&gt;Time Travel via Iceberg Snapshots&lt;/h3&gt;
&lt;p&gt;In addition to querying the columns of an SCD Type 2 table, we can leverage Apache Iceberg&apos;s snapshot history. Every write operation in Iceberg creates a new snapshot. We can perform time travel queries by specifying a snapshot ID or a historical timestamp.&lt;/p&gt;
&lt;p&gt;When we use Iceberg time travel, the engine reads the table metadata as it existed at that snapshot, ignoring any files written after that snapshot was created.&lt;/p&gt;
&lt;p&gt;To retrieve the history of snapshots for our target table, we query the &lt;code&gt;snapshots&lt;/code&gt; metadata table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT committed_at, snapshot_id, parent_id, operation
  FROM rest_catalog.analytics.customers.snapshots
 ORDER BY committed_at DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once we identify the snapshot ID or timestamp we wish to inspect, we can execute a time travel query using PySpark or Spark SQL.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# PySpark Time Travel by Snapshot ID
snapshot_df = spark.read \
    .option(&amp;quot;snapshot-id&amp;quot;, 8901234567890123456) \
    .table(&amp;quot;rest_catalog.analytics.customers&amp;quot;)

# PySpark Time Travel by Timestamp
historical_timestamp = &amp;quot;2026-05-22 10:02:00&amp;quot;
time_travel_df = spark.read \
    .option(&amp;quot;as-of-timestamp&amp;quot;, int(spark.sql(f&amp;quot;select unix_millis(cast(&apos;{historical_timestamp}&apos; as timestamp))&amp;quot;).collect()[0][0])) \
    .table(&amp;quot;rest_catalog.analytics.customers&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also write time travel queries directly in Spark SQL:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Query the table as of a specific system timestamp */
SELECT *
  FROM rest_catalog.analytics.customers
       TIMESTAMP AS OF &apos;2026-05-22 10:02:00&apos;;

/* Query the table as of a specific snapshot ID */
SELECT *
  FROM rest_catalog.analytics.customers
       VERSION AS OF 8901234567890123456;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Contrast Between SCD Type 2 and Time Travel&lt;/h3&gt;
&lt;p&gt;It is important to distinguish between SCD Type 2 point-in-time queries and Iceberg metadata time travel:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SCD Type 2 Point-in-Time&lt;/strong&gt;: This query searches the &lt;em&gt;business&lt;/em&gt; history. It answers: &amp;quot;What was the customer&apos;s active email in the source system on May 22?&amp;quot; Even if we update the table today, the history remains stored in the rows of the table.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Iceberg Metadata Time Travel&lt;/strong&gt;: This query searches the &lt;em&gt;system&lt;/em&gt; history. It answers: &amp;quot;What did the customer table look like inside our lakehouse before we ran our morning Spark ingestion job?&amp;quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If an incorrect merge operation corrupts data, metadata time travel allows us to inspect the pre-merge state and restore the table to the last known good snapshot. SCD Type 2, on the other hand, tracks the logical business transitions.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;6. Query Acceleration with the Dremio Engine&lt;/h2&gt;
&lt;p&gt;While Spark is excellent for orchestrating heavy ETL merge operations, analytical users and business intelligence tools require fast, sub-second responses when querying these tables. Querying SCD Type 2 tables can be computationally expensive due to the complex date filters, joins, and the presence of delete files in Merge-on-Read tables.&lt;/p&gt;
&lt;p&gt;The Dremio engine acts as an acceleration layer that sits directly on top of open table formats like Apache Iceberg, delivering rapid query performance. Let us analyze the mechanisms that the Dremio engine uses to optimize queries over SCD Type 2 and CDC datasets.&lt;/p&gt;
&lt;h3&gt;Vectorized Memory Layouts (Apache Arrow)&lt;/h3&gt;
&lt;p&gt;At the core of the Dremio engine&apos;s execution model is Apache Arrow, a columnar, in-memory data representation. When Dremio processes a query, it reads data from Parquet files and loads it directly into Arrow memory buffers.&lt;/p&gt;
&lt;p&gt;Because Arrow and Parquet share a columnar structure, Dremio can transfer data from disk to memory with minimal CPU overhead. The vectorized execution model allows Dremio to apply filter conditions (such as &lt;code&gt;is_current = true&lt;/code&gt; or &lt;code&gt;target_timestamp BETWEEN effective_start AND effective_end&lt;/code&gt;) across arrays of values simultaneously, leveraging Single Instruction Multiple Data (SIMD) hardware capabilities. This is far faster than row-by-row processing, resulting in significant speedups for complex historical queries.&lt;/p&gt;
&lt;h3&gt;Metadata Caching&lt;/h3&gt;
&lt;p&gt;To plan a query, an engine must first parse the Iceberg metadata tree, starting from the table metadata JSON file, resolving the manifest list, and reading the individual manifest files. If the target catalog or cloud storage repository suffers from network latency, this metadata resolution phase can add seconds to query execution times.&lt;/p&gt;
&lt;p&gt;The Dremio engine solves this problem by using a local coordinator metadata cache. The Dremio coordinator node automatically caches Iceberg metadata files locally. When a query is submitted, Dremio reads the manifest trees from its fast local cache instead of making multiple API calls to object storage. This reduces query planning latency to milliseconds.&lt;/p&gt;
&lt;h3&gt;Automatic Query Rewrites with Data Reflections&lt;/h3&gt;
&lt;p&gt;One of Dremio&apos;s most powerful acceleration features is Data Reflections. A Reflection is an optimized physical representation of a table&apos;s data, stored in Parquet format, that is managed automatically by Dremio.&lt;/p&gt;
&lt;p&gt;For SCD Type 2 tables, we can define two types of Reflections:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Raw Reflections&lt;/strong&gt;: These store a copy of the table sorted or partitioned by columns frequently used in query filters, such as &lt;code&gt;is_current&lt;/code&gt; or &lt;code&gt;customer_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aggregation Reflections&lt;/strong&gt;: These store pre-computed aggregates and dimensions for reporting queries.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When a user executes a query, the Dremio query optimizer (which uses Apache Calcite) analyzes the query plan. If it finds a matching Reflection, it automatically rewrites the query to read from the Reflection instead of scanning the raw Iceberg table. This translation is completely transparent to the user; the user queries the original table, and Dremio accelerates the query behind the scenes.&lt;/p&gt;
&lt;p&gt;For instance, if we build a Raw Reflection partitioned by &lt;code&gt;is_current&lt;/code&gt;, Dremio can satisfy queries for the active customer profile by reading only the slice of the Reflection where &lt;code&gt;is_current&lt;/code&gt; is true, avoiding scans of the historical rows.&lt;/p&gt;
&lt;h3&gt;Positional and Equality Delete File Caching&lt;/h3&gt;
&lt;p&gt;As discussed earlier, running &lt;code&gt;MERGE INTO&lt;/code&gt; updates on Merge-on-Read Iceberg tables produces positional or equality delete files. When reading these tables, engines must merge these delete files with the base data files.&lt;/p&gt;
&lt;p&gt;This reconciliation is a major performance bottleneck in open lakehouses. If an engine has to fetch delete files from remote storage and apply them on the fly for every query, read latencies will grow.&lt;/p&gt;
&lt;p&gt;The Dremio engine optimizes this process by caching positional and equality delete files in memory. During execution, Dremio loads these delete sets into memory structures. As the vectorized reader scans base Parquet data files, it cross-references the cached delete keys and drops excluded rows on the fly in memory. By avoiding repetitive object storage accesses for delete files, Dremio makes querying Merge-on-Read Iceberg tables nearly as fast as querying Copy-on-Write tables.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;7. Operational Best Practices for Historical Dimensions&lt;/h2&gt;
&lt;p&gt;To maintain high performance and prevent storage costs from expanding, data engineers must run regular maintenance tasks on historical Iceberg tables. Let us review the primary operations required to manage these datasets.&lt;/p&gt;
&lt;h3&gt;Running Compaction&lt;/h3&gt;
&lt;p&gt;Over time, continuous CDC ingestion will result in the accumulation of many small files and delete files. We must compact these files into larger, optimized Parquet blocks.&lt;/p&gt;
&lt;p&gt;We can run compaction procedures in Apache Spark. The &lt;code&gt;rewrite_data_files&lt;/code&gt; procedure merges small data files and applies active deletes, creating clean consolidated Parquet files.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Run compaction on the customer table to merge small files and apply deletes */
CALL rest_catalog.system.rewrite_data_files(
    table =&amp;gt; &apos;analytics.customers&apos;,
    options =&amp;gt; map(
        &apos;max-file-group-size-bytes&apos;, &apos;536870912&apos;, /* 512MB */
        &apos;min-input-files&apos;, &apos;5&apos;
    )
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For large tables, we can configure sort-based compaction or Z-Order sorting to place related records close together on disk. This improves file pruning for point-in-time queries.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Compact data files using Z-Order sorting on customer_id and state */
CALL rest_catalog.system.rewrite_data_files(
    table =&amp;gt; &apos;analytics.customers&apos;,
    strategy =&amp;gt; &apos;sort&apos;,
    sort_order =&amp;gt; &apos;ZORDER(customer_id, state)&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Snapshot Expiration and Orphan File Cleanup&lt;/h3&gt;
&lt;p&gt;Each compaction run and merge operation creates new snapshots, but the old snapshots and files remain in storage to support time travel. If left unchecked, this historical data will increase cloud storage costs.&lt;/p&gt;
&lt;p&gt;To manage this, we must configure a snapshot expiration policy. Expiring old snapshots removes the metadata pointers and makes the associated data files eligible for physical deletion.&lt;/p&gt;
&lt;p&gt;The following Spark statement expires snapshots older than 14 days, ensuring that we maintain a reasonable history window while controlling costs.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Expire snapshots older than 14 days */
CALL rest_catalog.system.expire_snapshots(
    table =&amp;gt; &apos;analytics.customers&apos;,
    older_than =&amp;gt; TIMESTAMP AS OF (current_timestamp() - INTERVAL 14 DAYS),
    retain_last =&amp;gt; 10
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After expiring snapshots, some physical files may remain in storage if they were not referenced by any metadata files. We clean these up using the &lt;code&gt;remove_orphan_files&lt;/code&gt; procedure:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Remove orphan files that are no longer tracked by Iceberg metadata */
CALL rest_catalog.system.remove_orphan_files(
    table =&amp;gt; &apos;analytics.customers&apos;,
    older_than =&amp;gt; TIMESTAMP AS OF (current_timestamp() - INTERVAL 14 DAYS)
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Metadata File Pruning&lt;/h3&gt;
&lt;p&gt;In tables with high transaction volumes, the metadata JSON files themselves can become large. We can configure table properties to prune metadata files automatically after every commit.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE rest_catalog.analytics.customers SET TBLPROPERTIES (
    &apos;write.metadata.delete-after-commit.enabled&apos; = &apos;true&apos;,
    &apos;write.metadata.previous-versions-max&apos; = &apos;50&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These settings keep the metadata footprints small, improving query planning performance for both Spark and the Dremio engine.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;8. Verifying a Sample Orders Pipeline&lt;/h2&gt;
&lt;p&gt;To ensure completeness, we will write a PySpark integration script that applies the same CDC and SCD Type 2 patterns to our other canonical dataset: &lt;code&gt;analytics.orders&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Target Orders Table Structure&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;analytics.orders&lt;/code&gt; table has the following columns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;order_id&lt;/code&gt; (Integer): The primary identifier.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;customer_id&lt;/code&gt; (Integer): The customer who placed the order.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;order_date&lt;/code&gt; (Date): The date of the order.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;status&lt;/code&gt; (String): The status of the order (such as &apos;PENDING&apos;, &apos;SHIPPED&apos;, &apos;DELIVERED&apos;).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;amount&lt;/code&gt; (Double): The financial amount.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;effective_start&lt;/code&gt; (Timestamp): SCD Type 2 start timestamp.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;effective_end&lt;/code&gt; (Timestamp): SCD Type 2 end timestamp.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;is_current&lt;/code&gt; (Boolean): Active status flag.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Code Implementation for Orders Ingestion&lt;/h3&gt;
&lt;p&gt;The following script initializes the target orders table, processes a micro-batch of order status updates, and applies the SCD Type 2 merge logic.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Initialize the target table in PySpark
spark.sql(&amp;quot;&amp;quot;&amp;quot;
CREATE TABLE IF NOT EXISTS rest_catalog.analytics.orders (
    order_id INT,
    customer_id INT,
    order_date DATE,
    status STRING,
    amount DOUBLE,
    effective_start TIMESTAMP,
    effective_end TIMESTAMP,
    is_current BOOLEAN
)
USING iceberg
TBLPROPERTIES (
    &apos;write.merge.mode&apos; = &apos;merge-on-read&apos;
);
&amp;quot;&amp;quot;&amp;quot;)

# Simulate raw CDC order transactions
raw_orders_data = [
    (1001, 1, &amp;quot;2026-05-20&amp;quot;, &amp;quot;PENDING&amp;quot;, 150.00, &amp;quot;2026-05-22 11:00:00&amp;quot;),
    (1001, 1, &amp;quot;2026-05-20&amp;quot;, &amp;quot;SHIPPED&amp;quot;, 150.00, &amp;quot;2026-05-22 11:15:00&amp;quot;),
    (1002, 2, &amp;quot;2026-05-21&amp;quot;, &amp;quot;PENDING&amp;quot;, 450.50, &amp;quot;2026-05-22 11:05:00&amp;quot;)
]

# Set schema for source streaming order events
orders_schema = [&amp;quot;order_id&amp;quot;, &amp;quot;customer_id&amp;quot;, &amp;quot;order_date&amp;quot;, &amp;quot;status&amp;quot;, &amp;quot;amount&amp;quot;, &amp;quot;source_ts&amp;quot;]
raw_orders_df = spark.createDataFrame(raw_orders_data, schema=orders_schema)
raw_orders_df = raw_orders_df.withColumn(&amp;quot;source_ts&amp;quot;, col(&amp;quot;source_ts&amp;quot;).cast(&amp;quot;timestamp&amp;quot;))

# Deduplicate micro-batch based on order_id and source timestamp
orders_window = Window.partitionBy(&amp;quot;order_id&amp;quot;).orderBy(desc(&amp;quot;source_ts&amp;quot;))
deduped_orders_df = raw_orders_df \
    .withColumn(&amp;quot;row_num&amp;quot;, row_number().over(orders_window)) \
    .filter(col(&amp;quot;row_num&amp;quot;) == 1) \
    .drop(&amp;quot;row_num&amp;quot;)

deduped_orders_df.createOrReplaceTempView(&amp;quot;deduped_orders&amp;quot;)

# Perform the SCD Type 2 merge on the orders table
spark.sql(&amp;quot;&amp;quot;&amp;quot;
MERGE INTO rest_catalog.analytics.orders AS target
USING (
    SELECT
        NULL AS merge_key,
        o.order_id,
        o.customer_id,
        o.order_date,
        o.status,
        o.amount,
        o.source_ts AS effective_start,
        CAST(&apos;9999-12-31 23:59:59&apos; AS TIMESTAMP) AS effective_end,
        true AS is_current
    FROM deduped_orders o

    UNION ALL

    SELECT
        o.order_id AS merge_key,
        o.order_id,
        o.customer_id,
        o.order_date,
        o.status,
        o.amount,
        o.source_ts AS effective_start,
        o.source_ts AS effective_end,
        false AS is_current
    FROM deduped_orders o
    JOIN rest_catalog.analytics.orders t
      ON o.order_id = t.order_id
     WHERE t.is_current = true
) AS source
ON target.order_id = source.merge_key
   AND target.is_current = true
WHEN MATCHED THEN
    UPDATE SET
        target.effective_end = source.effective_end,
        target.is_current = false
WHEN NOT MATCHED THEN
    INSERT (
        order_id,
        customer_id,
        order_date,
        status,
        amount,
        effective_start,
        effective_end,
        is_current
    )
    VALUES (
        source.order_id,
        source.customer_id,
        source.order_date,
        source.status,
        source.amount,
        source.effective_start,
        source.effective_end,
        source.is_current
    );
&amp;quot;&amp;quot;&amp;quot;)

# Display the resulting orders dataset
spark.table(&amp;quot;rest_catalog.analytics.orders&amp;quot;).show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output confirms that the PENDING state for order 1001 is updated to SHIPPED, and the validity windows are structured correctly.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;9. Conclusion&lt;/h2&gt;
&lt;p&gt;Implementing Slowly Changing Dimension Type 2 modeling and Change Data Capture pipelines on an open data lakehouse was once limited by file formats. Apache Iceberg removes these limitations by introducing ACID transactions, snapshot isolation, and native row-level write modes to standard cloud object storage.&lt;/p&gt;
&lt;p&gt;By leveraging PySpark and Spark SQL&apos;s &lt;code&gt;MERGE INTO&lt;/code&gt; capabilities, data engineers can design pipelines that deduplicate incoming streaming micro-batches, handle out-of-order records, and construct SCD Type 2 validity windows in a single commit. Decoupling storage from query execution also allows organizations to run multiple compute engines on the same data.&lt;/p&gt;
&lt;p&gt;For querying historical states, the Dremio engine offers substantial performance improvements. Through vectorized execution using Apache Arrow, local metadata caching, data reflections, and in-memory delete file caching, Dremio allows business intelligence tools to query complex SCD Type 2 tables with sub-second response times.&lt;/p&gt;
&lt;p&gt;Combining Apache Iceberg&apos;s transactional storage with Spark&apos;s processing capabilities and Dremio&apos;s query acceleration enables organizations to build robust, scalable, and high-performance historical data engines directly on top of open data lakehouse structures.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Setting Up an AWS-Native Open Lakehouse: Querying Apache Iceberg with AWS Athena and AWS Glue Catalog</title><link>https://iceberglakehouse.com/posts/2026-05-22-apache-iceberg-aws-athena-glue/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-22-apache-iceberg-aws-athena-glue/</guid><description>
The architecture of modern data platforms is undergoing a fundamental shift away from proprietary, monolithic data warehouses toward open data lakeho...</description><pubDate>Fri, 22 May 2026 10:30:00 GMT</pubDate><content:encoded>&lt;p&gt;The architecture of modern data platforms is undergoing a fundamental shift away from proprietary, monolithic data warehouses toward open data lakehouses. In an open lakehouse architecture, data storage, metadata catalogs, and query compute engines are completely decoupled. This decoupling enables organizations to store their data once in an open format, catalog it in a centralized repository, and query it using the most efficient tool for each specific use case. AWS provides a powerful, native ecosystem for building such platforms, centered on Amazon Simple Storage Service (S3), the AWS Glue Catalog, and serverless query engines like AWS Athena.&lt;/p&gt;
&lt;p&gt;In this guide, we will explore how to design, configure, and operate an AWS-native open lakehouse using Apache Iceberg. We will walk through the configuration of IAM policies, directory structures, and the new Amazon S3 Tables storage class. We will also examine how to build tables, ingest data, and execute queries using AWS Athena, and then show how to achieve sub-second interactive query speeds by connecting a Dremio engine to the same Glue catalog.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;1. The Role of Apache Iceberg in AWS Lakehouses&lt;/h2&gt;
&lt;p&gt;Traditional data lakes on AWS relied on the Hive table format to organize files. Hive organized data into directory paths on S3, such as &lt;code&gt;s3://bucket/table/year=2026/month=05/&lt;/code&gt;. While this simple partition layout worked for basic batch jobs, it introduced significant performance bottlenecks and operational limitations as datasets scaled. In Hive, a query engine had to list all files in a directory to identify which datasets belonged to a table. For large tables with thousands of partitions, these file listing requests generated thousands of S3 API calls, causing high latency and throttling.&lt;/p&gt;
&lt;p&gt;Furthermore, Hive lacked ACID transaction support. If a write job failed halfway through, the tables were left in a corrupted, partially updated state. Schema evolution was also risky; renaming or dropping columns often required rewriting the entire dataset.&lt;/p&gt;
&lt;p&gt;Apache Iceberg solves these challenges by treating a table as a collection of files rather than a directory. Iceberg maintains a hierarchical tree of metadata files that track the exact state of the table at any point in time. This metadata structure provides several critical capabilities:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;ACID Transactions&lt;/strong&gt;: Writers create new metadata files representing a snapshot of the table. A catalog swaps the table pointer from the old metadata file to the new one in a single atomic transaction.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metadata-Based Query Planning&lt;/strong&gt;: Query engines do not list S3 directories. Instead, they read the Iceberg manifest files to identify the exact files needed for a query. This eliminates folder listings and minimizes S3 API request overhead.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition Evolution&lt;/strong&gt;: Iceberg tracks partition layouts as metadata. You can change your partitioning strategy (for instance, switching from daily to hourly partitioning) without rewriting existing data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema Evolution&lt;/strong&gt;: Column renames, additions, and type promotions are tracked in metadata, ensuring that schema modifications are instant and safe.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By placing Apache Iceberg at the center of an AWS data lake, organizations combine the low cost of S3 with the transactional reliability and performance of an enterprise data warehouse.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;2. Decoupled Architecture Components and Commit Orchestration&lt;/h2&gt;
&lt;p&gt;An AWS-native open lakehouse relies on three distinct layers that cooperate to process analytical queries.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;+-----------------------------------------------------------+
|                    QUERY COMPUTE LAYER                    |
|   +--------------------------+ +-----------------------+   |
|   |   AWS Athena (Serverless)| | Dremio Engine (Arrow) |   |
|   +--------------------------+ +-----------------------+   |
+-----------------------------+-----------------------------+
                              | (Read Metadata/Data)
                              v
+-----------------------------------------------------------+
|                      CATALOG LAYER                        |
|                  +-----------------------+                |
|                  |   AWS Glue Catalog    |                |
|                  +-----------------------+                |
+-----------------------------+-----------------------------+
                              | (Resolve Table Pointer)
                              v
+-----------------------------------------------------------+
|                      STORAGE LAYER                        |
|       +---------------------------------------------+     |
|       |                 Amazon S3                   |     |
|       |   (Standard Buckets / Amazon S3 Tables)     |     |
|       +---------------------------------------------+     |
+-----------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Amazon S3 and S3 Tables&lt;/h3&gt;
&lt;p&gt;Amazon S3 serves as the physical storage layer for data files (typically formatted as Parquet, ORC, or Avro) and Iceberg metadata files. Recently, AWS introduced Amazon S3 Tables, a specialized storage class designed specifically for tabular data.&lt;/p&gt;
&lt;p&gt;Standard S3 buckets are generic object stores that treat all files as isolated blobs. In contrast, Amazon S3 Tables are optimized to host open table formats like Apache Iceberg. S3 Tables natively manage table metadata, automate background maintenance operations, and offer up to a ten-fold increase in transaction rates compared to standard buckets. This storage class reduces the management overhead of manually maintaining tables while providing high-performance object access.&lt;/p&gt;
&lt;h3&gt;AWS Glue Catalog and Optimistic Concurrency Control&lt;/h3&gt;
&lt;p&gt;The catalog layer acts as the single source of truth for table identity and location. The AWS Glue Catalog is a managed, serverless metadata store that maintains schemas, partitions, and table definitions.&lt;/p&gt;
&lt;p&gt;For Apache Iceberg, the Glue Catalog stores a reference pointing to the current metadata JSON file of each table. When an engine writes to an Iceberg table, it writes new data and metadata files to S3, and then commits the write by instructing Glue to update the table pointer. Glue performs this pointer swap atomically.&lt;/p&gt;
&lt;p&gt;Behind the scenes, Iceberg uses Optimistic Concurrency Control (OCC) to coordinate transactions. When a transaction begins, the client engine reads the current table metadata pointer from the Glue Catalog and records the snapshot version. The client then writes new data files to S3 and creates new manifest files and metadata JSON files.&lt;/p&gt;
&lt;p&gt;During the commit phase, the client requests the Glue Catalog to perform an atomic compare-and-swap (CAS) operation. The catalog verifies whether the current table pointer in Glue matches the version the client read at the start of the transaction. If the version matches, Glue updates the pointer to the new metadata JSON file, and the transaction is committed.&lt;/p&gt;
&lt;p&gt;If another client has updated the table in the meantime, the version check fails. The committing client must abort the transaction, discard its temporary metadata files, reread the updated pointer, reconcile any non-conflicting changes (for instance, if the two transactions updated different partitions), and attempt the write again. This protocol guarantees transaction isolation without requiring physical database locks on the underlying files.&lt;/p&gt;
&lt;h3&gt;AWS Athena&lt;/h3&gt;
&lt;p&gt;AWS Athena is a serverless, interactive query engine based on Presto and Trino. Athena queries Iceberg tables directly on S3 using schemas defined in the Glue Catalog. Because Athena is serverless, you pay only for the data scanned by your queries. It is ideal for ad-hoc exploration, reporting, and building lightweight dashboards.&lt;/p&gt;
&lt;h3&gt;Dremio Engine&lt;/h3&gt;
&lt;p&gt;While Athena is excellent for ad-hoc queries, interactive business intelligence (BI) dashboards often require sub-second query responses. The Dremio engine is an open lakehouse query accelerator that integrates directly with the Glue Catalog and S3. Dremio bypasses the latency of standard object storage using an in-memory execution engine built on Apache Arrow, local metadata caching, and SQL Reflections. By pointing both Athena and Dremio to the same Glue Catalog, you can use Athena for batch transformations and ad-hoc queries, and Dremio for high-speed dashboarding and interactive analytics.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;3. Designing the Infrastructure: Security, Rate Limits, and Hashing&lt;/h2&gt;
&lt;p&gt;Before querying tables, we must configure IAM policies, directory structures, and storage access patterns to ensure secure, high-throughput operations.&lt;/p&gt;
&lt;h3&gt;IAM Policy Design and Action Descriptions&lt;/h3&gt;
&lt;p&gt;To interact with Iceberg tables via Athena and Glue, query engines require permissions to read and write data in S3, update metadata in Glue, and execute queries in Athena. Below is a secure, least-privilege IAM policy template designed for this architecture.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;Version&amp;quot;: &amp;quot;2012-10-17&amp;quot;,
  &amp;quot;Statement&amp;quot;: [
    {
      &amp;quot;Sid&amp;quot;: &amp;quot;S3BucketAccess&amp;quot;,
      &amp;quot;Effect&amp;quot;: &amp;quot;Allow&amp;quot;,
      &amp;quot;Action&amp;quot;: [
        &amp;quot;s3:GetObject&amp;quot;,
        &amp;quot;s3:PutObject&amp;quot;,
        &amp;quot;s3:DeleteObject&amp;quot;,
        &amp;quot;s3:ListBucket&amp;quot;
      ],
      &amp;quot;Resource&amp;quot;: [
        &amp;quot;arn:aws:s3:::my-lakehouse-bucket&amp;quot;,
        &amp;quot;arn:aws:s3:::my-lakehouse-bucket/*&amp;quot;
      ]
    },
    {
      &amp;quot;Sid&amp;quot;: &amp;quot;GlueCatalogAccess&amp;quot;,
      &amp;quot;Effect&amp;quot;: &amp;quot;Allow&amp;quot;,
      &amp;quot;Action&amp;quot;: [
        &amp;quot;glue:GetDatabase&amp;quot;,
        &amp;quot;glue:GetDatabases&amp;quot;,
        &amp;quot;glue:CreateDatabase&amp;quot;,
        &amp;quot;glue:GetTable&amp;quot;,
        &amp;quot;glue:GetTables&amp;quot;,
        &amp;quot;glue:CreateTable&amp;quot;,
        &amp;quot;glue:UpdateTable&amp;quot;,
        &amp;quot;glue:DeleteTable&amp;quot;
      ],
      &amp;quot;Resource&amp;quot;: [
        &amp;quot;arn:aws:glue:us-east-1:123456789012:catalog&amp;quot;,
        &amp;quot;arn:aws:glue:us-east-1:123456789012:database/analytics&amp;quot;,
        &amp;quot;arn:aws:glue:us-east-1:123456789012:table/analytics/*&amp;quot;
      ]
    },
    {
      &amp;quot;Sid&amp;quot;: &amp;quot;AthenaExecutionAccess&amp;quot;,
      &amp;quot;Effect&amp;quot;: &amp;quot;Allow&amp;quot;,
      &amp;quot;Action&amp;quot;: [
        &amp;quot;athena:StartQueryExecution&amp;quot;,
        &amp;quot;athena:GetQueryExecution&amp;quot;,
        &amp;quot;athena:GetQueryResults&amp;quot;,
        &amp;quot;athena:StopQueryExecution&amp;quot;
      ],
      &amp;quot;Resource&amp;quot;: &amp;quot;*&amp;quot;
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let us break down why specific actions are required:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;s3:GetObject&lt;/code&gt; and &lt;code&gt;s3:PutObject&lt;/code&gt; are necessary to retrieve and write Parquet files and Iceberg metadata files to the bucket.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;s3:DeleteObject&lt;/code&gt; is required for table maintenance, such as expiring old snapshots and removing orphan files that are no longer referenced by the metadata.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;s3:ListBucket&lt;/code&gt; allows the client to list objects within specific prefixes during validation checks or maintenance tasks.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;glue:GetTable&lt;/code&gt; and &lt;code&gt;glue:CreateTable&lt;/code&gt; allow query engines to resolve table schemas and locations, and write new table definitions.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;glue:UpdateTable&lt;/code&gt; is the critical action used during commit operations, enabling the atomic pointer swap that updates the table metadata location.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;S3 Cross-Account Access Policy Design&lt;/h3&gt;
&lt;p&gt;In many enterprise setups, storage is centralized in a dedicated security account, while compute engines run in separate analytics accounts. To allow query engines in Account A (ID: &lt;code&gt;111111111111&lt;/code&gt;) to access the bucket in Account B (ID: &lt;code&gt;222222222222&lt;/code&gt;), we must apply a cross-account bucket policy in Account B:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;Version&amp;quot;: &amp;quot;2012-10-17&amp;quot;,
  &amp;quot;Statement&amp;quot;: [
    {
      &amp;quot;Sid&amp;quot;: &amp;quot;CrossAccountAnalyticsAccess&amp;quot;,
      &amp;quot;Effect&amp;quot;: &amp;quot;Allow&amp;quot;,
      &amp;quot;Principal&amp;quot;: {
        &amp;quot;AWS&amp;quot;: &amp;quot;arn:aws:iam::111111111111:root&amp;quot;
      },
      &amp;quot;Action&amp;quot;: [
        &amp;quot;s3:GetObject&amp;quot;,
        &amp;quot;s3:PutObject&amp;quot;,
        &amp;quot;s3:DeleteObject&amp;quot;,
        &amp;quot;s3:ListBucket&amp;quot;
      ],
      &amp;quot;Resource&amp;quot;: [
        &amp;quot;arn:aws:s3:::my-lakehouse-bucket&amp;quot;,
        &amp;quot;arn:aws:s3:::my-lakehouse-bucket/*&amp;quot;
      ]
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This cross-account policy allows roles within the analytics account to read, write, and manage objects in the target bucket, provided the role in Account A also has matching IAM permissions.&lt;/p&gt;
&lt;h3&gt;S3 Directory Structure and Prefix Hashing&lt;/h3&gt;
&lt;p&gt;In standard S3 storage, high-throughput write applications can hit request rate limits. S3 supports up to 3,500 PUT/COPY/POST/DELETE requests and 5,500 GET/HEAD requests per second per partition prefix. If your data pipeline writes thousands of small files to a single partition folder, S3 may return HTTP 503 throttling errors.&lt;/p&gt;
&lt;p&gt;To avoid throttling, you should design your S3 directory layout to distribute writes across multiple prefixes. Traditional Hive directories concentrated all writes into a single deep path. In contrast, Iceberg allows you to configure object storage routing to distribute data files across multiple hashed prefixes.&lt;/p&gt;
&lt;p&gt;When object storage routing is enabled, Iceberg generates a hash value (such as a Murmur3 hash of the table and file names) and inserts it as a prefix in the file path. For instance, rather than writing all files to:
&lt;code&gt;s3://my-lakehouse-bucket/analytics/orders/data/order_date=2026-05-22/&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Iceberg can write files to paths that insert a hash value after the bucket root:
&lt;code&gt;s3://my-lakehouse-bucket/a8f9c1d2/analytics/orders/data/&lt;/code&gt;
&lt;code&gt;s3://my-lakehouse-bucket/3b7e8f9a/analytics/orders/data/&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;These hash prefixes instruct S3 to distribute the files across different physical storage partitions behind the scenes. This increases your aggregate throughput capacity, eliminating rate limit bottlenecks.&lt;/p&gt;
&lt;h3&gt;Deep Dive into Object Storage Optimization and S3 Tables&lt;/h3&gt;
&lt;p&gt;Standard S3 buckets are generic object stores that treat all files as isolated blobs. In contrast, Amazon S3 Tables are optimized to host open table formats like Apache Iceberg. S3 Tables natively manage table metadata, automate background maintenance operations, and offer up to a ten-fold increase in transaction rates compared to standard buckets.&lt;/p&gt;
&lt;p&gt;S3 Tables accomplish this optimization by removing the traditional directory simulation overhead. In standard S3, listing prefixes requires indexing large strings of text. S3 Tables organize metadata directly in a physical catalog layer managed by S3. Furthermore, AWS manages automated compaction background tasks for tables stored within S3 Tables, merging small files automatically without needing manual engineering orchestration or external scheduler jobs. This storage class reduces the management overhead of manually maintaining tables while providing high-performance object access.&lt;/p&gt;
&lt;h3&gt;Partitioning Layout Strategies&lt;/h3&gt;
&lt;p&gt;Choosing the right partitioning strategy is crucial to minimize query scanning costs. Iceberg features hidden partitioning, which means query engines automatically determine which partitions to scan based on query filters.&lt;/p&gt;
&lt;p&gt;We will use two standard table schemas for our examples:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;analytics.orders&lt;/code&gt; (fields: &lt;code&gt;order_id&lt;/code&gt;, &lt;code&gt;customer_id&lt;/code&gt;, &lt;code&gt;order_date&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt;, &lt;code&gt;amount&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;analytics.customers&lt;/code&gt; (fields: &lt;code&gt;customer_id&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;email&lt;/code&gt;, &lt;code&gt;state&lt;/code&gt;, &lt;code&gt;signup_date&lt;/code&gt;)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For &lt;code&gt;analytics.orders&lt;/code&gt;, partitioning by the day of the &lt;code&gt;order_date&lt;/code&gt; field is highly effective. Iceberg partitions the data internally using a date transform, avoiding the need to maintain a separate physical partition column. For &lt;code&gt;analytics.customers&lt;/code&gt;, partitioning by &lt;code&gt;state&lt;/code&gt; or the month of &lt;code&gt;signup_date&lt;/code&gt; is appropriate, depending on query distribution patterns.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;4. Athena DDL Setup, Parquet Structures, and Hidden Partitioning&lt;/h2&gt;
&lt;p&gt;We will use AWS Athena to create our database and register our Iceberg tables in the AWS Glue Catalog.&lt;/p&gt;
&lt;p&gt;First, we create the logical database. You can run this command directly in the Athena Query Editor:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE DATABASE IF NOT EXISTS analytics;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Parquet Internals and Compression&lt;/h3&gt;
&lt;p&gt;Before writing DDL statements, it is helpful to understand how Parquet storage interacts with Iceberg. Parquet is a columnar storage format that organizes data into Row Groups, Column Chunks, and Pages.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Row Groups&lt;/strong&gt;: Horizontal partitions of data within a single file. A typical row group contains between 100 megabytes and 1 gigabyte of data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Column Chunks&lt;/strong&gt;: Column-specific storage within a row group. Column chunks are read independently, which allows query engines to skip columns that are not referenced in the SQL query.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pages&lt;/strong&gt;: The smallest unit of data in Parquet, containing values, repetition levels, and definition levels. Pages are compressed and encoded individually.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By utilizing ZSTD compression, we achieve high compression ratios while retaining fast decompression speeds. ZSTD processes Parquet dictionary encodings and bit-packing arrays efficiently, allowing the CPU to read columns from S3 with minimal CPU cycle overhead.&lt;/p&gt;
&lt;h3&gt;Creating the Customers Table&lt;/h3&gt;
&lt;p&gt;Next, we create the &lt;code&gt;analytics.customers&lt;/code&gt; table. In Athena, you define an Iceberg table by appending &lt;code&gt;TBLPROPERTIES (&apos;table_type&apos;=&apos;ICEBERG&apos;)&lt;/code&gt; to the DDL statement.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE IF NOT EXISTS analytics.customers (
  customer_id STRING,
  name STRING,
  email STRING,
  state STRING,
  signup_date DATE
)
PARTITIONED BY (state)
LOCATION &apos;s3://my-lakehouse-bucket/analytics/customers/&apos;
TBLPROPERTIES (
  &apos;table_type&apos;=&apos;ICEBERG&apos;,
  &apos;format&apos;=&apos;parquet&apos;,
  &apos;write.format.default&apos;=&apos;parquet&apos;,
  &apos;write.parquet.compression-codec&apos;=&apos;zstd&apos;,
  &apos;history.expire.max-snapshot-age-ms&apos;=&apos;604800000&apos;,
  &apos;history.expire.min-snapshots-to-keep&apos;=&apos;5&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let us examine the table properties configured in this DDL statement:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;&apos;table_type&apos;=&apos;ICEBERG&apos;&lt;/code&gt;: Instructs Athena to write this table using the Apache Iceberg format rather than standard Glue/Hive format.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&apos;write.format.default&apos;=&apos;parquet&apos;&lt;/code&gt;: Sets Parquet as the default file format for all data writes.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&apos;write.parquet.compression-codec&apos;=&apos;zstd&apos;&lt;/code&gt;: Configures ZSTD compression, which offers an excellent balance between compression ratios and decompression speeds.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&apos;history.expire.max-snapshot-age-ms&apos;=&apos;604800000&apos;&lt;/code&gt;: Sets the maximum snapshot age to seven days. Snapshots older than this limit are marked for expiration to conserve storage.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&apos;history.expire.min-snapshots-to-keep&apos;=&apos;5&apos;&lt;/code&gt;: Guarantees that at least five historical snapshots are retained, ensuring you can perform time travel queries even if data is updated frequently.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Creating the Orders Table&lt;/h3&gt;
&lt;p&gt;Now, we create the &lt;code&gt;analytics.orders&lt;/code&gt; table. For this table, we will partition the data using the &lt;code&gt;day&lt;/code&gt; transform on the &lt;code&gt;order_date&lt;/code&gt; column.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE IF NOT EXISTS analytics.orders (
  order_id STRING,
  customer_id STRING,
  order_date DATE,
  status STRING,
  amount DOUBLE
)
PARTITIONED BY (day(order_date))
LOCATION &apos;s3://my-lakehouse-bucket/analytics/orders/&apos;
TBLPROPERTIES (
  &apos;table_type&apos;=&apos;ICEBERG&apos;,
  &apos;format&apos;=&apos;parquet&apos;,
  &apos;write.format.default&apos;=&apos;parquet&apos;,
  &apos;write.parquet.compression-codec&apos;=&apos;zstd&apos;,
  &apos;history.expire.max-snapshot-age-ms&apos;=&apos;604800000&apos;,
  &apos;history.expire.min-snapshots-to-keep&apos;=&apos;5&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;The Power of Hidden Partitioning&lt;/h3&gt;
&lt;p&gt;By utilizing &lt;code&gt;day(order_date)&lt;/code&gt;, we instruct Iceberg to automatically group records by day. In a legacy Hive table, you had to define a virtual partition column, and queries had to explicitly filter on that column (for example, &lt;code&gt;WHERE order_date_partition = &apos;2026-05-22&apos;&lt;/code&gt;) to avoid scanning the entire dataset. If a developer forgot to include the partition column filter, the query scanned the whole table, resulting in high query costs and slow execution.&lt;/p&gt;
&lt;p&gt;Iceberg&apos;s hidden partitioning decouples physical partitioning from logical query structure. Because Iceberg tracks partition boundaries in its metadata manifest files, a user simply queries the logical table (for instance, &lt;code&gt;WHERE order_date = CAST(&apos;2026-05-22&apos; AS DATE)&lt;/code&gt;). The query engine inspects the manifest files, translates the date filter into partition boundaries, and prunes non-matching files automatically. This guarantees efficient queries without placing the optimization burden on the dashboard designer or application developer.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;5. Ingestion Pipelines: SQL DML and PySpark Integration&lt;/h2&gt;
&lt;p&gt;Once the tables are created, we can populate them using SQL INSERT statements or programmatically using PySpark.&lt;/p&gt;
&lt;h3&gt;Populating Tables using Athena SQL&lt;/h3&gt;
&lt;p&gt;Let us load initial customer records into the &lt;code&gt;analytics.customers&lt;/code&gt; table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO analytics.customers VALUES
  (&apos;C001&apos;, &apos;Alice Smith&apos;, &apos;alice@example.com&apos;, &apos;NY&apos;, CAST(&apos;2026-01-15&apos; AS DATE)),
  (&apos;C002&apos;, &apos;Bob Jones&apos;, &apos;bob@example.com&apos;, &apos;CA&apos;, CAST(&apos;2026-02-20&apos; AS DATE)),
  (&apos;C003&apos;, &apos;Charlie Brown&apos;, &apos;charlie@example.com&apos;, &apos;TX&apos;, CAST(&apos;2026-03-10&apos; AS DATE)),
  (&apos;C004&apos;, &apos;Diana Prince&apos;, &apos;diana@example.com&apos;, &apos;NY&apos;, CAST(&apos;2026-04-05&apos; AS DATE)),
  (&apos;C005&apos;, &apos;Evan Wright&apos;, &apos;evan@example.com&apos;, &apos;CA&apos;, CAST(&apos;2026-05-12&apos; AS DATE));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we ingest records into the &lt;code&gt;analytics.orders&lt;/code&gt; table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO analytics.orders VALUES
  (&apos;O101&apos;, &apos;C001&apos;, CAST(&apos;2026-05-20&apos; AS DATE), &apos;COMPLETED&apos;, 150.50),
  (&apos;O102&apos;, &apos;C002&apos;, CAST(&apos;2026-05-20&apos; AS DATE), &apos;PENDING&apos;, 99.99),
  (&apos;O103&apos;, &apos;C001&apos;, CAST(&apos;2026-05-21&apos; AS DATE), &apos;COMPLETED&apos;, 45.00),
  (&apos;O104&apos;, &apos;C003&apos;, CAST(&apos;2026-05-21&apos; AS DATE), &apos;SHIPPED&apos;, 250.00),
  (&apos;O105&apos;, &apos;C004&apos;, CAST(&apos;2026-05-22&apos; AS DATE), &apos;COMPLETED&apos;, 300.00),
  (&apos;O106&apos;, &apos;C002&apos;, CAST(&apos;2026-05-22&apos; AS DATE), &apos;CANCELLED&apos;, 15.75);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When these statements execute, Athena writes Parquet data files to S3, writes a new metadata JSON file, and updates the table pointer in the Glue Catalog.&lt;/p&gt;
&lt;h3&gt;Programmatic Ingest using PySpark&lt;/h3&gt;
&lt;p&gt;In enterprise environments, data is regularly ingested from streaming pipelines or large ETL batch systems using Apache Spark. To write to our Iceberg tables in Glue via Spark, you must configure the Spark session to use the Iceberg Spark runtime catalog, pointing it to the AWS Glue Catalog implementation.&lt;/p&gt;
&lt;p&gt;Below is the PySpark initialization script required to connect Spark to the Glue Catalog:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from pyspark.sql import SparkSession

# Initialize Spark Session with Glue Catalog and Iceberg Configurations
spark = SparkSession.builder \
    .appName(&amp;quot;LakehouseIngestionPipeline&amp;quot;) \
    .config(&amp;quot;spark.sql.extensions&amp;quot;, &amp;quot;org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.glue_catalog&amp;quot;, &amp;quot;org.apache.iceberg.spark.SparkCatalog&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.glue_catalog.catalog-impl&amp;quot;, &amp;quot;org.apache.iceberg.aws.glue.GlueCatalog&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.glue_catalog.warehouse&amp;quot;, &amp;quot;s3://my-lakehouse-bucket/warehouse/&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.glue_catalog.io-impl&amp;quot;, &amp;quot;org.apache.iceberg.aws.s3.S3FileIO&amp;quot;) \
    .getOrCreate()

# Create sample order update dataframe
updated_orders_data = [
    (&amp;quot;O102&amp;quot;, &amp;quot;C002&amp;quot;, &amp;quot;2026-05-20&amp;quot;, &amp;quot;COMPLETED&amp;quot;, 99.99),
    (&amp;quot;O107&amp;quot;, &amp;quot;C005&amp;quot;, &amp;quot;2026-05-22&amp;quot;, &amp;quot;COMPLETED&amp;quot;, 120.00)
]

columns = [&amp;quot;order_id&amp;quot;, &amp;quot;customer_id&amp;quot;, &amp;quot;order_date&amp;quot;, &amp;quot;status&amp;quot;, &amp;quot;amount&amp;quot;]
df_updates = spark.createDataFrame(updated_orders_data, schema=columns)
df_updates = df_updates.withColumn(&amp;quot;order_date&amp;quot;, df_updates[&amp;quot;order_date&amp;quot;].cast(&amp;quot;date&amp;quot;))

# Register update dataframe as a temporary view
df_updates.createOrReplaceTempView(&amp;quot;orders_updates&amp;quot;)

# Execute a MERGE INTO operation using Spark SQL
spark.sql(&amp;quot;&amp;quot;&amp;quot;
    MERGE INTO glue_catalog.analytics.orders t
    USING orders_updates s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN
      UPDATE SET t.status = s.status, t.amount = s.amount
    WHEN NOT MATCHED THEN
      INSERT (order_id, customer_id, order_date, status, amount)
      VALUES (s.order_id, s.customer_id, s.order_date, s.status, s.amount)
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This PySpark script uses the &lt;code&gt;GlueCatalog&lt;/code&gt; class to manage table state. The &lt;code&gt;MERGE INTO&lt;/code&gt; operation executes as an atomic transaction. If the merge succeeds, Spark commits the update to Glue, and the changes are instantly visible to all other engines querying the catalog.&lt;/p&gt;
&lt;h3&gt;Glue Catalog Lock Implementations&lt;/h3&gt;
&lt;p&gt;To handle high-concurrency writes, Spark applications must configure catalog locks to prevent two engines from overlapping pointer update requests. Under the hood, Glue catalog connection configuration options allow you to specify the lock implementation. By default, you can configure DynamoDB-based locking or use Glue&apos;s native transactional update API:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Configure DynamoDB-based transactional locking for Glue Catalog
spark.conf.set(&amp;quot;spark.sql.catalog.glue_catalog.lock-impl&amp;quot;, &amp;quot;org.apache.iceberg.aws.glue.DynamoDbLockManager&amp;quot;)
spark.conf.set(&amp;quot;spark.sql.catalog.glue_catalog.lock.table&amp;quot;, &amp;quot;iceberg_lock_table&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;DynamoDbLockManager&lt;/code&gt; creates a DynamoDB table named &lt;code&gt;iceberg_lock_table&lt;/code&gt; to coordinate lock acquisitions. When Spark attempts to swap the table pointer in Glue, it first acquires a row lock in the DynamoDB table, performs the update, and then releases the lock. This prevents collision issues when dozens of spark workers attempt concurrent transactions on the same Iceberg table.&lt;/p&gt;
&lt;h3&gt;Running Analytical Queries&lt;/h3&gt;
&lt;p&gt;We can run complex SQL queries that join these tables to generate customer purchase summaries. For instance, the following query computes the total amount spent by customers in each state:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Calculate total order amount by state */
SELECT
  c.state,
  COUNT(o.order_id) AS total_orders,
  SUM(o.amount) AS total_revenue
FROM analytics.orders o
JOIN analytics.customers c ON o.customer_id = c.customer_id
WHERE o.status != &apos;CANCELLED&apos;
GROUP BY c.state
ORDER BY total_revenue DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Athena reads the manifest files for both tables to locate the exact Parquet files that correspond to active records. It then reads only the required columns (&lt;code&gt;customer_id&lt;/code&gt;, &lt;code&gt;state&lt;/code&gt;, &lt;code&gt;order_id&lt;/code&gt;, &lt;code&gt;amount&lt;/code&gt;, and &lt;code&gt;status&lt;/code&gt;), ignoring unrelated fields like email addresses. This column pruning reduces the volume of data read from S3, speeding up queries and lowering scanning costs.&lt;/p&gt;
&lt;h3&gt;Executing Time Travel Queries&lt;/h3&gt;
&lt;p&gt;Because Iceberg maintains a history of snapshots, we can query previous states of a table. Suppose we update a record in the &lt;code&gt;analytics.orders&lt;/code&gt; table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;UPDATE analytics.orders
SET status = &apos;COMPLETED&apos;
WHERE order_id = &apos;O102&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can query the current state of the table to confirm that the update succeeded:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* View current status of order O102 */
SELECT order_id, status
FROM analytics.orders
WHERE order_id = &apos;O102&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To see what the order looked like before the update, we can query a previous snapshot. In Athena, you can view the snapshot history of an Iceberg table using the system metadata tables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Retrieve snapshot history for the orders table */
SELECT snapshot_id, committed_at, parent_id, operation
FROM &amp;quot;analytics&amp;quot;.&amp;quot;orders$snapshots&amp;quot;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once we identify the snapshot ID that corresponds to the initial state, we can query that snapshot directly:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Query the orders table as of a specific snapshot */
SELECT order_id, status
FROM analytics.orders FOR SYSTEM_VERSION_AS_OF 1234567890123456789
WHERE order_id = &apos;O102&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Replace &lt;code&gt;1234567890123456789&lt;/code&gt; with the actual snapshot ID from your snapshot history metadata query. The query returns &lt;code&gt;&apos;PENDING&apos;&lt;/code&gt;, demonstrating that Iceberg can access historical states of the dataset without requiring you to maintain complex manual backups.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;6. Accelerating Queries in Dremio&lt;/h2&gt;
&lt;p&gt;AWS Athena is an outstanding serverless engine for ad-hoc queries, but it is not designed to support high-concurrency, sub-second applications like real-time BI dashboards. Athena queries typically take several seconds to plan and execute due to serverless cold starts, catalog lookup overhead, and S3 latency.&lt;/p&gt;
&lt;p&gt;To achieve sub-second execution speeds, organizations integrate a Dremio engine with their AWS Glue Catalog and S3 storage.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;                                  +-----------------------+
                                  |    User SQL Query     |
                                  +-----------+-----------+
                                              |
                                              v
                                  +-----------------------+
                                  |   Dremio Coordinator  |
                                  |  (Local Metadata Cache|
                                  |   &amp;amp; Calcite Planner)  |
                                  +-----------+-----------+
                                              |
                                     (Query Rewrite Match?)
                                     /                     \
                                   Yes                      No
                                   /                         \
                                  v                           v
                      +----------------------+     +----------------------+
                      |   Data Reflections   |     |    Read Base Table   |
                      |  (Pre-aggregated /   |     |     via Arrow        |
                      |   Pre-computed Join) |     |  Vectorized Engine   |
                      +----------------------+     +----------------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Connecting Dremio to the Glue Catalog&lt;/h3&gt;
&lt;p&gt;Dremio provides a native connector for the AWS Glue Catalog. To connect Dremio:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Open the Dremio administrator console.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Source&lt;/strong&gt; in the bottom-left corner and select &lt;strong&gt;AWS Glue Catalog&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Enter a name for the source (for example, &lt;code&gt;glue_catalog&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Configure the authentication method. You can use AWS Access Keys or configure Dremio to assume an IAM Role.&lt;/li&gt;
&lt;li&gt;Specify the AWS Region where your Glue Catalog resides (for example, &lt;code&gt;us-east-1&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Under the S3 storage configuration, provide your S3 bucket path (&lt;code&gt;s3://my-lakehouse-bucket/&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Save the configuration.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Once connected, Dremio scans the Glue Catalog and displays the &lt;code&gt;analytics&lt;/code&gt; database along with the &lt;code&gt;orders&lt;/code&gt; and &lt;code&gt;customers&lt;/code&gt; tables in its workspace tree. You can immediately run high-performance queries across these tables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Join query executed in Dremio */
SELECT
  c.name,
  o.order_date,
  o.amount
FROM glue_catalog.analytics.orders o
JOIN glue_catalog.analytics.customers c ON o.customer_id = c.customer_id
WHERE o.status = &apos;COMPLETED&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Why the Dremio Engine is Faster&lt;/h3&gt;
&lt;p&gt;The Dremio engine uses several architectural optimizations to execute queries faster than standard query engines.&lt;/p&gt;
&lt;h4&gt;1. Apache Arrow Vectorized Execution&lt;/h4&gt;
&lt;p&gt;The Dremio engine processes data using Apache Arrow as its internal memory format. Apache Arrow organizes data in memory column-by-column rather than row-by-row. When Dremio reads Parquet files from S3, it loads the column arrays directly into memory without performing expensive row-to-column serialization and deserialization.&lt;/p&gt;
&lt;p&gt;By executing query operations directly on memory column arrays, Dremio maximizes CPU cache efficiency and utilizes SIMD (Single Instruction, Multiple Data) instructions to process multiple data values in parallel.&lt;/p&gt;
&lt;h4&gt;2. Local Coordinator Metadata Caching&lt;/h4&gt;
&lt;p&gt;When a query engine plans a query, it must retrieve the table&apos;s schema and locate the physical data files. For Iceberg, this requires reading metadata JSON files, manifest lists, and manifest files. Doing this for every query adds latency, especially when communicating with remote catalogs and object stores.&lt;/p&gt;
&lt;p&gt;The Dremio engine solves this by caching Iceberg metadata on its local coordinator nodes. When a new query is submitted, Dremio checks the Glue Catalog to see if the table&apos;s current metadata pointer has changed. If the pointer has not changed, Dremio plans the query using the cached metadata, bypassing S3 network requests. This local caching reduces planning times from seconds to milliseconds.&lt;/p&gt;
&lt;h4&gt;3. Positional Delete File Caching&lt;/h4&gt;
&lt;p&gt;In Iceberg tables that use the Merge-on-Read (MoR) write strategy, updates and deletes are written to separate delete files rather than rewriting the base Parquet files. When reading the table, query engines must merge these delete files with the base files to filter out deleted rows. Loading and parsing delete files for every query scan adds substantial overhead.&lt;/p&gt;
&lt;p&gt;Dremio accelerates this process by caching positional delete files in memory. Rather than reading the delete files from S3 for every query, the engine maintains an active cache of deleted row indexes, applying them to base data scans at memory speed.&lt;/p&gt;
&lt;h4&gt;4. Data Reflections and the Apache Calcite Optimizer&lt;/h4&gt;
&lt;p&gt;The most powerful acceleration feature in Dremio is Data Reflections. Reflections are pre-computed physical layouts of tables or joins that are stored as optimized Parquet files on S3. They are similar to materialized views, but with a critical difference: users do not query Reflections directly. Instead, they query the logical tables, and the Dremio optimizer automatically rewrites the query to use the Reflection.&lt;/p&gt;
&lt;p&gt;Dremio uses Apache Calcite to parse incoming SQL queries into logical algebra trees. The optimizer then applies algebraic transformation rules to determine if a query can be satisfied by reading an active Reflection.&lt;/p&gt;
&lt;p&gt;Calcite&apos;s query rewriter performs advanced tree matching. It matches projections, selection filters, and aggregations. Even if a user query does not exactly match the Reflection structure (for instance, if the query requests a subset of the fields or applies a filter that can be evaluated on top of the Reflection&apos;s data), Calcite rewrites the execution plan to use the Reflection.&lt;/p&gt;
&lt;p&gt;For example, we can create an Aggregation Reflection on our joined orders and customers dataset:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Create an aggregation reflection for order analysis */
ALTER TABLE glue_catalog.analytics.orders
ADD REFLECTION state_revenue_summary
USING AGGREGATION
DIMENSIONS (customer_id, order_date, status)
MEASURES (amount);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When a user executes the query to calculate revenue by state:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT c.state, SUM(o.amount)
FROM glue_catalog.analytics.orders o
JOIN glue_catalog.analytics.customers c ON o.customer_id = c.customer_id
GROUP BY c.state;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The Dremio optimizer analyzes the query plan, matches it against the &lt;code&gt;state_revenue_summary&lt;/code&gt; Reflection, and rewrites the query execution plan to read the pre-computed summary. This avoids scanning millions of raw rows, returning the result in milliseconds.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;7. Operational Best Practices and Compaction Mechanics&lt;/h2&gt;
&lt;p&gt;To maintain a healthy open lakehouse on AWS, you should implement the following operational patterns.&lt;/p&gt;
&lt;h3&gt;The Small Files Problem in Detail&lt;/h3&gt;
&lt;p&gt;As new records are added to Iceberg tables via streaming ingest or frequent small batch jobs, the number of small files on S3 can multiply rapidly. This is known as the &amp;quot;small files problem.&amp;quot; A query engine reading a table with thousands of tiny files spends more time opening and closing S3 files than reading data.&lt;/p&gt;
&lt;p&gt;In S3, each GET request introduces a small connection setup latency. If a query scans 10,000 files of 10 KB each, it must perform 10,000 GET requests, resulting in substantial network delay. If those same records are compacted into a single 100 MB Parquet file, the engine makes a single GET request, reading the data at maximum network speed.&lt;/p&gt;
&lt;h3&gt;Compaction Execution&lt;/h3&gt;
&lt;p&gt;You should configure automated compaction routines using Spark or Athena to merge small files into larger, optimized Parquet files (typically 128 MB to 512 MB). Athena allows you to run compaction on Iceberg tables using the &lt;code&gt;OPTIMIZE&lt;/code&gt; command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Run bin-packing compaction on the orders table */
OPTIMIZE analytics.orders WRITE_PROPERTIES (&apos;vacuum_max_metadata_files_to_keep&apos;=&apos;10&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This merges the small Parquet files in active partitions into larger files, improving read speeds.&lt;/p&gt;
&lt;p&gt;For large enterprise datasets, you can perform more advanced compaction routines using Spark SQL procedures. The &lt;code&gt;rewrite_data_files&lt;/code&gt; procedure allows you to configure sort strategies, such as Z-Ordering, to group related data spatially:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Execute Spark compaction with Z-Ordering
spark.sql(&amp;quot;&amp;quot;&amp;quot;
    CALL glue_catalog.system.rewrite_data_files(
      table =&amp;gt; &apos;analytics.orders&apos;,
      strategy =&amp;gt; &apos;sort&apos;,
      sort_order =&amp;gt; &apos;customer_id, order_date&apos;
    )
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This procedure reorganizes files on S3 so that records with similar customer IDs and order dates are stored in the same Parquet files, maximizing row group skipping effectiveness.&lt;/p&gt;
&lt;h3&gt;Expiring Snapshots&lt;/h3&gt;
&lt;p&gt;While Iceberg&apos;s snapshot history is valuable for time travel and auditing, retaining every snapshot indefinitely increases your storage costs. Every snapshot references data files that may have been deleted or updated.&lt;/p&gt;
&lt;p&gt;To prevent storage bloat, you must regularly run snapshot expiration. Athena provides procedures to expire historical snapshots:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Expire snapshots older than seven days */
ALTER TABLE analytics.orders EXECUTE EXPIRE_SNAPSHOTS(CAST(current_date - interval &apos;7&apos; day AS TIMESTAMP));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command deletes older metadata snapshots and permanently removes unreferenced data files from S3, lowering storage costs.&lt;/p&gt;
&lt;h3&gt;Monitoring S3 Request Rates&lt;/h3&gt;
&lt;p&gt;Even with hashed prefixes, you should monitor your bucket metrics in Amazon CloudWatch. Track &lt;code&gt;5xx&lt;/code&gt; error rates and S3 request statistics. If you see elevated &lt;code&gt;503 Throttling&lt;/code&gt; errors, check that your partitioning strategy is not grouping too many concurrent writes into a single folder, and ensure that Iceberg&apos;s object storage routing features are enabled.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;8. Summary&lt;/h2&gt;
&lt;p&gt;Building an open lakehouse on AWS using Apache Iceberg, the AWS Glue Catalog, and S3 provides a reliable, cost-efficient, and scalable foundation for enterprise data platforms. By separating computing engines from storage, you can select the best tool for every query workload:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;AWS Athena&lt;/strong&gt; for serverless, ad-hoc queries, automated data transformations, and exploratory data analysis.&lt;/li&gt;
&lt;li&gt;Use the &lt;strong&gt;Dremio engine&lt;/strong&gt; to deliver sub-second interactive query performance for BI dashboards and high-concurrency applications.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Through features like Apache Arrow vectorized execution, local metadata coordinator caching, and Apache Calcite-powered Data Reflections, Dremio eliminates object storage latency, allowing you to run interactive analytical queries directly on your open data lakehouse.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Apache Iceberg Catalogs Explained: REST, Glue, Hive Metastore, Polaris, Nessie, and Snowflake</title><link>https://iceberglakehouse.com/posts/2026-05-22-apache-iceberg-catalogs-explained/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-22-apache-iceberg-catalogs-explained/</guid><description>
In a modern database architecture, data files are typically managed by a monolithic server that controls storage, query planning, metadata, and secur...</description><pubDate>Fri, 22 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In a modern database architecture, data files are typically managed by a monolithic server that controls storage, query planning, metadata, and security. However, the modern open lakehouse paradigm decouples query computing engines from physical storage. Organizations regularly query the same physical dataset using multiple engines, such as Apache Spark for heavy ETL transformation, Trino for interactive dashboard queries, AWS Athena for ad-hoc queries, and Dremio for high performance analytics.&lt;/p&gt;
&lt;p&gt;Decoupling storage and compute introduces a fundamental coordinator problem. If multiple computing engines are querying and writing to the same set of Parquet files in an object store like Amazon S3, how do they agree on the current state of a table? How do they prevent concurrent writes from corrupting data? How do they track historical snapshots for time travel queries without expensive directory list operations?&lt;/p&gt;
&lt;p&gt;Apache Iceberg solves these issues by maintaining a hierarchical tree of metadata. The root of this metadata structure is the table metadata file, which is a JSON document containing the schema, partition rules, and snapshot histories of the table. However, the query engines still need a mechanism to find the location of the active metadata JSON file. Because cloud object stores do not support atomic file replacement or native locking, engines cannot safely write metadata files concurrently. This coordinator role is fulfilled by the Apache Iceberg catalog.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;1. The Historical Context: The Limitations of Legacy Catalogs&lt;/h2&gt;
&lt;p&gt;To understand the value of modern Apache Iceberg catalogs, it is necessary to examine how older data lake systems managed table metadata. The legacy standard for data lake catalogs was the Hive Metastore, which was originally developed for Apache Hive.&lt;/p&gt;
&lt;h3&gt;How the Hive Metastore Tracked Tables&lt;/h3&gt;
&lt;p&gt;In a Hive-style table, the database catalog did not track individual data files or transactional snapshots. Instead, it tracked tables and partitions as physical directories in a file system. For example, if you had a table named &lt;code&gt;analytics.orders&lt;/code&gt; partitioned by order date, the catalog would store a mapping in a relational database, such as MySQL or PostgreSQL, indicating that the table was located at &lt;code&gt;hdfs:///user/hive/warehouse/analytics.orders&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;When new data was added to a partition, files were written directly to the directory corresponding to that partition, for instance, &lt;code&gt;hdfs:///user/hive/warehouse/analytics.orders/order_date=2026-05-22/&lt;/code&gt;. The query engine determined which files were part of the table by listing all the files in that directory.&lt;/p&gt;
&lt;h3&gt;HDFS vs Cloud Object Storage: The Technical Disconnect&lt;/h3&gt;
&lt;p&gt;The Hive Metastore design was optimized for the Hadoop Distributed File System (HDFS). On HDFS, folder structures are true physical directories managed by an active NameNode. The NameNode acts as an in-memory transactional coordinator that manages block allocations and directory paths. Consequently, listing files in a folder or renaming a directory are fast metadata operations because they are executed as single memory state swaps in the NameNode.&lt;/p&gt;
&lt;p&gt;When data systems migrated to cloud object storage like Amazon S3, Google Cloud Storage, or Azure Data Lake Storage, this directory-based tracking model encountered serious technical issues:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Prefix Scans and Rate Limits:&lt;/strong&gt; S3 is not a hierarchical file system; it is a key-value store. S3 paths like &lt;code&gt;s3://bucket/folder/file.parquet&lt;/code&gt; are flat string keys. To simulate directory listing, query engines must execute prefix scans. S3 restricts prefix scans (typically to 1,000 keys per request). Querying a table with thousands of partitions requires hundreds of sequential API calls. This network overhead degrades performance, causing query planning to consume substantial CPU cycles before actual data reading begins.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Directory Rename Bottleneck:&lt;/strong&gt; Renaming a directory on a local system is a metadata pointer swap. In cloud object storage, there is no directory pointer. Renaming a directory requires copying every object to a new key and then deleting the old key. If an engine attempts to rename a directory to commit a transaction and the network drops midway, the table is left in an inconsistent state, with some data residing under the old prefix and some under the new prefix.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No Transaction Isolation:&lt;/strong&gt; Since legacy systems rely on file system visibility, any file written to a partition folder is instantly read by active queries. If an ETL job is appending files while a dashboard is running, the dashboard query reads partial data, leading to inconsistent analytics results.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apache Iceberg eliminates these limitations by shifting the physical file tracking from the directory level to the file level. Rather than listing prefixes, query engines read explicit lists of files stored in Iceberg&apos;s metadata JSON documents. The catalog&apos;s role is simplified: it no longer tracks directories; it only tracks the current location of the root table metadata JSON file.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;2. The Core Functions of the Iceberg Catalog&lt;/h2&gt;
&lt;p&gt;The Apache Iceberg catalog is the single source of truth for the physical location of an Iceberg table. In a decoupled open lakehouse, it serves three critical architectural functions: state management, atomic transaction commits, and multi-engine coordination.&lt;/p&gt;
&lt;h3&gt;Central State Manager&lt;/h3&gt;
&lt;p&gt;The catalog functions as a lookup service. An Iceberg table is identified by a logical path, such as &lt;code&gt;analytics.orders&lt;/code&gt;. The catalog maps this path to the URI of the current table metadata JSON file.&lt;/p&gt;
&lt;p&gt;When a query engine starts planning a query, it contacts the catalog and requests this metadata URI. The engine then reads the JSON file from storage and extracts the location of manifest lists and data files. Because it relies on the catalog for metadata pointers, the engine does not perform prefix scans, ensuring fast query planning regardless of table size.&lt;/p&gt;
&lt;h3&gt;Locking Coordinator and Transaction Commit&lt;/h3&gt;
&lt;p&gt;The catalog is critical during write transactions. To commit a transaction, an engine must update the table pointer from the old metadata JSON file to a new metadata JSON file. If two engines try to write to the same table concurrently, they must perform this pointer swap atomically.&lt;/p&gt;
&lt;p&gt;The catalog acts as the locking coordinator to guarantee atomicity. It ensures that only one write succeeds. The pointer swap must be an all-or-nothing operation. If the pointer swap is successful, the transaction is committed, and the new snapshot becomes visible to downstream readers. If another engine has successfully committed a transaction in the millisecond interval between the reader&apos;s check and the write commit, the catalog rejects the second swap. The catalog returns a conflict error, prompting the second engine to reload the updated metadata, resolve conflicts, and try the commit again.&lt;/p&gt;
&lt;h3&gt;Multi-Engine Coordination&lt;/h3&gt;
&lt;p&gt;In an open lakehouse, multiple engines use different programming environments and platforms. For instance, a Spark engine written in Java and a PyIceberg tool written in Python must be able to read and write to the same tables. The catalog acts as a translator, allowing these diverse clients to communicate with a unified system. It translates logical table names like &lt;code&gt;analytics.customers&lt;/code&gt; into the physical directories where files reside, ensuring consistent access across different platforms.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;3. Deep Dive into Catalog Implementations&lt;/h2&gt;
&lt;p&gt;There are several types of Iceberg catalogs. The choice of catalog determines how tables are locked, which engines can write to them, and how metadata is stored. Let us review the principal options in detail.&lt;/p&gt;
&lt;h3&gt;The REST Catalog (The Open Standard)&lt;/h3&gt;
&lt;p&gt;The Apache Iceberg REST Catalog is the standard protocol for catalog operations. Unlike other catalogs that depend on specific client libraries, the REST catalog defines a standard set of JSON payloads and HTTP endpoints.&lt;/p&gt;
&lt;p&gt;Instead of writing database-specific drivers, engines use standard HTTP clients to interact with a REST catalog server. The server manages the backend storage database and coordinates the lock. This model isolates engines from database configuration, security credentials, and storage location details.&lt;/p&gt;
&lt;h3&gt;AWS Glue Catalog&lt;/h3&gt;
&lt;p&gt;The AWS Glue Catalog is a managed metadata store provided by AWS. It is a common choice for architectures deployed entirely on Amazon Web Services. When using AWS Glue, Iceberg uses Glue&apos;s catalog API to store the location of the metadata JSON file as a table parameter.&lt;/p&gt;
&lt;p&gt;Glue handles transaction coordination using DynamoDB or internal transaction locks during pointer updates. The principal benefit of AWS Glue is that it is a serverless, zero maintenance service integrated with AWS IAM, Amazon Athena, and AWS Glue ETL jobs. However, accessing Glue outside of AWS requires configuring IAM credentials, and API rate limits can become an issue when executing thousands of concurrent writes.&lt;/p&gt;
&lt;h3&gt;Project Nessie (Git-for-Data)&lt;/h3&gt;
&lt;p&gt;Project Nessie is an open source transactional catalog designed for lakehouses. It brings Git-like version control concepts to data tables. Nessie allows users to create branches, merge changes, tag specific commits, and roll back tables to historical configurations.&lt;/p&gt;
&lt;p&gt;Nessie achieves this by tracking catalog references in a versioned key-value store, such as PostgreSQL or RocksDB. When you commit a transaction, Nessie records a commit in its commit tree, pointing to the new table metadata JSON. This architecture enables multi-table transactions. For example, you can write changes to &lt;code&gt;analytics.orders&lt;/code&gt; and &lt;code&gt;analytics.customers&lt;/code&gt; inside a &lt;code&gt;staging&lt;/code&gt; branch, and then merge the branch into &lt;code&gt;main&lt;/code&gt; in a single transaction.&lt;/p&gt;
&lt;h3&gt;Polaris Catalog&lt;/h3&gt;
&lt;p&gt;Polaris is an open source REST catalog framework designed for multi-engine metadata management. Built on the Apache Iceberg REST specification, Polaris provides fine-grained role-based access control (RBAC) and credential vending across multiple clouds.&lt;/p&gt;
&lt;p&gt;Polaris separates catalog administration from database access. It allows data platform administrators to define access policies once in Polaris, and then apply those rules across Spark, Snowflake, and Dremio. Because it implements the REST spec, Polaris works with any client engine that supports the REST catalog standard.&lt;/p&gt;
&lt;h3&gt;Snowflake Catalog Integration&lt;/h3&gt;
&lt;p&gt;Snowflake allows users to create Iceberg tables where Snowflake acts as either the table writer or a read-only reader. When Snowflake manages the catalog, it handles metadata generation and pointer swaps internally.&lt;/p&gt;
&lt;p&gt;Downstream engines like Spark can query Snowflake managed Iceberg tables by reading from Snowflake&apos;s external catalog sync service. This integration is useful for organizations that use Snowflake for data warehousing but want to preserve open file access for external computing systems.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;4. Deep Dive into the Iceberg REST Catalog API Specification&lt;/h2&gt;
&lt;p&gt;To appreciate the design of the REST Catalog, we must look at the specific API interactions defined by the Apache Iceberg REST specification. When an engine connects to a REST catalog, it utilizes a set of standard REST resource endpoints.&lt;/p&gt;
&lt;h3&gt;1. The Config Service&lt;/h3&gt;
&lt;p&gt;Before the engine performs any table operations, it sends a config request to the server:
&lt;code&gt;GET /v1/config&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;The response payload is a JSON document containing catalog properties. This config bootstrap allows the server to send runtime properties to the client, such as the active warehouse path and token refresh configurations:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;defaults&amp;quot;: {
    &amp;quot;clients.token-refresh-enabled&amp;quot;: &amp;quot;true&amp;quot;,
    &amp;quot;warehouse&amp;quot;: &amp;quot;s3://my-shared-lakehouse-bucket/&amp;quot;
  },
  &amp;quot;overrides&amp;quot;: {
    &amp;quot;compatibility.strict-mode&amp;quot;: &amp;quot;false&amp;quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Resolving Table Locations&lt;/h3&gt;
&lt;p&gt;When a query planner needs to resolve a table name like &lt;code&gt;analytics.orders&lt;/code&gt;, it executes a get request to the table endpoint:
&lt;code&gt;GET /v1/namespaces/analytics/tables/orders&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;The catalog server responds with a JSON payload detailing the complete state of the table, including schema fields, partition specs, and the exact URI of the current metadata file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;metadata-location&amp;quot;: &amp;quot;s3://my-shared-lakehouse-bucket/analytics/orders/metadata/v4.metadata.json&amp;quot;,
  &amp;quot;metadata&amp;quot;: {
    &amp;quot;format-version&amp;quot;: 2,
    &amp;quot;table-uuid&amp;quot;: &amp;quot;a8934b5c-89fd-4d2d-90c1-38290f847291&amp;quot;,
    &amp;quot;location&amp;quot;: &amp;quot;s3://my-shared-lakehouse-bucket/analytics/orders&amp;quot;,
    &amp;quot;last-sequence-number&amp;quot;: 12,
    &amp;quot;last-updated-ms&amp;quot;: 1716382800000,
    &amp;quot;last-column-id&amp;quot;: 5,
    &amp;quot;current-schema-id&amp;quot;: 0,
    &amp;quot;schemas&amp;quot;: [
      {
        &amp;quot;type&amp;quot;: &amp;quot;struct&amp;quot;,
        &amp;quot;schema-id&amp;quot;: 0,
        &amp;quot;fields&amp;quot;: [
          { &amp;quot;id&amp;quot;: 1, &amp;quot;name&amp;quot;: &amp;quot;order_id&amp;quot;, &amp;quot;required&amp;quot;: true, &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot; },
          {
            &amp;quot;id&amp;quot;: 2,
            &amp;quot;name&amp;quot;: &amp;quot;customer_id&amp;quot;,
            &amp;quot;required&amp;quot;: true,
            &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;
          },
          { &amp;quot;id&amp;quot;: 3, &amp;quot;name&amp;quot;: &amp;quot;order_date&amp;quot;, &amp;quot;required&amp;quot;: true, &amp;quot;type&amp;quot;: &amp;quot;date&amp;quot; },
          { &amp;quot;id&amp;quot;: 4, &amp;quot;name&amp;quot;: &amp;quot;status&amp;quot;, &amp;quot;required&amp;quot;: true, &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot; },
          { &amp;quot;id&amp;quot;: 5, &amp;quot;name&amp;quot;: &amp;quot;amount&amp;quot;, &amp;quot;required&amp;quot;: true, &amp;quot;type&amp;quot;: &amp;quot;double&amp;quot; }
        ]
      }
    ],
    &amp;quot;default-spec-id&amp;quot;: 0,
    &amp;quot;partition-specs&amp;quot;: [
      {
        &amp;quot;spec-id&amp;quot;: 0,
        &amp;quot;fields&amp;quot;: [
          {
            &amp;quot;name&amp;quot;: &amp;quot;order_date_day&amp;quot;,
            &amp;quot;transform&amp;quot;: &amp;quot;day&amp;quot;,
            &amp;quot;source-id&amp;quot;: 3,
            &amp;quot;field-id&amp;quot;: 1000
          }
        ]
      }
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. The Commit Protocol&lt;/h3&gt;
&lt;p&gt;During a write transaction, the query engine writes data files to storage, generates a new table metadata JSON file, and then attempts a pointer swap by sending a post request:
&lt;code&gt;POST /v1/namespaces/analytics/tables/orders&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;The request body contains the old metadata location (the base state) and the new metadata location:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;requirements&amp;quot;: [
    {
      &amp;quot;type&amp;quot;: &amp;quot;assert-metadata-location&amp;quot;,
      &amp;quot;metadata-location&amp;quot;: &amp;quot;s3://my-shared-lakehouse-bucket/analytics/orders/metadata/v4.metadata.json&amp;quot;
    }
  ],
  &amp;quot;updates&amp;quot;: [
    {
      &amp;quot;action&amp;quot;: &amp;quot;upgrade-format-version&amp;quot;,
      &amp;quot;format-version&amp;quot;: 2
    },
    {
      &amp;quot;action&amp;quot;: &amp;quot;add-snapshot&amp;quot;,
      &amp;quot;snapshot&amp;quot;: {
        &amp;quot;snapshot-id&amp;quot;: 8027658604211071520,
        &amp;quot;timestamp-ms&amp;quot;: 1716382900000,
        &amp;quot;summary&amp;quot;: {
          &amp;quot;operation&amp;quot;: &amp;quot;append&amp;quot;,
          &amp;quot;spark.app.id&amp;quot;: &amp;quot;app-20260522&amp;quot;
        },
        &amp;quot;manifest-list&amp;quot;: &amp;quot;s3://my-shared-lakehouse-bucket/analytics/orders/metadata/snap-8027658604211071520.avro&amp;quot;
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The REST server validates that the current metadata location of the table matches the requirement. If it matches, the server updates its internal state pointer to the new location and returns 200 OK. If a concurrent write has updated the table location, the assertion fails, and the server returns a 409 Conflict.&lt;/p&gt;
&lt;h3&gt;Credential Vending: Securing the Storage Layer&lt;/h3&gt;
&lt;p&gt;One of the most powerful features of the REST Catalog spec is &lt;strong&gt;credential vending&lt;/strong&gt;. In a traditional lakehouse environment, every query engine must have direct read and write access to the cloud storage bucket (such as S3 or ADLS) containing the raw data files. This requirement complicates security, as it forces administrators to manage broad IAM roles or access keys across multiple computing platforms.&lt;/p&gt;
&lt;p&gt;With credential vending, client engines do not need pre-configured storage keys. Instead, the process works as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The engine requests the catalog to load a table: &lt;code&gt;GET /v1/namespaces/analytics/tables/orders&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The REST Catalog server verifies that the user&apos;s role has permission to access the table.&lt;/li&gt;
&lt;li&gt;The server contacts the cloud provider&apos;s STS (Security Token Service) to generate short-lived, restricted storage credentials.&lt;/li&gt;
&lt;li&gt;The server returns these temporary credentials to the engine in the table metadata response payload.&lt;/li&gt;
&lt;li&gt;The engine uses the temporary credentials to read or write the physical Parquet data files directly from the object store.&lt;/li&gt;
&lt;li&gt;Once the session ends, the temporary credentials expire, securing the storage layer from unauthorized access.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Standard REST Catalog Error Handling&lt;/h3&gt;
&lt;p&gt;The Iceberg REST Catalog specification defines structured error JSON responses so client libraries can handle failures deterministically. Standard error formats include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;400 Bad Request:&lt;/strong&gt; Sent when a query parameter or request body is malformed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;401 Unauthorized:&lt;/strong&gt; Sent when the OAuth2 authorization token is invalid, expired, or missing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;404 Not Found:&lt;/strong&gt; Returned when a requested table or namespace does not exist.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;409 Conflict:&lt;/strong&gt; Sent during a table commit transaction when the target pointer has changed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;500 Internal Server Error:&lt;/strong&gt; Used for unhandled system exceptions in the catalog backend.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;5. Branching and Commits: Nessie and Polaris Architecture&lt;/h2&gt;
&lt;h3&gt;Project Nessie Versioned Key-Value Layout&lt;/h3&gt;
&lt;p&gt;Project Nessie is structured differently from traditional catalogs. While a typical REST catalog stores a simple database table containing mapping records, Nessie maintains a complete version graph.&lt;/p&gt;
&lt;p&gt;Nessie stores its commit log in a database like PostgreSQL or RocksDB. The commit log records commits as nodes containing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A unique hash identifier.&lt;/li&gt;
&lt;li&gt;A parent commit hash reference.&lt;/li&gt;
&lt;li&gt;A map of active table paths and their associated metadata JSON URIs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This key-value layout allows Nessie to perform Git-like operations:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Zero-Copy Branching:&lt;/strong&gt; Creating a branch is a metadata operation that registers a new name pointing to an existing commit hash in the database.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Isolated Writes:&lt;/strong&gt; When an engine writes to a branch, the write commits a new node on that branch&apos;s path. Other branches remain unaffected, isolating the write.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Merge Operations:&lt;/strong&gt; Merging updates a target branch&apos;s head to point to the commit hash of the source branch. If both branches modified the same table concurrently, Nessie rejects the merge, requesting conflict resolution.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Polaris Security and Administration Architecture&lt;/h3&gt;
&lt;p&gt;The Polaris catalog implements the Iceberg REST spec but adds administrative structures to manage multiple independent catalogs and cloud environments.&lt;/p&gt;
&lt;p&gt;Administrators configure Polaris using its management API. The administrative layer is structured around three entities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Catalogs:&lt;/strong&gt; Named spaces representing distinct metadata scopes, such as a catalog for production and a catalog for testing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Principals:&lt;/strong&gt; Credentials representing client applications, Spark jobs, or query engines.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Roles:&lt;/strong&gt; Logical mappings containing specific access privileges (e.g. write access to &lt;code&gt;analytics&lt;/code&gt;, read-only access to &lt;code&gt;customers&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This separation ensures that security teams can manage permissions in Polaris, while query engines connect using standard REST clients, unaware of the underlying security layout.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;6. Locking Strategies: Relational Databases vs Serverless Catalogs&lt;/h2&gt;
&lt;p&gt;The mechanism a catalog uses to handle table locks determines its reliability and scale limits. Let us analyze the locking strategies implemented across different systems.&lt;/p&gt;
&lt;h3&gt;Relational Database Locks (REST, Hive)&lt;/h3&gt;
&lt;p&gt;For REST catalogs backed by relational databases (like PostgreSQL), transaction atomicity is achieved using database transactions. When an engine commits a pointer update, the REST server executes a SQL statement:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Execute a table write lock within PostgreSQL */
SELECT metadata_location
FROM table_metadata
WHERE table_name = &apos;analytics.orders&apos;
FOR UPDATE;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query blocks other database transactions from updating the same row. The catalog server verifies that the metadata location matches, writes the new metadata row, and commits the transaction, releasing the lock. While reliable, this model can block connections under high write concurrency, as transactions wait for database locks.&lt;/p&gt;
&lt;h3&gt;DynamoDB Locking (Glue)&lt;/h3&gt;
&lt;p&gt;AWS Glue Catalog relies on DynamoDB or internal locks to manage table pointer swaps. During a commit, Glue uses DynamoDB&apos;s optimistic concurrency control:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Glue reads the table record containing the current metadata URI.&lt;/li&gt;
&lt;li&gt;Glue writes the new metadata URI using a conditional expression, verifying that the table&apos;s version tag matches the read value.&lt;/li&gt;
&lt;li&gt;If the version matches, DynamoDB executes the write and increments the version tag.&lt;/li&gt;
&lt;li&gt;If another write has changed the version tag, DynamoDB rejects the update, causing Glue to return a commit conflict error.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This optimistic model performs well under moderate write concurrency, but high conflict rates can degrade performance, as clients repeatedly fail conditional writes and retry.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;7. Cross-Engine Catalog Configuration&lt;/h2&gt;
&lt;p&gt;To build an open lakehouse, you must configure multiple query engines to share a single catalog. This setup ensures that if a PySpark job writes to a table, a Dremio query can access the data instantly.&lt;/p&gt;
&lt;p&gt;Let us walk through configuring a shared REST Catalog across a PySpark pipeline and a Dremio engine, using our standard schemas.&lt;/p&gt;
&lt;h3&gt;PySpark Catalog Configuration&lt;/h3&gt;
&lt;p&gt;To configure PySpark to connect to a shared REST catalog, we pass the catalog class, URI, credentials, and warehouse parameters to the Spark configuration.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from pyspark.sql import SparkSession

/* Define Spark Session with REST Catalog properties */
spark = SparkSession.builder \
    .appName(&amp;quot;SharedRESTCatalogSetup&amp;quot;) \
    .config(&amp;quot;spark.jars.packages&amp;quot;, &amp;quot;org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2&amp;quot;) \
    .config(&amp;quot;spark.sql.extensions&amp;quot;, &amp;quot;org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.rest_catalog&amp;quot;, &amp;quot;org.apache.iceberg.spark.SparkCatalog&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.rest_catalog.type&amp;quot;, &amp;quot;rest&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.rest_catalog.uri&amp;quot;, &amp;quot;http://rest-catalog-server:8181&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.rest_catalog.credential&amp;quot;, &amp;quot;client_id:client_secret&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.rest_catalog.warehouse&amp;quot;, &amp;quot;s3://my-shared-lakehouse-bucket/&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.rest_catalog.io-impl&amp;quot;, &amp;quot;org.apache.iceberg.aws.s3.S3FileIO&amp;quot;) \
    .getOrCreate()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once configured, we can initialize our standard tables using Spark SQL:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Create analytics namespace under our rest_catalog */
CREATE NAMESPACE IF NOT EXISTS rest_catalog.analytics;

/* Create the orders table in the REST catalog */
CREATE TABLE IF NOT EXISTS rest_catalog.analytics.orders (
    order_id STRING,
    customer_id STRING,
    order_date DATE,
    status STRING,
    amount DOUBLE
)
USING iceberg
PARTITIONED BY (days(order_date));

/* Create the customers table in the REST catalog */
CREATE TABLE IF NOT EXISTS rest_catalog.analytics.customers (
    customer_id STRING,
    name STRING,
    email STRING,
    state STRING,
    signup_date DATE
)
USING iceberg;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;PySpark Insert Script&lt;/h3&gt;
&lt;p&gt;We can execute an insert pipeline to populate data into these tables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from datetime import date
from pyspark.sql.types import StructType, StructField, StringType, DateType, DoubleType

/* Define data for orders */
order_data = [
    (&amp;quot;O1001&amp;quot;, &amp;quot;C2001&amp;quot;, date(2026, 5, 20), &amp;quot;COMPLETED&amp;quot;, 150.50),
    (&amp;quot;O1002&amp;quot;, &amp;quot;C2002&amp;quot;, date(2026, 5, 21), &amp;quot;PENDING&amp;quot;, 45.00),
    (&amp;quot;O1003&amp;quot;, &amp;quot;C2001&amp;quot;, date(2026, 5, 22), &amp;quot;COMPLETED&amp;quot;, 300.00)
]

schema_orders = StructType([
    StructField(&amp;quot;order_id&amp;quot;, StringType(), True),
    StructField(&amp;quot;customer_id&amp;quot;, StringType(), True),
    StructField(&amp;quot;order_date&amp;quot;, DateType(), True),
    StructField(&amp;quot;status&amp;quot;, StringType(), True),
    StructField(&amp;quot;amount&amp;quot;, DoubleType(), True)
])

df_orders = spark.createDataFrame(order_data, schema=schema_orders)
df_orders.write.format(&amp;quot;iceberg&amp;quot;).mode(&amp;quot;append&amp;quot;).save(&amp;quot;rest_catalog.analytics.orders&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configuring Trino for the REST Catalog&lt;/h3&gt;
&lt;p&gt;Trino can be configured to read and write to the same REST catalog by adding a catalog configuration file to its &lt;code&gt;etc/catalog/&lt;/code&gt; directory, for example, &lt;code&gt;etc/catalog/rest_catalog.properties&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-properties&quot;&gt;connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=http://rest-catalog-server:8181
iceberg.rest-catalog.security=OAUTH2
iceberg.rest-catalog.oauth2.credential=client_id:client_secret
iceberg.rest-catalog.warehouse=s3://my-shared-lakehouse-bucket/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this configuration, Trino users can query the table using the same name:
&lt;code&gt;SELECT * FROM rest_catalog.analytics.orders;&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;Configuring Apache Flink for the REST Catalog&lt;/h3&gt;
&lt;p&gt;For real-time streaming jobs, Apache Flink can connect to the REST catalog using its SQL client or Table API. The configuration is defined in Flink&apos;s SQL client configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE CATALOG rest_catalog WITH (
  &apos;type&apos;=&apos;iceberg&apos;,
  &apos;catalog-impl&apos;=&apos;org.apache.iceberg.rest.RESTCatalog&apos;,
  &apos;uri&apos;=&apos;http://rest-catalog-server:8181&apos;,
  &apos;credential&apos;=&apos;client_id:client_secret&apos;,
  &apos;warehouse&apos;=&apos;s3://my-shared-lakehouse-bucket/&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By standardizing catalog access, all three engines (Spark, Trino, Flink) can interact with the table metadata concurrently, with the catalog coordinating commits and enforcing isolation levels.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;8. Dremio Integration and Query Acceleration&lt;/h2&gt;
&lt;p&gt;Once PySpark has written the data and committed the transaction through the REST Catalog, other query engines can access the new records instantly. Dremio integrates directly with Iceberg REST catalogs, providing interactive query execution.&lt;/p&gt;
&lt;h3&gt;Connecting Dremio to the REST Catalog&lt;/h3&gt;
&lt;p&gt;To query the shared tables in Dremio, you add the catalog as a data source in the Dremio administrator console:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Add Source&lt;/strong&gt; in the Dremio UI.&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;Apache Iceberg REST Catalog&lt;/strong&gt; from the catalog list.&lt;/li&gt;
&lt;li&gt;Configure the connection parameters:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; &lt;code&gt;rest_catalog&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;REST URI:&lt;/strong&gt; &lt;code&gt;http://rest-catalog-server:8181&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authentication Type:&lt;/strong&gt; OAuth2 Client Credentials&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Client ID &amp;amp; Client Secret:&lt;/strong&gt; Input the credentials matching the Spark configuration.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Under &lt;strong&gt;Storage Connection&lt;/strong&gt;, configure the S3 physical source parameters, referencing the warehouse path &lt;code&gt;s3://my-shared-lakehouse-bucket/&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt; to initialize the source.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Once connected, Dremio displays the &lt;code&gt;analytics&lt;/code&gt; namespace and the tables (&lt;code&gt;orders&lt;/code&gt; and &lt;code&gt;customers&lt;/code&gt;) in its metadata catalog tree. You can query the tables using standard ANSI SQL without running any manual table synchronization jobs:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Execute a cross-table join query in Dremio */
SELECT
    o.order_id,
    c.name,
    o.amount,
    o.order_date
FROM rest_catalog.analytics.orders o
JOIN rest_catalog.analytics.customers c
    ON o.customer_id = c.customer_id
WHERE o.status = &apos;COMPLETED&apos;
  AND o.amount &amp;gt; 100.00;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Explaining Dremio Query Acceleration&lt;/h3&gt;
&lt;p&gt;While standard SQL execution works out of the box, the Dremio engine implements architectural optimizations that make reads significantly faster than raw storage queries.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;                 Dremio Engine Acceleration Layer

              +------------------------------------+
              |          Dremio Optimizer          |
              |       (Apache Calcite Planner)      |
              +-----------------+------------------+
                                |
             Is there an active Reflection matched?
                     /                      \
                  (Yes)                     (No)
                   /                          \
+-----------------v---------------+    +-------v-----------------+
|   Rewrite Query to Reference    |    |   Direct Table Scan     |
|   Aggregation/Raw Reflection    |    |   Using Vectorized      |
|   (Pre-computed Parquet cache)  |    |   Using Apache Arrow     |
+---------------------------------+    +-------+-----------------+
                                               |
                                    Check Local Coordinator Cache
                                               |
                                    Are metadata files cached?
                                           /            \
                                        (Yes)           (No)
                                         /                \
                    +-------------------v---+    +---------v-------------+
                    | Skip Object Store API |    | Read Metadata JSON    |
                    | Call for Metadata     |    | from Cloud Storage    |
                    +-----------------------+    +-----------------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Vectorized Memory Layout (Apache Arrow)&lt;/h4&gt;
&lt;p&gt;The Dremio engine uses Apache Arrow as its internal memory format. Apache Arrow is a columnar in-memory data layout that permits vectorized query processing. When Dremio reads Parquet files from S3, it processes data in memory without performing expensive row-to-column serialization and deserialization. By executing instructions directly on column arrays, the engine maximizes CPU cache locality and hardware performance.&lt;/p&gt;
&lt;h4&gt;Local Coordinator Metadata Cache&lt;/h4&gt;
&lt;p&gt;When an engine plans an Iceberg query, it must read the hierarchical metadata tree (the JSON metadata file, the manifest list, and individual manifest files). If the storage layer is cloud object storage (like S3), fetching these metadata files introduces significant network latency.&lt;/p&gt;
&lt;p&gt;The Dremio engine avoids this latency using a local coordinator cache. Dremio caches the parsed Iceberg metadata on its coordinator nodes. When a new query arrives, Dremio checks if the catalog has updated the table pointer. If the pointer has not changed, Dremio plans the query using the cached metadata, avoiding the need to make network requests to cloud storage. This optimization reduces query planning times from seconds to milliseconds.&lt;/p&gt;
&lt;h4&gt;Positional Delete Caching&lt;/h4&gt;
&lt;p&gt;Iceberg supports row-level updates and deletes using delete files (copy-on-write or merge-on-read). In merge-on-read tables, readers must read the base data files and apply delete files at runtime. Applying these deletes dynamically can degrade performance.&lt;/p&gt;
&lt;p&gt;The Dremio engine accelerates merge-on-read queries by caching positional delete files in memory. Rather than reloading delete files for every query scan, Dremio maintains an active cache of deleted row indexes, applying them to base data scans at memory speed.&lt;/p&gt;
&lt;h4&gt;Reflections and Calcite Cost-Based Optimization&lt;/h4&gt;
&lt;p&gt;The Dremio engine includes a query acceleration feature called Data Reflections. Reflections are pre-computed layouts of tables or joins that are stored as optimized Parquet files.&lt;/p&gt;
&lt;p&gt;When a user executes a query, the Dremio optimizer (built on Apache Calcite) checks if the query structure matches an active Reflection. Dremio uses Calcite to parse the query into an abstract syntax tree (AST) and then convert it into a logical algebra representation. The optimizer applies multiple transformation rules to check if a logical query block can be replaced by a pre-aggregated or raw reflection scan.&lt;/p&gt;
&lt;p&gt;This replacement is evaluated using a cost-based model. If the optimizer determines that reading the Reflection requires scanning fewer bytes and blocks than executing the original join or aggregation, it rewrites the query plan. This rewrite is transparent to the user, allowing queries that would typically scan millions of records to return in sub-second intervals.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;9. Catalog Migration Strategies&lt;/h2&gt;
&lt;p&gt;Migrating existing tables to a shared Iceberg catalog requires careful planning to prevent write downtime and verify metadata accuracy. Let us examine the two primary migration paths.&lt;/p&gt;
&lt;h3&gt;1. In-Place Catalog Migration (Register Table)&lt;/h3&gt;
&lt;p&gt;If you have existing Iceberg tables registered in a legacy catalog (such as Hive Metastore) and want to migrate to a REST catalog (like Polaris), you do not need to rewrite or move the data files.&lt;/p&gt;
&lt;p&gt;Because Iceberg tables are defined by their metadata JSON file, you can register the table directly in the new catalog:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Retrieve the current metadata JSON path from the source catalog.&lt;/li&gt;
&lt;li&gt;Ensure that no active transactions are running on the table.&lt;/li&gt;
&lt;li&gt;Execute a register command in the new catalog, pointing to the active metadata JSON location.&lt;/li&gt;
&lt;li&gt;Verify the registered table using Spark or Dremio.&lt;/li&gt;
&lt;li&gt;Decommission the old catalog reference for that table.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This register operation is a metadata-only transaction that completes in milliseconds, avoiding the need to read or rewrite physical Parquet blocks.&lt;/p&gt;
&lt;h3&gt;2. External Table Migration (Parquet to Iceberg)&lt;/h3&gt;
&lt;p&gt;If your legacy tables are stored as raw Parquet files (non-Iceberg format) and registered in Hive Metastore, you must upgrade them to Iceberg tables. You can achieve this using Spark&apos;s in-place metadata translation procedures:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Snapshot Procedure:&lt;/strong&gt; Creates a new Iceberg table by reading the existing Parquet files. The old Parquet table remains active.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Migrate Procedure:&lt;/strong&gt; Converts the existing Parquet table directly into an Iceberg table, replacing the old table in the catalog.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Convert existing parquet table to Iceberg in-place */
CALL rest_catalog.system.migrate(
    table =&amp;gt; &apos;analytics.orders&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This migration process reads the existing Parquet footers and generates corresponding Iceberg metadata files (manifests and JSON metadata), registering the new table in the REST catalog without rewriting the underlying data.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;10. Catalog Best Practices&lt;/h2&gt;
&lt;p&gt;To maintain a healthy lakehouse, organizations should adhere to several operational best practices when designing and managing their Iceberg catalogs.&lt;/p&gt;
&lt;h3&gt;Design a Single Source of Truth&lt;/h3&gt;
&lt;p&gt;Avoid registering the same physical table data files in multiple independent catalogs. For example, do not point an AWS Glue Catalog and a Project Nessie catalog to the same S3 directory. Because catalogs do not share transaction states, they cannot coordinate concurrent writes. If two catalogs modify the same files, they will overwrite each other&apos;s pointers, causing metadata corruption and data loss. Always designate a single catalog to act as the writer, and configure other engines to sync or read from that catalog.&lt;/p&gt;
&lt;h3&gt;Configure Automatic Metadata Cleanup&lt;/h3&gt;
&lt;p&gt;Every write operation to an Iceberg table creates new metadata files, manifest files, and physical data files. Over time, these historical files accumulate, increasing object storage costs and catalog pointer overhead.&lt;/p&gt;
&lt;p&gt;Implement a maintenance process to clean up historical files. In Spark, you can run system procedures to expire snapshots and remove orphan files:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Expire snapshots older than 7 days */
CALL rest_catalog.system.expire_snapshots(
    table =&amp;gt; &apos;analytics.orders&apos;,
    older_than =&amp;gt; TIMESTAMP &apos;2026-05-15 00:00:00.000&apos;
);

/* Clean up physical files no longer tracked by metadata */
CALL rest_catalog.system.remove_orphan_files(
    table =&amp;gt; &apos;analytics.orders&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Implement Monotonic ID Counters for Column Resolution&lt;/h3&gt;
&lt;p&gt;Iceberg resolves table columns using unique column IDs, rather than column names. This design is what allows Iceberg to support schema evolution operations, like column renames and reordering, without rewriting data.&lt;/p&gt;
&lt;p&gt;When designing custom REST catalog servers or database integrations, ensure that column ID assignment is strictly monotonic. If column IDs are recycled during column drop and add operations, client engines can misalign columns during query planning, leading to incorrect query results.&lt;/p&gt;
&lt;h3&gt;Choose the REST Spec for Long Term Portability&lt;/h3&gt;
&lt;p&gt;When building a new data lakehouse, prioritize REST-compliant catalogs like Polaris or the standard Iceberg REST server. By using REST as the primary connection interface, you future-proof your architecture. If you decide to change your backend catalog database or migrate from AWS to Google Cloud, you can swap the REST catalog server without modifying the configuration of your query engines. This decoupling ensures your data lakehouse remains open and portable.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Maintaining Apache Iceberg Tables: Compaction, Snapshot Expiration, and Orphan File Cleanup</title><link>https://iceberglakehouse.com/posts/2026-05-22-apache-iceberg-maintenance-compaction/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-22-apache-iceberg-maintenance-compaction/</guid><description>
The core promise of an open data lakehouse is to deliver the scalability and low storage cost of an object store combined with the transactional reli...</description><pubDate>Fri, 22 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The core promise of an open data lakehouse is to deliver the scalability and low storage cost of an object store combined with the transactional reliability and performance of an enterprise data warehouse. To fulfill this promise, data platform administrators must establish structured table maintenance routines. Unlike traditional databases that manage storage layouts automatically behind proprietary interfaces, an open lakehouse exposes physical files on object storage directly to developers and query engines. This exposure provides extreme flexibility but shifts the responsibility of storage layout optimization to data engineers.&lt;/p&gt;
&lt;p&gt;In an active lakehouse environment where tables are constantly modified by streaming ingest pipelines and batch ETL processes, tables can degrade over time. Data files can fragment, metadata structures can grow bloated, and snapshots can accumulate, leading to increased query latency and rising cloud storage costs.&lt;/p&gt;
&lt;p&gt;In this comprehensive guide, we will explore the core maintenance tasks required to keep Apache Iceberg tables in a healthy, high-performance state. We will dissect the &amp;quot;small files problem,&amp;quot; compare compaction techniques (bin-packing, sort-based, and Z-Order spatial clustering), write Spark SQL scripts to execute maintenance operations, and explain how the Dremio engine leverages optimized layouts to achieve sub-second execution speeds.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;1. The &amp;quot;Small Files Problem&amp;quot; and Storage Layout Latency&lt;/h2&gt;
&lt;p&gt;In modern data lakehouses, data is often ingested in near-real-time from message queues like Apache Kafka, or via frequent micro-batch jobs running every few minutes. While this ingestion pattern ensures that fresh data is available quickly, it creates a significant structural issue: the creation of thousands of very small files on object storage. This is commonly referred to as the &amp;quot;small files problem.&amp;quot;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Legacy / Uncompacted Table Layout:
+-----------------------------------------------------------------------------------+
|  10,000 files x 10 KB = 100 MB total.                                             |
|  Querying requires 10,000 HTTP GET requests, causing high network connection delay.|
+-----------------------------------------------------------------------------------+

Compacted Table Layout:
+-----------------------------------------------------------------------------------+
|  1 file x 100 MB = 100 MB total.                                                  |
|  Querying requires 1 HTTP GET request, reading data at maximum network speed.     |
+-----------------------------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;The Physics of Object Storage Latency&lt;/h3&gt;
&lt;p&gt;Amazon S3 and other cloud object stores are designed for high throughput and durability, but they have non-trivial connection establishment latency. Every HTTP request sent to S3 incurs overhead. This includes DNS resolution, TCP handshake setup, TLS negotiation, and S3 internal request routing. Typically, the time to first byte for a GET request is between 10 and 50 milliseconds.&lt;/p&gt;
&lt;p&gt;Suppose a query engine needs to scan a table containing 10,000 files of 10 kilobytes each. The actual data volume is only 100 megabytes. However, to read this data, the query engine must execute 10,000 separate GET requests. If executed sequentially, the network overhead would consume minutes. Even when executed in parallel, the overhead is substantial and can trigger S3 rate limiting (which caps requests at 5,500 GET requests per second per prefix), leading to HTTP 503 throttling errors.&lt;/p&gt;
&lt;p&gt;Conversely, if those same records are compacted into a single 100-megabyte Parquet file, the query engine performs a single GET request, establishes a single connection, and reads the entire dataset at the maximum bandwidth of the network connection. By consolidating data, compaction eliminates request latency, prevents S3 throttling, and lowers your AWS billing charges (since S3 charges per 1,000 API requests).&lt;/p&gt;
&lt;h3&gt;How Parquet Layouts Interact with File Size&lt;/h3&gt;
&lt;p&gt;The Parquet file format is columnar, meaning it stores data columns adjacent to each other rather than rows. Parquet files organize data into row groups, column chunks, and pages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Row Groups&lt;/strong&gt;: Horizontal divisions of data within the file. A typical row group contains between 100,000 and 1,000,000 rows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Column Chunks&lt;/strong&gt;: Column-specific segments within a row group.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pages&lt;/strong&gt;: The smallest unit of physical storage, containing actual values, repetition levels, and definition levels.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To plan queries efficiently, engines read Parquet file footers. The footer contains metadata, including min/max values for every column within each row group. If a file is too small (for instance, containing only a few thousand rows), the row groups are tiny, and the overhead of reading file footers outweighs the benefits of columnar skipping. Storing data in optimized file sizes (typically between 128 MB and 512 MB) ensures that row groups are large enough to make columnar skip statistics effective, while remaining small enough to be easily processed in parallel.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;2. In-Depth Analysis of Compaction Techniques&lt;/h2&gt;
&lt;p&gt;Compaction is the process of reading existing small files, consolidating their contents, and writing them out as larger, optimized files. In Apache Iceberg, compaction is a metadata-only rewrite operation. The data files are physically written, and a new snapshot is created, but the logical content of the table does not change. Iceberg supports three primary compaction strategies.&lt;/p&gt;
&lt;h3&gt;Bin-Packing&lt;/h3&gt;
&lt;p&gt;Bin-packing is the fastest and least resource-intensive compaction strategy. It operates on a simple algorithm: it groups small files into &amp;quot;bins&amp;quot; based on file size and writes each bin out as a new, larger file.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Input Files: [10MB, 15MB, 5MB, 20MB, 80MB, 12MB, 8MB]
Target File Size: 100MB

Bin 1: [10MB, 15MB, 5MB, 20MB] -&amp;gt; Compacted into File A (50MB)
Bin 2: [80MB, 12MB, 8MB]       -&amp;gt; Compacted into File B (100MB)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Bin-packing does not reorganize or sort the rows within the files. It simply reads the raw rows and packs them sequentially. Because it does not sort the data, bin-packing requires minimal CPU cycles and memory, making it highly cost-effective. It is ideal as a first-line defense against the small files problem in high-frequency ingest pipelines. However, because it does not reorganize row order, it does not optimize the table for specific query search paths.&lt;/p&gt;
&lt;h3&gt;Sort-Based Compaction and Spark Execution Trade-Offs&lt;/h3&gt;
&lt;p&gt;Sort-based compaction reads small files, sorts the rows based on one or more target columns, and writes the sorted data out into large files.&lt;/p&gt;
&lt;p&gt;For example, if you frequently query the &lt;code&gt;analytics.orders&lt;/code&gt; table filtering by &lt;code&gt;customer_id&lt;/code&gt;, you can run sort compaction using &lt;code&gt;customer_id&lt;/code&gt; as the sort key. This groups all records with the same customer ID into adjacent physical locations within the Parquet files.&lt;/p&gt;
&lt;p&gt;Sorting dramatically improves query performance by maximizing Parquet min/max statistics skipping. If a query filters for &lt;code&gt;customer_id = &apos;C001&apos;&lt;/code&gt;, the query engine inspects the min/max statistics in the row group footers. In a sorted file, only a few row groups will contain values matching &lt;code&gt;&apos;C001&apos;&lt;/code&gt;. The query engine skips all other row groups, reducing disk reads. If the file was not sorted, &lt;code&gt;&apos;C001&apos;&lt;/code&gt; records would be scattered across every row group in every file, forcing the engine to scan the entire dataset.&lt;/p&gt;
&lt;p&gt;When orchestrating sort-based compaction in Spark, you must choose between partition-level sorting and global sorting:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Partition Sorting&lt;/strong&gt;: Sorts data files within each logical partition independently. If your table is partitioned by day, Spark sorts the data inside each day folder. This is fast and restricts shuffles to individual partition boundaries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Global Sorting&lt;/strong&gt;: Sorts data across the entire table, regardless of logical partitions. This requires a global range partitioner in Spark, resulting in a large network shuffle as data is redistributed across all Spark executors. Global sorting is more expensive but provides the highest possible query performance for non-partitioned tables or tables with coarse partitioning.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Z-Order Spatial Clustering (Space-Filling Curves)&lt;/h3&gt;
&lt;p&gt;While sort-based compaction is highly effective when filtering on a single column, it has a significant limitation: you must prioritize one column over another. If you sort a table by &lt;code&gt;customer_id&lt;/code&gt; and then by &lt;code&gt;order_date&lt;/code&gt;, the data is organized primarily by customer ID. Within each customer ID, the records are sorted by date. If you query the table filtering only by &lt;code&gt;order_date&lt;/code&gt;, the min/max statistics are ineffective because dates are scattered across different customer ID blocks.&lt;/p&gt;
&lt;p&gt;Z-Order clustering solves this by organizing data along a multi-dimensional space-filling curve. A space-filling curve maps multi-dimensional attributes into a single-dimensional line while maintaining spatial locality. This means that points that are close to each other in multi-dimensional space remain close to each other in physical storage.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Z-Order Bit Interleaving Example:
Suppose we want to cluster by customer_id (integer representable) and order_date (integer days).

customer_id (binary): 0 1 0 1
order_date (binary):  1 1 0 0

Interleaved Z-Address: 0 1 1 1 0 0 1 0 (taking alternating bits)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To Z-Order data, the compaction engine takes the binary representations of the target columns and interleaves their bits to create a single coordinate (a Z-address). The data is then sorted based on this Z-address.&lt;/p&gt;
&lt;p&gt;The mapping function $f: \mathbb{R}^d \to \mathbb{R}$ transforms multidimensional parameters into a single dimension. To perform Z-ordering on a dataset, the Spark coordinator first scans the target columns (for example, &lt;code&gt;customer_id&lt;/code&gt; and &lt;code&gt;order_date&lt;/code&gt;) to determine their minimum and maximum ranges. It then projects these values onto an integer grid. Once projected, the bits of the binary coordinate representations are interleaved:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Let the coordinate values be represented as binary arrays. For a point $(x, y)$, where $x = x_1 x_2 ... x_k$ and $y = y_1 y_2 ... y_k$, the Z-address is constructed by taking alternate bits: $z = x_1 y_1 x_2 y_2 ... x_k y_k$.&lt;/li&gt;
&lt;li&gt;Spark sorts the rows based on this interleaved Z-value.&lt;/li&gt;
&lt;li&gt;The sorted records are written sequentially to Parquet files on S3.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Because the Z-address is built by alternating bits, a query filtering on either $x$ or $y$ (or both) can skip files. The Z-curve traces a recursive Z-shaped fractal path through the coordinate space. When the query engine applies a filter, it calculates the range of Z-values that could contain matching data, compares it to the min/max Z-addresses in each Parquet file footer, and skips the files that do not overlap.&lt;/p&gt;
&lt;h4&gt;Z-Order vs. Hilbert Curves&lt;/h4&gt;
&lt;p&gt;While Z-Ordering is widely used, it has some spatial partitioning issues. At quadrant boundaries, Z-Order curves make sudden jumps. For example, two adjacent points located on opposite sides of a main quadrant division line can end up with highly different Z-addresses, separating them physically on disk.&lt;/p&gt;
&lt;p&gt;The Hilbert curve is an alternative space-filling curve that avoids these sudden jumps by dynamically rotating the coordinate grid at each level of detail. This rotation ensures that the curve never crosses itself and maintains smoother spatial locality. Some advanced compaction runtimes support Hilbert curve sorting, which can offer slightly better read performance than Z-Order, though at the expense of even higher compaction CPU overhead.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;3. Configuring and Executing Compaction in Apache Spark&lt;/h2&gt;
&lt;p&gt;Compaction in Apache Iceberg is typically orchestrated using Apache Spark. Spark provides native SQL procedures to execute compaction on target tables.&lt;/p&gt;
&lt;p&gt;We will use our standard analytical tables for these examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;analytics.orders&lt;/code&gt; (fields: &lt;code&gt;order_id&lt;/code&gt;, &lt;code&gt;customer_id&lt;/code&gt;, &lt;code&gt;order_date&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt;, &lt;code&gt;amount&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;analytics.customers&lt;/code&gt; (fields: &lt;code&gt;customer_id&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;email&lt;/code&gt;, &lt;code&gt;state&lt;/code&gt;, &lt;code&gt;signup_date&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;First, we set up our Spark session to connect to our Iceberg catalog:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from pyspark.sql import SparkSession

# Initialize Spark Session configured with Iceberg catalog
spark = SparkSession.builder \
    .appName(&amp;quot;IcebergTableCompaction&amp;quot;) \
    .config(&amp;quot;spark.sql.extensions&amp;quot;, &amp;quot;org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.glue_catalog&amp;quot;, &amp;quot;org.apache.iceberg.spark.SparkCatalog&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.glue_catalog.catalog-impl&amp;quot;, &amp;quot;org.apache.iceberg.aws.glue.GlueCatalog&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.glue_catalog.warehouse&amp;quot;, &amp;quot;s3://my-lakehouse-bucket/warehouse/&amp;quot;) \
    .getOrCreate()
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Executing Bin-Packing Compaction&lt;/h3&gt;
&lt;p&gt;To run a fast bin-packing compaction on the &lt;code&gt;analytics.orders&lt;/code&gt; table, we call the &lt;code&gt;rewrite_data_files&lt;/code&gt; procedure using Spark SQL.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Execute bin-packing compaction using Spark SQL procedures
spark.sql(&amp;quot;&amp;quot;&amp;quot;
    CALL glue_catalog.system.rewrite_data_files(
      table =&amp;gt; &apos;analytics.orders&apos;,
      strategy =&amp;gt; &apos;binpack&apos;,
      options =&amp;gt; map(
        &apos;target-file-size-bytes&apos;, &apos;536870912&apos;, /* 512 MB target size */
        &apos;min-input-files&apos;, &apos;10&apos;                 /* Only compact partitions with &amp;gt;= 10 files */
      )
    )
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this procedure call:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;strategy =&amp;gt; &apos;binpack&apos;&lt;/code&gt;: Specifies that we are using the bin-packing algorithm.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;target-file-size-bytes&lt;/code&gt;: Instructs the engine to target 512 megabytes per output file.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;min-input-files&lt;/code&gt;: Prevents Spark from spending compute resources on partitions that are already clean (containing fewer than 10 files).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Executing Z-Order Compaction&lt;/h3&gt;
&lt;p&gt;For the &lt;code&gt;analytics.orders&lt;/code&gt; table, we can Z-Order the data by &lt;code&gt;customer_id&lt;/code&gt; and &lt;code&gt;order_date&lt;/code&gt; to accelerate join queries and time-series reports.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Execute Z-Order compaction on the orders table
spark.sql(&amp;quot;&amp;quot;&amp;quot;
    CALL glue_catalog.system.rewrite_data_files(
      table =&amp;gt; &apos;analytics.orders&apos;,
      strategy =&amp;gt; &apos;sort&apos;,
      sort_order =&amp;gt; &apos;zorder(customer_id, order_date)&apos;,
      options =&amp;gt; map(
        &apos;target-file-size-bytes&apos;, &apos;536870912&apos;, /* 512 MB target */
        &apos;max-file-group-size-bytes&apos;, &apos;107374182400&apos; /* Process in 100 GB groups */
      )
    )
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Compacting Manifest Files&lt;/h3&gt;
&lt;p&gt;In addition to compacting data files, you should also compact metadata manifest files. Every time a write job runs, Iceberg writes a new manifest file. Over time, tables can accumulate hundreds of small manifest files, which slows down query planning.&lt;/p&gt;
&lt;p&gt;We can merge small manifest files using the &lt;code&gt;rewrite_manifests&lt;/code&gt; procedure:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Compact metadata manifests to optimize query planning
spark.sql(&amp;quot;&amp;quot;&amp;quot;
    CALL glue_catalog.system.rewrite_manifests(
      table =&amp;gt; &apos;analytics.orders&apos;
    )
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Advanced Compaction Configurations and Partial Progress&lt;/h3&gt;
&lt;p&gt;When executing compaction on large fact tables (containing terabytes of data), running the entire compaction inside a single large transaction is risky. If the compaction job takes several hours and another writer commits an update to the table in the meantime, the compaction&apos;s CAS transaction may fail, forcing you to rerun the entire compaction.&lt;/p&gt;
&lt;p&gt;To prevent this, you can configure partial progress. Partial progress divides the compaction job into separate file groups and commits each group independently.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Run compaction with partial progress enabled
spark.sql(&amp;quot;&amp;quot;&amp;quot;
    CALL glue_catalog.system.rewrite_data_files(
      table =&amp;gt; &apos;analytics.orders&apos;,
      options =&amp;gt; map(
        &apos;partial-progress.enabled&apos;, &apos;true&apos;,
        &apos;partial-progress.max-commits&apos;, &apos;10&apos;,
        &apos;max-concurrent-file-group-rewrites&apos;, &apos;4&apos;,
        &apos;target-file-size-bytes&apos;, &apos;536870912&apos;
      )
    )
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let us dissect these properties:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;&apos;partial-progress.enabled&apos; = &apos;true&apos;&lt;/code&gt;: Instructs Iceberg to commit compacted files in smaller batches rather than waiting for the entire table compaction to finish.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&apos;partial-progress.max-commits&apos; = &apos;10&apos;&lt;/code&gt;: Limits the number of commits Iceberg can execute during this job, ensuring we do not overload the catalog with too many micro-commits.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&apos;max-concurrent-file-group-rewrites&apos; = &apos;4&apos;&lt;/code&gt;: Allows Spark to compile up to 4 file groups in parallel, maximizing cluster utilization.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Restricting Compaction with Filters&lt;/h3&gt;
&lt;p&gt;In many production environments, you only need to compact recent partitions. For example, if data is written continuously to the current day&apos;s partition, the historical partitions are already static and compacted. Compacting the entire table every night wastes massive amounts of cluster time.&lt;/p&gt;
&lt;p&gt;You can restrict compaction to a specific partition or data range using the &lt;code&gt;where&lt;/code&gt; option:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Compact only the data written in the month of May 2026
spark.sql(&amp;quot;&amp;quot;&amp;quot;
    CALL glue_catalog.system.rewrite_data_files(
      table =&amp;gt; &apos;analytics.orders&apos;,
      where =&amp;gt; &apos;order_date &amp;gt;= CAST(&amp;quot;2026-05-01&amp;quot; AS DATE)&apos;
    )
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Programmatic Compaction via PySpark Actions API&lt;/h3&gt;
&lt;p&gt;In addition to running compaction using standard Spark SQL queries, you can orchestrate compactions programmatically using PySpark&apos;s access to the Java classes in the Spark actions framework. This is highly useful when building custom Python script orchestrators that are executed by Airflow or AWS Glue.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Reference the Java Classes using the PySpark JVM gateway
jvm = spark._jvm
table_identifier = jvm.org.apache.iceberg.catalog.TableIdentifier.of(&amp;quot;analytics&amp;quot;, &amp;quot;orders&amp;quot;)
iceberg_catalog = spark._jsparkSession.sessionState().catalogManager().catalog(&amp;quot;glue_catalog&amp;quot;)
java_table = iceberg_catalog.loadTable(table_identifier)

# Construct and execute the RewriteDataFiles action programmatically
actions = jvm.org.apache.iceberg.spark.actions.SparkActions.get(spark._jsparkSession)
result = actions.rewriteDataFiles(java_table) \
    .binPack() \
    .filter(jvm.org.apache.iceberg.expressions.Expressions.greaterThanOrEqual(&amp;quot;order_date&amp;quot;, &amp;quot;2026-05-01&amp;quot;)) \
    .option(&amp;quot;target-file-size-bytes&amp;quot;, &amp;quot;536870912&amp;quot;) \
    .execute()

# Print execution summaries
print(f&amp;quot;Compacted data files count: {result.rewrittenDataFilesCount()}&amp;quot;)
print(f&amp;quot;Added data files count: {result.addedDataFilesCount()}&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This Java interface interaction allows python processes to capture granular result payloads (such as the list of files removed and added) directly in their runtime variables, enabling programmatic logging and diagnostics.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;4. Operational Cleaning: Snapshot Expiration and Orphan Files&lt;/h2&gt;
&lt;p&gt;While compaction optimizes active data layouts, tables still accumulate historical files that are no longer needed. To keep storage costs under control and maintain catalog performance, you must prune these historical assets.&lt;/p&gt;
&lt;h3&gt;Apache Iceberg Metadata Architecture&lt;/h3&gt;
&lt;p&gt;To understand how snapshot expiration works, we must inspect the internal schemas of Iceberg metadata files:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Metadata JSON File&lt;/strong&gt;: Holds the table&apos;s schema, partitioning layout, and a log of all snapshots. It references a Manifest List file for each snapshot.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manifest List File&lt;/strong&gt;: A binary Avro file that lists the manifest files associated with a specific snapshot. Its fields include:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;manifest_path&lt;/code&gt;: The URI of the manifest file.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;added_snapshot_id&lt;/code&gt;: The ID of the snapshot that added the manifest.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;added_data_files_count&lt;/code&gt;: Number of data files added in this manifest.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;partitions&lt;/code&gt;: Min/max values of partition fields within the manifest (used for query planning partition pruning).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manifest File&lt;/strong&gt;: A binary Avro file that lists individual data and delete files. Its schema contains entry states (&lt;code&gt;0&lt;/code&gt; for existing, &lt;code&gt;1&lt;/code&gt; for added, &lt;code&gt;2&lt;/code&gt; for deleted) and a &lt;code&gt;data_file&lt;/code&gt; struct:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;file_path&lt;/code&gt;: The physical location of the Parquet file on S3.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;file_format&lt;/code&gt;: The storage format (e.g. Parquet).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;record_count&lt;/code&gt;: Number of rows in the data file.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;column_sizes&lt;/code&gt;: Map of column IDs to bytes stored.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;value_counts&lt;/code&gt; and &lt;code&gt;null_value_counts&lt;/code&gt;: Data distribution stats.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;lower_bounds&lt;/code&gt; and &lt;code&gt;upper_bounds&lt;/code&gt;: Min/max values for every column chunk (used for row group skipping).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Detailed Manifest Field Properties&lt;/h3&gt;
&lt;p&gt;The fields stored inside the &lt;code&gt;data_file&lt;/code&gt; struct are crucial for metadata-level query pruning:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;lower_bounds&lt;/code&gt; and &lt;code&gt;upper_bounds&lt;/code&gt;: These maps store the minimum and maximum binary values for each column ID. When an engine receives a filter (for example, &lt;code&gt;WHERE amount &amp;gt; 500.0&lt;/code&gt;), it checks these ranges. If the file statistics indicate that the maximum value in the file is &lt;code&gt;450.0&lt;/code&gt;, the query engine skips parsing the file.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;null_value_counts&lt;/code&gt; and &lt;code&gt;nan_value_counts&lt;/code&gt;: These arrays track how many rows in the column contain null values or floating-point NaN values. If a query filters for non-null values and a file contains only nulls, the file is bypassed immediately.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;column_sizes&lt;/code&gt;: Tracks the compressed byte sizes of each column chunk, allowing the query planner to calculate the memory required to load specific columns before reading the file from S3.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Snapshot Expiration Mechanics&lt;/h3&gt;
&lt;p&gt;Iceberg&apos;s time travel feature allows you to query table states from days or weeks ago. Every time you insert, update, or delete data, Iceberg creates a new metadata snapshot. These historical snapshots reference physical data files on S3.&lt;/p&gt;
&lt;p&gt;If you keep every snapshot forever, your storage consumption will grow continuously. To limit this growth, you should establish a snapshot retention window (such as 7 days) and expire older snapshots.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Expire snapshots older than seven days, retaining at least three snapshots
spark.sql(&amp;quot;&amp;quot;&amp;quot;
    CALL glue_catalog.system.expire_snapshots(
      table =&amp;gt; &apos;analytics.orders&apos;,
      older_than =&amp;gt; CAST(current_timestamp() - INTERVAL 7 DAYS AS TIMESTAMP),
      retain_last =&amp;gt; 3
    )
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When &lt;code&gt;expire_snapshots&lt;/code&gt; runs, the metadata changes as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Identify Expirations&lt;/strong&gt;: Iceberg scans the table history log in the metadata JSON file, finding all snapshots created before the &lt;code&gt;older_than&lt;/code&gt; threshold.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Filter Last Snapshots&lt;/strong&gt;: It protects the last &lt;code&gt;retain_last&lt;/code&gt; snapshots from expiration, preserving basic history.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compare Reference Sets&lt;/strong&gt;: It loads the manifest list files for the surviving snapshots and compiles a set of all active manifest paths and data file paths. It then loads the manifest list files for the expired snapshots.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evaluate Manifest Reuse&lt;/strong&gt;: In Iceberg, multiple snapshots can share the same manifest files. If an expired snapshot references a manifest file that is also referenced by a surviving snapshot, that manifest file is retained.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reconcile Deletion List&lt;/strong&gt;: Physical Parquet data files are added to a deletion list only if they are referenced in the manifests of expired snapshots and are not referenced by any manifest in any active snapshot.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delete and Commit&lt;/strong&gt;: The coordinator deletes the orphan Parquet files and expired manifest list files from S3, writes a new metadata JSON file that removes the expired snapshots from the table&apos;s snapshot history array, and commits the pointer swap in Glue catalog.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This reference-tracking architecture prevents deleting data that is still active, while purging old files safely.&lt;/p&gt;
&lt;h3&gt;Removing Orphan Files&lt;/h3&gt;
&lt;p&gt;Under normal operations, Iceberg tracks all files. However, write failures can sometimes cause files to accumulate on S3 without being registered in the metadata catalog. For instance, if a Spark executor crashes halfway through a write transaction, it may have already written several Parquet files to the S3 bucket. Because the transaction was never committed to the Glue Catalog, these files are not linked to any metadata snapshot. They are &amp;quot;orphan files.&amp;quot;&lt;/p&gt;
&lt;p&gt;Orphan files are invisible to query engines, but they continue to consume S3 storage space and increase your cloud storage costs. You can clean them using the &lt;code&gt;remove_orphan_files&lt;/code&gt; procedure.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Purge orphan files older than three days from the orders storage location
spark.sql(&amp;quot;&amp;quot;&amp;quot;
    CALL glue_catalog.system.remove_orphan_files(
      table =&amp;gt; &apos;analytics.orders&apos;,
      older_than =&amp;gt; CAST(current_timestamp() - INTERVAL 3 DAYS AS TIMESTAMP)
    )
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;remove_orphan_files&lt;/code&gt; procedure:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Scans the physical S3 directory associated with the table.&lt;/li&gt;
&lt;li&gt;Reads the table&apos;s metadata tree to construct a list of all files that are officially registered in active snapshots.&lt;/li&gt;
&lt;li&gt;Compares the physical file list with the registered file list.&lt;/li&gt;
&lt;li&gt;Identifies any physical files on S3 that are not in the metadata list and deletes them.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We configure the &lt;code&gt;older_than&lt;/code&gt; parameter to 3 days to prevent deleting files from active, running write jobs that have not yet committed their transactions.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;5. Query Performance and the Dremio Acceleration Layer&lt;/h2&gt;
&lt;p&gt;Establishing regular compaction, snapshot expiration, and manifest rewriting routines ensures that your Iceberg tables remain in an optimal state for analytical query engines. This optimization is especially beneficial when querying data through a Dremio engine.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;                         Dremio Optimization Pipeline
+--------------------+
|   User SQL Query   |
+---------+----------+
          |
          v
+--------------------+
| Dremio Coordinator | --&amp;gt; Caches Iceberg metadata locally.
+---------+----------+     Plans queries in milliseconds, skipping S3 lookups.
          |
          v
+--------------------+
|   Dremio Executor  | --&amp;gt; Scans compacted Parquet files column-by-column.
+---------+----------+     Loads values directly into Apache Arrow memory.
          |
          +--------------&amp;gt; Applies cached positional delete lists in-memory.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The Dremio engine is an open lakehouse query accelerator designed for interactive analytics. It bypasses traditional storage bottlenecks using several structural features.&lt;/p&gt;
&lt;h3&gt;1. Vectorized Memory Scans via Apache Arrow&lt;/h3&gt;
&lt;p&gt;Dremio uses Apache Arrow as its internal in-memory execution format. Apache Arrow represents data in column arrays rather than rows, matching the physical layout of Parquet files.&lt;/p&gt;
&lt;p&gt;When Dremio reads compacted Parquet files, it loads the column data chunks directly into Arrow memory buffers. Because the format is identical, the engine avoids the CPU overhead of serializing and deserializing rows. When tables are compacted into clean 512 MB files, Dremio read tasks can stream column segments into memory at hardware speeds.&lt;/p&gt;
&lt;h3&gt;2. Caching Positional Delete Files&lt;/h3&gt;
&lt;p&gt;In Iceberg tables, row-level updates and deletes are often managed using the Merge-on-Read (MoR) strategy. Instead of rewriting an entire Parquet file to delete a single row, the writer creates a small &amp;quot;positional delete file&amp;quot; listing the file path and row index of the deleted record.&lt;/p&gt;
&lt;p&gt;When querying the table, standard query engines must read the base data files and the delete files, join them in memory, and filter out the deleted rows. If a table contains thousands of uncompacted delete files, this join operation becomes a massive CPU bottleneck.&lt;/p&gt;
&lt;p&gt;The Dremio engine accelerates this by caching positional delete files in memory. Dremio loads these deleted row indexes into an active coordinator cache. When an executor scans a base Parquet file, Dremio applies the cached delete index mask in memory, avoiding the need to load and parse delete files from S3 for every query.&lt;/p&gt;
&lt;h3&gt;3. Local Coordinator Metadata Caching&lt;/h3&gt;
&lt;p&gt;Query planning in Iceberg requires traversing the metadata tree: reading the catalog pointer, loading the metadata JSON, parsing the manifest list, and reading the manifest files. If the catalog is remote and S3 network latency is high, this planning phase can take several seconds.&lt;/p&gt;
&lt;p&gt;Dremio eliminates this overhead by maintaining a local metadata cache on its coordinator nodes. When a query is executed, Dremio compares the table version pointer in the Glue Catalog. If the version has not changed, Dremio plans the query using its local metadata cache, reducing query startup latency from seconds to milliseconds.&lt;/p&gt;
&lt;h3&gt;4. Data Reflections Refresh Management&lt;/h3&gt;
&lt;p&gt;Dremio includes an automatic query acceleration feature called Data Reflections. Reflections are physically optimized representations of datasets stored as Parquet files on S3.&lt;/p&gt;
&lt;p&gt;For example, we can configure an Aggregation Reflection on our joined &lt;code&gt;analytics.orders&lt;/code&gt; and &lt;code&gt;analytics.customers&lt;/code&gt; dataset. When a user runs a query, Dremio&apos;s optimizer (which utilizes Apache Calcite) parses the query into a logical algebra tree. Calcite compares this tree with the structures of active Reflections. If a match is found, the optimizer rewrites the query plan to scan the Reflection rather than the raw tables, returning results in milliseconds.&lt;/p&gt;
&lt;p&gt;For these Reflections to operate efficiently, the underlying Iceberg tables must be regularly compacted. If the base tables are fragmented, Dremio&apos;s Reflection refresh jobs take longer to run, consuming excess cluster resources.&lt;/p&gt;
&lt;p&gt;Dremio manages Reflections using scheduled refresh cycles. When a reflection is scheduled for update, Dremio checks the metadata JSON log. If the changes are append-only (new files added), Dremio can execute an incremental refresh, reading only the newly added Parquet files and appending their results to the Reflection storage location. However, if a compaction job or row-level update has rewritten the base files (changing the physical file layouts), Dremio must execute a full refresh, reading the entire base table and rebuilding the Reflection Parquet files. Keeping tables compacted ensures that full refreshes are completed quickly without impacting database cluster resources.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;6. Maintenance Scheduling and S3 Storage Tiering Integration&lt;/h2&gt;
&lt;p&gt;To keep your open lakehouse running smoothly, you should automate maintenance tasks using a scheduler like Apache Airflow or AWS Glue Workflows.&lt;/p&gt;
&lt;h3&gt;Storage Optimization via S3 Intelligent-Tiering&lt;/h3&gt;
&lt;p&gt;While compaction and cleanups manage file volume, long-term storage costs can still accumulate. Many analytical datasets have a strict access decay curve: recent data (for example, the last 30 days) is queried constantly, while historical data (older than 90 days) is rarely accessed but must be retained for compliance or year-over-year reporting.&lt;/p&gt;
&lt;p&gt;To optimize costs without introducing operational complexity, you should configure Amazon S3 Intelligent-Tiering on your lakehouse bucket. S3 Intelligent-Tiering automatically monitors access patterns at the object level and transitions inactive Parquet files to lower-cost access tiers:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Frequent Access Tier&lt;/strong&gt;: Default storage state. Data is read here at regular rates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Infrequent Access Tier&lt;/strong&gt;: If an object is not accessed for 30 consecutive days, S3 moves it here, saving up to 40 percent on storage costs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Archive Instant Access Tier&lt;/strong&gt;: If an object remains unaccessed for 90 consecutive days, it transitions here, saving up to 68 percent.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Because Iceberg references exact file paths in metadata, query engines continue to access objects directly without changes, and S3 handles the tier promotion instantly if an old partition is suddenly queried. By combining Iceberg&apos;s compaction (which ensures files are large and optimized for tier transitions) with S3 Intelligent-Tiering, you build an automated, low-cost long-term storage system.&lt;/p&gt;
&lt;h3&gt;Recommended Scheduling Checklist&lt;/h3&gt;
&lt;h4&gt;Daily Tasks&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Bin-Packing Compaction&lt;/strong&gt;: Run daily bin-packing on highly active partitions in the &lt;code&gt;analytics.orders&lt;/code&gt; table to merge small files written by streaming ingest pipelines during the day.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitor S3 Throttling&lt;/strong&gt;: Review CloudWatch metrics for S3 &lt;code&gt;5xx&lt;/code&gt; errors. If throttling occurs, verify that Iceberg&apos;s prefix hashing features are enabled.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Weekly Tasks&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Z-Order Compaction&lt;/strong&gt;: Run sort-based or Z-Order compaction on the &lt;code&gt;analytics.orders&lt;/code&gt; table during off-peak hours (such as over the weekend) to reorganize the data by &lt;code&gt;customer_id&lt;/code&gt; and &lt;code&gt;order_date&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snapshot Expiration&lt;/strong&gt;: Run &lt;code&gt;expire_snapshots&lt;/code&gt; with a 7-day retention window to clean up historical Parquet files and release S3 storage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manifest Compaction&lt;/strong&gt;: Run &lt;code&gt;rewrite_manifests&lt;/code&gt; to consolidate metadata files and maintain fast query planning.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Monthly Tasks&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Orphan File Cleanup&lt;/strong&gt;: Run &lt;code&gt;remove_orphan_files&lt;/code&gt; with an &lt;code&gt;older_than&lt;/code&gt; threshold of 3 days to purge abandoned files from failed writes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Review Table Properties&lt;/strong&gt;: Audit table metadata retention properties to ensure that snapshot retention parameters match business compliance needs.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;7. Summary&lt;/h2&gt;
&lt;p&gt;Building an open lakehouse on AWS using Apache Iceberg, the AWS Glue Catalog, and S3 provides a reliable, cost-efficient, and scalable foundation for enterprise data platforms. However, maintaining high performance requires regular attention to the physical layout of your data.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Implement &lt;strong&gt;bin-packing compaction&lt;/strong&gt; daily to merge small files written by streaming ingestion pipelines.&lt;/li&gt;
&lt;li&gt;Run &lt;strong&gt;Z-Order compaction&lt;/strong&gt; weekly to group related rows spatially, enabling efficient column group skipping during queries.&lt;/li&gt;
&lt;li&gt;Execute &lt;strong&gt;snapshot expiration&lt;/strong&gt; and &lt;strong&gt;orphan file cleanup&lt;/strong&gt; regularly to release cloud storage and lower S3 costs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Optimizing your physical storage layouts ensures that query engines like &lt;strong&gt;AWS Athena&lt;/strong&gt; can execute ad-hoc analysis efficiently, and that the &lt;strong&gt;Dremio engine&lt;/strong&gt; can leverage its vectorized Arrow execution, metadata cache, and Data Reflections to deliver sub-second query performance for interactive analytical applications.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Apache Iceberg with Spark: Create, MERGE, Upsert, and Evolve Tables End to End</title><link>https://iceberglakehouse.com/posts/2026-05-22-apache-iceberg-spark-dml-evolution/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-22-apache-iceberg-spark-dml-evolution/</guid><description>
The open data lakehouse architecture separates query execution from physical data storage, allowing organizations to deploy specialized engines for d...</description><pubDate>Fri, 22 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The open data lakehouse architecture separates query execution from physical data storage, allowing organizations to deploy specialized engines for different workloads. Within this ecosystem, Apache Spark acts as a powerful processing engine for large scale data transformation, batch ingestion, and complex analytical pipelines. However, running Spark directly on top of legacy data lakes using raw file formats like Parquet or JSON introduces significant operational challenges. Without a transactional catalog, concurrent writes can corrupt data, schema changes require rewriting complete tables, and listing directories across cloud storage introduces high latency.&lt;/p&gt;
&lt;p&gt;Apache Iceberg addresses these limitations by providing a logical table metadata layer. It enables acid transaction guarantees, snapshot isolation, hidden partitioning, and in-place schema evolution. When integrated with Apache Spark, Iceberg allows data engineers to execute transactional writes, perform upserts using SQL queries, and alter table layouts without interrupting downstream readers.&lt;/p&gt;
&lt;p&gt;This guide provides a comprehensive walkthrough of integrating Apache Spark with Apache Iceberg. We explore catalog configuration, schema setup, transactional write patterns, and schema evolution. We also analyze the differences between Copy on Write and Merge on Read table modes, showing how high performance query engines like Dremio accelerate read execution over Spark written tables.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;1. Integrating Apache Spark and Apache Iceberg&lt;/h2&gt;
&lt;p&gt;To run Apache Iceberg operations inside Apache Spark, the Spark engine must interface with the Iceberg metadata library and catalog systems. This integration is handled by the Spark DataSourceV2 (DSv2) API. The DSv2 framework allows Spark to delegate metadata tracking, file routing, and transaction commits directly to Iceberg. This delegation bypasses Spark&apos;s legacy file writer interfaces, ensuring that Spark can write data safely while Iceberg coordinates the transaction.&lt;/p&gt;
&lt;h3&gt;Spark Extensions and Catalogs&lt;/h3&gt;
&lt;p&gt;Integrating Iceberg requires configuring Spark to utilize specialized extensions. The principal extension is &lt;code&gt;org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions&lt;/code&gt;. This extension modifies Spark&apos;s Catalyst Optimizer, adding support for Iceberg specific SQL statements such as &lt;code&gt;MERGE INTO&lt;/code&gt;, &lt;code&gt;CALL&lt;/code&gt; procedures, and alter commands.&lt;/p&gt;
&lt;p&gt;Additionally, you must define one or more catalogs. Catalogs track the current state of tables by maintaining a pointer to the active metadata JSON file. Common catalog implementations include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;REST Catalog&lt;/strong&gt;: The standard, engine neutral REST interface (such as Apache Polaris or Project Nessie) that manages table pointers via secure HTTP endpoints.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AWS Glue Catalog&lt;/strong&gt;: A cloud native service that tracks table locations and schema structures within the AWS environment.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hive Metastore&lt;/strong&gt;: The legacy catalog pattern that uses a relational database to track table directory structures.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hadoop Catalog&lt;/strong&gt;: A file system based catalog that uses folder paths and metadata files to track table pointers directly.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By defining these catalogs in the Spark configuration, you allow Spark to resolve table names, fetch active schemas, and commit transaction snapshots.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;2. Configuring the PySpark Environment&lt;/h2&gt;
&lt;p&gt;To construct a local or cloud based development environment, you must pass specific configuration parameters to the SparkSession builder. The configuration details the jar packages, catalog mappings, and storage directories.&lt;/p&gt;
&lt;p&gt;The following Python script illustrates how to initialize a PySpark session configured to use Apache Iceberg with a local Hadoop catalog.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from pyspark.sql import SparkSession

# Initialize SparkSession with Iceberg extensions and a local Hadoop catalog
spark = SparkSession.builder \
    .appName(&amp;quot;IcebergSparkDMLEvolution&amp;quot;) \
    .config(&amp;quot;spark.jars.packages&amp;quot;, &amp;quot;org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2&amp;quot;) \
    .config(&amp;quot;spark.sql.extensions&amp;quot;, &amp;quot;org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.local&amp;quot;, &amp;quot;org.apache.iceberg.spark.SparkCatalog&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.local.type&amp;quot;, &amp;quot;hadoop&amp;quot;) \
    .config(&amp;quot;spark.sql.catalog.local.warehouse&amp;quot;, &amp;quot;/tmp/warehouse&amp;quot;) \
    .getOrCreate()
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Explaining Key Configurations&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;spark.jars.packages&lt;/code&gt;&lt;/strong&gt;: Downloads the Iceberg Spark runtime jar file, which matches the Spark version (3.5) and Scala version (2.12). This package contains the reader/writer implementations, metadata parser, and SQL extensions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;spark.sql.extensions&lt;/code&gt;&lt;/strong&gt;: Registers the Iceberg extensions with Spark&apos;s query parser, enabling SQL command modifications.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;spark.sql.catalog.local&lt;/code&gt;&lt;/strong&gt;: Defines a new catalog namespace named &lt;code&gt;local&lt;/code&gt;. You can reference tables in this catalog using the prefix &lt;code&gt;local.db_name.table_name&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;spark.sql.catalog.local.type&lt;/code&gt;&lt;/strong&gt;: Configures the catalog to run as a Hadoop catalog, which reads and writes metadata files directly on the local file system.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;spark.sql.catalog.local.warehouse&lt;/code&gt;&lt;/strong&gt;: Sets the physical directory path where table folders, data files, and metadata logs are stored.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For cloud deployments using AWS S3, you would append configurations to use &lt;code&gt;S3FileIO&lt;/code&gt; instead of standard file implementations, passing credentials and endpoint URLs as shown below:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Cloud S3 configuration extensions (optional)
# .config(&amp;quot;spark.sql.catalog.local.io-impl&amp;quot;, &amp;quot;org.apache.iceberg.aws.s3.S3FileIO&amp;quot;)
# .config(&amp;quot;spark.sql.catalog.local.s3.endpoint&amp;quot;, &amp;quot;https://s3.amazonaws.com&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h2&gt;3. Designing Data Schemas and Table Creation&lt;/h2&gt;
&lt;p&gt;To illustrate DML writes and schema modifications, we establish a standard relational database layout. We define the &lt;code&gt;analytics.orders&lt;/code&gt; and &lt;code&gt;analytics.customers&lt;/code&gt; tables. These tables track customer orders and profiles, providing a consistent reference schema for our SQL and PySpark code blocks.&lt;/p&gt;
&lt;h3&gt;Table Schemas&lt;/h3&gt;
&lt;p&gt;The database layout is organized around two key entities:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;analytics.customers&lt;/code&gt;&lt;/strong&gt;: Stores profile information including identifier, name, email address, and country.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;analytics.orders&lt;/code&gt;&lt;/strong&gt;: Stores transaction history including order ID, customer reference ID, transaction date, order amount, and status.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The following Spark SQL script creates these tables inside the local catalog namespace.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Create the customers table */
CREATE TABLE local.analytics.customers (
    customer_id BIGINT,
    name STRING,
    email STRING,
    country STRING
) USING iceberg;

/* Create the orders table partitioned by month and bucketed by customer */
CREATE TABLE local.analytics.orders (
    order_id BIGINT,
    customer_id BIGINT,
    order_date DATE,
    amount DECIMAL(10, 2),
    status STRING
) USING iceberg
PARTITIONED BY (month(order_date), bucket(customer_id, 16));
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Partitioning Strategy&lt;/h3&gt;
&lt;p&gt;In the &lt;code&gt;analytics.orders&lt;/code&gt; table creation statement, we configure a partitioning layout using partition transforms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;month(order_date)&lt;/code&gt;&lt;/strong&gt;: Iceberg extracts the year and month from the date, grouping data files into logical partitions (such as &lt;code&gt;2026-05&lt;/code&gt;). This transform speeds up time series queries and prevents partition granularity from becoming too small.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;bucket(customer_id, 16)&lt;/code&gt;&lt;/strong&gt;: Iceberg hashes the customer ID and distributes records across 16 hash buckets. This transform ensures that files are distributed evenly, which optimizes parallel processing during join queries.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Because Iceberg uses hidden partitioning, these partition fields are computed in metadata. Downstream query writers do not need to query the derived partition fields directly, which prevents common filter errors and avoids directory scanning latency.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;4. Writing Data: Append and Overwrite Operations&lt;/h2&gt;
&lt;p&gt;Once tables are created, you can write data using Spark&apos;s SQL interface or Spark DataFrame APIs. In Spark 3.x, DataFrame writes are handled using the DataFrameWriter V2 API, which provides a type safe interface for catalog operations.&lt;/p&gt;
&lt;p&gt;The following Python code illustrates how to load transactional datasets into memory and write them to the catalog tables.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Prepare seed data for customers
customers_data = [
    (101, &amp;quot;Alice Smith&amp;quot;, &amp;quot;alice@example.com&amp;quot;, &amp;quot;USA&amp;quot;),
    (102, &amp;quot;Bob Jones&amp;quot;, &amp;quot;bob@example.com&amp;quot;, &amp;quot;Canada&amp;quot;),
    (103, &amp;quot;Charlie Green&amp;quot;, &amp;quot;charlie@example.com&amp;quot;, &amp;quot;UK&amp;quot;)
]
customers_df = spark.createDataFrame(customers_data, [&amp;quot;customer_id&amp;quot;, &amp;quot;name&amp;quot;, &amp;quot;email&amp;quot;, &amp;quot;country&amp;quot;])

# Append customer records to the customers table
customers_df.writeTo(&amp;quot;local.analytics.customers&amp;quot;).append()

# Prepare seed data for orders
orders_data = [
    (1, 101, &amp;quot;2026-05-15&amp;quot;, 150.50, &amp;quot;Shipped&amp;quot;),
    (2, 102, &amp;quot;2026-05-20&amp;quot;, 89.99, &amp;quot;Processing&amp;quot;),
    (3, 103, &amp;quot;2026-05-22&amp;quot;, 210.00, &amp;quot;Completed&amp;quot;)
]
# Convert order_date column explicitly to date type
from pyspark.sql.functions import col
orders_df = spark.createDataFrame(orders_data, [&amp;quot;order_id&amp;quot;, &amp;quot;customer_id&amp;quot;, &amp;quot;order_date&amp;quot;, &amp;quot;amount&amp;quot;, &amp;quot;status&amp;quot;])
orders_df = orders_df.withColumn(&amp;quot;order_date&amp;quot;, col(&amp;quot;order_date&amp;quot;).cast(&amp;quot;date&amp;quot;))

# Append order records to the orders table
orders_df.writeTo(&amp;quot;local.analytics.orders&amp;quot;).append()
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Transaction Isolation and Commits&lt;/h3&gt;
&lt;p&gt;Every append or write operation in Apache Iceberg represents a single, atomic transaction. When Spark tasks finish writing files to object storage, the driver compiles a list of new data files and attempts to commit them by writing a new metadata JSON file.&lt;/p&gt;
&lt;p&gt;This commit process follows optimistic concurrency control rules. If another process commits a change during the write task, Spark retries the transaction by reading the updated catalog pointer and applying the writes to the new state. This design guarantees that readers always observe consistent snapshots, preventing dirty reads or partial writes from exposing corrupted records.&lt;/p&gt;
&lt;h3&gt;Dynamic Partition Overwrites&lt;/h3&gt;
&lt;p&gt;When updating data tables, data engineers often need to overwrite data within specific partition ranges. In legacy Hive structures, overwriting a partition required deleting folder directories manually, which risked data loss if queries failed mid process.&lt;/p&gt;
&lt;p&gt;Iceberg resolves this using metadata overwrites. By enabling dynamic partition overwrite mode, Spark replaces data files only in the partitions affected by the incoming write set:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Configure Spark to use dynamic partition overwrite mode
spark.conf.set(&amp;quot;spark.sql.sources.partitionOverwriteMode&amp;quot;, &amp;quot;dynamic&amp;quot;)

# Overwrite orders data for May 2026 without altering historical files in other months
new_orders_df = spark.createDataFrame([
    (2, 102, &amp;quot;2026-05-20&amp;quot;, 95.00, &amp;quot;Shipped&amp;quot;) # Updated record
], [&amp;quot;order_id&amp;quot;, &amp;quot;customer_id&amp;quot;, &amp;quot;order_date&amp;quot;, &amp;quot;amount&amp;quot;, &amp;quot;status&amp;quot;])
new_orders_df = new_orders_df.withColumn(&amp;quot;order_date&amp;quot;, col(&amp;quot;order_date&amp;quot;).cast(&amp;quot;date&amp;quot;))

new_orders_df.writeTo(&amp;quot;local.analytics.orders&amp;quot;).overwritePartitions()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When this operation completes, Iceberg registers a new snapshot. The table pointers are updated so that queries for May 2026 resolve to the new file layout, while older partitions (such as April or March) remain unchanged and active.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;5. Transactional Upserts with MERGE INTO&lt;/h2&gt;
&lt;p&gt;Data pipelines often process streaming updates or change data capture logs that must be integrated into target tables. Performing these modifications row by row in legacy data lakes required rewriting entire tables. Iceberg solves this by supporting the Spark SQL &lt;code&gt;MERGE INTO&lt;/code&gt; statement.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;MERGE INTO&lt;/code&gt; statement allows engineers to perform upserts, modifying matching records and inserting new records in a single transactional step.&lt;/p&gt;
&lt;h3&gt;SQL Upsert Example&lt;/h3&gt;
&lt;p&gt;The following SQL command merges an incremental update dataset into the &lt;code&gt;analytics.orders&lt;/code&gt; table, updating order status values and inserting new transactions.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Create a staging table containing incremental updates */
CREATE TABLE local.analytics.orders_stage (
    order_id BIGINT,
    customer_id BIGINT,
    order_date DATE,
    amount DECIMAL(10, 2),
    status STRING
) USING iceberg;

/* Insert sample updates into staging */
INSERT INTO local.analytics.orders_stage VALUES
(2, 102, CAST(&apos;2026-05-20&apos; AS DATE), 95.00, &apos;Completed&apos;), /* Update status of order 2 */
(4, 101, CAST(&apos;2026-05-22&apos; AS DATE), 450.00, &apos;Processing&apos;); /* Insert new order 4 */

/* Merge staging records into target table */
MERGE INTO local.analytics.orders AS target
USING local.analytics.orders_stage AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN
  UPDATE SET target.amount = source.amount, target.status = source.status
WHEN NOT MATCHED THEN
  INSERT (order_id, customer_id, order_date, amount, status)
  VALUES (source.order_id, source.customer_id, source.order_date, source.amount, source.status);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Matching Logic and Predicate Evaluation&lt;/h3&gt;
&lt;p&gt;When executing a &lt;code&gt;MERGE INTO&lt;/code&gt; query, Spark translates the SQL logic into a physical plan:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Join Predicates&lt;/strong&gt;: Spark performs a join operation between the target table and the source staging table using the specified key column (&lt;code&gt;order_id&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row Classification&lt;/strong&gt;: Rows that match the join key are routed to the update engine block, while unmatched staging records are routed to the insert writer block.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metadata Alignment&lt;/strong&gt;: When the writes finish, Iceberg generates a new metadata snapshot that incorporates both the modified files and the newly appended files in a single atomic transaction.&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h2&gt;6. Under the Hood: Copy on Write vs. Merge on Read&lt;/h2&gt;
&lt;p&gt;To balance write latency and read execution speed, Apache Iceberg supports two distinct write modes: &lt;strong&gt;Copy on Write (CoW)&lt;/strong&gt; and &lt;strong&gt;Merge on Read (MoR)&lt;/strong&gt;. You can configure these modes on a per table basis using table properties.&lt;/p&gt;
&lt;h3&gt;Copy on Write Mode&lt;/h3&gt;
&lt;p&gt;Copy on Write is the default mode for Iceberg tables. When a write task updates or deletes rows inside a data file, the write engine reads the source data file, applies the updates in memory, and writes the entire data set back as a new Parquet data file.&lt;/p&gt;
&lt;p&gt;This process isolates mutations at the file level:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Query planning remains simple. Query engines scan only the active data files without performing runtime join logic. This configuration delivers optimal read performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Significant write amplification. Updating a single row inside a 512 MB Parquet file requires writing a new 512 MB file, which consumes write I/O and storage resources.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Merge on Read Mode&lt;/h3&gt;
&lt;p&gt;Merge on Read minimizes write amplification by leaving the source data files unmodified during an update or delete. Instead of rewriting the data file, the write engine writes the changed data rows into new data files and records the location of modified rows inside separate &lt;strong&gt;delete files&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;These delete files are cataloged in the manifest metadata and are merged with data files at query execution time:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Writes are fast and require minimal I/O. This configuration is ideal for high frequency streaming ingestion or live CDC feeds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Increased read latency. Query engines must read the delete files and merge them with data files dynamically, which consumes CPU and memory.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Positional vs. Equality Deletes&lt;/h3&gt;
&lt;p&gt;Merge on Read supports two formats for tracking deleted or updated records:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Positional Deletes&lt;/strong&gt;: The delete file contains the target data file path and the absolute row position offsets (indexes) of the deleted rows. This format is efficient because readers can seek directly to the offsets during a file scan.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Equality Deletes&lt;/strong&gt;: The delete file contains the value of key columns (such as &lt;code&gt;order_id = 2&lt;/code&gt;). When reading, the engine must perform a join operation on the key columns, which requires building a hash table in memory.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Dremio Positional Delete Caching&lt;/h3&gt;
&lt;p&gt;To minimize the read latency associated with Merge on Read tables, high performance query engines like Dremio implement &lt;strong&gt;Positional Delete Caching&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;When a query scans a partition containing positional delete files, Dremio&apos;s Sabot execution engine decodes the row offsets and caches them as in memory bitmaps on the executor nodes. As the columnar reader scans data blocks from Parquet files, it references this delete bitmap directly. The engine skips the deleted row indexes during the vectorized Apache Arrow buffer projection.&lt;/p&gt;
&lt;p&gt;This caching design eliminates the need to read delete files repeatedly from object storage for concurrent queries. It also avoids row by row join evaluations, allowing MoR tables to achieve sub second query latencies close to CoW tables.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;7. Schema Evolution on the Fly in Spark&lt;/h2&gt;
&lt;p&gt;A common source of failures in legacy architectures is managing schema changes. If a database schema changes, downstream pipelines often break. Apache Iceberg solves this by using immutable column IDs, allowing safe schema evolution without physical data modifications.&lt;/p&gt;
&lt;p&gt;In Spark, schema changes can be executed using SQL commands or automatically during PySpark DataFrame writes by enabling schema merging.&lt;/p&gt;
&lt;h3&gt;SQL Alterations in Spark&lt;/h3&gt;
&lt;p&gt;You can execute alterations on the &lt;code&gt;analytics.orders&lt;/code&gt; table directly using Spark SQL. These commands modify metadata configuration records without rewriting data files:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Add a new column to track discount rates */
ALTER TABLE local.analytics.orders ADD COLUMN discount_rate DOUBLE;

/* Rename the status column to transaction_status */
ALTER TABLE local.analytics.orders RENAME COLUMN status TO transaction_status;

/* Drop the discount_rate column from the active layout */
ALTER TABLE local.analytics.orders DROP COLUMN discount_rate;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because Iceberg tracks fields using unique column IDs, dropping a column does not require removing bytes from physical files. The catalog removes the column ID from the active schema definition, and readers ignore the column block during file scans.&lt;/p&gt;
&lt;h3&gt;Schema Merging during DataFrame Writes&lt;/h3&gt;
&lt;p&gt;If your applications produce datasets with varying schemas, you can configure Spark to merge these changes automatically into the target table during write operations by setting the &lt;code&gt;mergeSchema&lt;/code&gt; option to &lt;code&gt;true&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Create a DataFrame containing an evolved customers schema
evolved_data = [
    (104, &amp;quot;David Miller&amp;quot;, &amp;quot;david@example.com&amp;quot;, &amp;quot;Germany&amp;quot;, &amp;quot;Gold&amp;quot;) # Contains new column &apos;tier&apos;
]
evolved_df = spark.createDataFrame(evolved_data, [&amp;quot;customer_id&amp;quot;, &amp;quot;name&amp;quot;, &amp;quot;email&amp;quot;, &amp;quot;country&amp;quot;, &amp;quot;tier&amp;quot;])

# Append data and merge the new &apos;tier&apos; column into the analytics.customers table
evolved_df.writeTo(&amp;quot;local.analytics.customers&amp;quot;) \
    .option(&amp;quot;mergeSchema&amp;quot;, &amp;quot;true&amp;quot;) \
    .append()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When this write task completes, Iceberg reads the incoming DataFrame schema, detects the new column &lt;code&gt;tier&lt;/code&gt;, assigns it a new unique column ID, appends it to the active schema, and commits the transaction. Older files are not rewritten. When read, they return null values for the &lt;code&gt;tier&lt;/code&gt; field.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;8. Spark Performance Tuning for Iceberg Writes&lt;/h2&gt;
&lt;p&gt;To prevent file fragmentation and ensure optimal query performance, data engineers must tune how Spark writes data files to Iceberg tables.&lt;/p&gt;
&lt;h3&gt;Write Distribution Modes&lt;/h3&gt;
&lt;p&gt;When Spark writes data across multiple parallel tasks, it can distribute rows arbitrarily. This arbitrary distribution can lead to a single task writing small files to hundreds of partitions, which degrades storage performance.&lt;/p&gt;
&lt;p&gt;You can control this behavior using the &lt;code&gt;write.distribution-mode&lt;/code&gt; table property:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Configure write distribution on the orders table */
ALTER TABLE local.analytics.orders SET TBLPROPERTIES (
    &apos;write.distribution-mode&apos; = &apos;hash&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The available write distribution modes are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;none&lt;/code&gt;&lt;/strong&gt;: Spark writes rows directly without repartitioning. This mode has low write latency but can generate thousands of small files if rows are not sorted.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;hash&lt;/code&gt;&lt;/strong&gt;: Spark clusters rows by partition keys using a hash partitioner before writing them. This mode minimizes the number of active file writers and prevents small file fragmentation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;sort&lt;/code&gt;&lt;/strong&gt;: Spark sorts the rows by partition keys and sorting specifications before writing. This mode optimizes Parquet column compression and improves read speeds, but increases CPU usage during writes.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Target File Sizes&lt;/h3&gt;
&lt;p&gt;You can configure the target file size for writes using table properties. For Parquet files, a target size between 128 MB and 512 MB is recommended to balance query parallelization and file listing overhead:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Set the target file size to 256 MB */
ALTER TABLE local.analytics.orders SET TBLPROPERTIES (
    &apos;write.parquet.compression-codec&apos; = &apos;zstd&apos;,
    &apos;write.target-file-size-bytes&apos; = &apos;268435456&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By using the Z standard (zstd) compression codec and setting a target file size of 256 MB, you ensure that Spark writes highly compressed files that are optimal for cloud object storage scans.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;9. Querying and Accelerating Spark-Written Tables via Dremio&lt;/h2&gt;
&lt;p&gt;While Apache Spark is optimized for batch writing and heavy transformations, serving analytical reports and BI dashboards requires a query engine that can deliver sub second response times. Once Spark commits data to the Iceberg catalog, Dremio can query the tables directly.&lt;/p&gt;
&lt;p&gt;Dremio accelerates reads over evolved and partitioned Iceberg tables using key architectural optimizations:&lt;/p&gt;
&lt;h3&gt;The Sabot Vectorized Engine&lt;/h3&gt;
&lt;p&gt;Dremio bypasses JVM execution pipelines by loading Parquet data directly into in memory Apache Arrow record batches. The Sabot engine processes these columnar Arrow arrays using CPU register vectorization.&lt;/p&gt;
&lt;p&gt;If Spark has evolved a table schema, Dremio&apos;s vectorized Arrow projector handles the changes in memory. For missing columns in older files, Dremio projects null vectors directly. For promoted types, it executes vectorized sign extensions in the CPU registers. This design avoids row by row serialization loops, maintaining fast query execution.&lt;/p&gt;
&lt;h3&gt;Dynamic Metadata Caching&lt;/h3&gt;
&lt;p&gt;On cloud storage networks, listing directories to plan queries introduces high latency. Dremio eliminates this overhead by caching the Iceberg metadata JSON files, partition specifications, and manifest lists locally on its coordinator nodes.&lt;/p&gt;
&lt;p&gt;When a query is submitted, Dremio reads the cached metadata to locate the target Parquet files. This local metadata resolution allows Dremio to plan queries in milliseconds, avoiding remote HTTP storage calls.&lt;/p&gt;
&lt;h3&gt;Data Reflections&lt;/h3&gt;
&lt;p&gt;Dremio&apos;s Data Reflections provide pre-computed materializations (stored as Iceberg tables) that Dremio queries automatically to accelerate analytical workloads.&lt;/p&gt;
&lt;p&gt;If Spark modifies a table schema or partitioning specification, Dremio&apos;s query compiler automatically updates the mapping logic. The compiler determines whether the reflection can satisfy the query predicate, rewriting execution paths on the fly.&lt;/p&gt;
&lt;p&gt;This automatic redirection delivers sub second query latencies for BI dashboards without requiring database administrators to rebuild materializations or update user SQL queries.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;10. Deep Dive: Catalyst Optimizer Integrations and DSv2 Internals&lt;/h2&gt;
&lt;p&gt;Understanding how Spark integrates with Iceberg at a compiler level is crucial for building resilient data architectures. When you add the &lt;code&gt;IcebergSparkSessionExtensions&lt;/code&gt; to your Spark configuration, Spark replaces its standard logical planning strategies with custom Iceberg implementations.&lt;/p&gt;
&lt;p&gt;In standard Spark operations, writing to a file format like Parquet relies on the legacy DataSourceV1 API, which executes writes row-by-row through an execution plan that is opaque to the transactional store. Under the DataSourceV2 (DSv2) framework, the write process is negotiated between Spark&apos;s Catalyst Optimizer and the Iceberg library through formal interfaces.&lt;/p&gt;
&lt;p&gt;When Spark compiles a write plan, the Catalyst Optimizer evaluates the query. If the target is an Iceberg table, it transforms the logical plan into a &lt;code&gt;WriteToDataSourceV2&lt;/code&gt; node. This node coordinates with Iceberg&apos;s &lt;code&gt;SparkWrite&lt;/code&gt; class to determine how data will be cataloged and distributed across executor nodes.&lt;/p&gt;
&lt;p&gt;During the execution phase, Spark tasks running on separate executors write their partition blocks to temporary data files in object storage. Each task generates a list of &lt;code&gt;DataFile&lt;/code&gt; metadata entries containing the physical file paths, file sizes, partition values, row counts, and column-level min/max statistics. These metadata records are returned to the driver node at the end of the write stage.&lt;/p&gt;
&lt;p&gt;Once the driver collects all task results, it initiates the commit phase. The Iceberg transaction manager updates the table metadata by executing the following actions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Read Active Snapshot&lt;/strong&gt;: The catalog retrieves the current metadata file pointer to resolve the table state.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verify Concurrency&lt;/strong&gt;: Under optimistic concurrency rules, the manager checks if another writer has committed a new snapshot since this write task began. If a conflict is detected, Iceberg attempts to reconcile the change (for example, verifying if an append is non-overlapping with a concurrent delete).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generate Manifest File&lt;/strong&gt;: A new manifest file is created to catalog the newly written Parquet files along with their column-level statistics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Update Manifest List&lt;/strong&gt;: Iceberg writes a new manifest list file, which acts as an index pointing to all active manifest files for the table.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write Table Metadata&lt;/strong&gt;: A new table metadata JSON file is written, containing the schema configuration, partition spec, and the reference ID of the new snapshot.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Swap Catalog Pointer&lt;/strong&gt;: The catalog performs an atomic swap operation (such as a database compare-and-swap or filesystem rename) to update the current pointer to the new metadata JSON file.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By using this structured commit process, Iceberg ensures that Spark writes are fully transaction-safe. The physical Parquet files are only visible to readers after the catalog pointer swap completes. If a Spark task fails mid-execution, the temporary files are ignored by readers and cleaned up during orphan file maintenance, preventing partial or corrupted data from corrupting analytical queries.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;11. Multi-Catalog Configurations and Cloud Setup Nuances&lt;/h2&gt;
&lt;p&gt;In enterprise environments, data lakes rarely span a single catalog. It is common to query data across multiple environments, such as integrating an AWS Glue catalog with a local developer catalog or an open REST catalog like Apache Polaris.&lt;/p&gt;
&lt;p&gt;Spark&apos;s catalog configuration rules allow you to define multiple active catalogs within the same session. By prefixing catalog names to Spark properties, you configure independent endpoints, authentication credentials, and storage backends.&lt;/p&gt;
&lt;p&gt;The following configurations illustrate how to register an AWS Glue catalog, a Nessie REST catalog, and a local Hadoop catalog in a single SparkSession configuration setup:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-properties&quot;&gt;# Local Hadoop Catalog Config
spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.local.type=hadoop
spark.sql.catalog.local.warehouse=/tmp/warehouse

# AWS Glue Catalog Config
spark.sql.catalog.aws_glue=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.aws_glue.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.aws_glue.warehouse=s3://my-enterprise-bucket/warehouse
spark.sql.catalog.aws_glue.io-impl=org.apache.iceberg.aws.s3.S3FileIO

# Project Nessie REST Catalog Config
spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog
spark.sql.catalog.nessie.uri=http://localhost:19120/api/v1
spark.sql.catalog.nessie.ref=main
spark.sql.catalog.nessie.warehouse=s3://my-enterprise-bucket/nessie-warehouse
spark.sql.catalog.nessie.io-impl=org.apache.iceberg.aws.s3.S3FileIO
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Addressing Cloud-Specific Storage Nuances&lt;/h3&gt;
&lt;p&gt;When writing data to S3 or Google Cloud Storage, standard Hadoop filesystem configurations can introduce significant performance bottlenecks. To bypass these legacy limitations, Iceberg implements native FileIO interfaces such as &lt;code&gt;S3FileIO&lt;/code&gt; and &lt;code&gt;GCSFileIO&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;These native implementations offer several operational benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Direct API Calls&lt;/strong&gt;: Bypasses the legacy Hadoop FileSystem wrapper, executing direct cloud storage API commands for file writes and catalog metadata reads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vectorized Reads&lt;/strong&gt;: Supports range reads to fetch specific Parquet column footer metadata blocks in parallel, reducing network I/O overhead.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multipart Uploads&lt;/strong&gt;: Optimizes high-throughput writes by streaming file blocks in parallel to cloud storage, preventing memory exhaustion on executor nodes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Credential Vending Integration&lt;/strong&gt;: Interfaces with REST catalogs to request temporary cloud storage access credentials, eliminating the need to distribute static IAM keys to Spark clusters.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By combining multi-catalog configurations with native cloud FileIO layers, data engineers can build hybrid lakehouse architectures that span local testing sandboxes, cloud warehouses, and secure REST catalogs.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;12. Advanced MERGE INTO Execution Mechanics&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;MERGE INTO&lt;/code&gt; statement is one of the most complex SQL query operations Spark executes over Iceberg tables. To manage writes efficiently, data engineers must configure how Spark performs these join operations.&lt;/p&gt;
&lt;p&gt;When Spark compiles a merge query, it evaluates the update and insert conditions and determines how to match rows between the source and target datasets. Depending on the table size and sorting properties, Spark selects one of two join execution strategies:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Broadcast Hash Join&lt;/strong&gt;: If the source staging table is small (such as an incremental change capture log of a few megabytes), Spark broadcasts the staging table to all executor nodes. This strategy avoids sorting or partitioning the target table, executing the merge operation in a single stage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shuffle Hash Join&lt;/strong&gt;: If both the target table and the source table are large, Spark executes a full shuffle join. It repartitions both datasets by the merge join key across the cluster network. This repartitioning step ensures that matching records are routed to the same executor nodes for evaluation.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Tuning Write Amplification for Merge Queries&lt;/h3&gt;
&lt;p&gt;Merge queries can generate significant write amplification if the target tables are not sorted or partitioned correctly. You can tune these operations by configuring the target table properties:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Optimize merge query performance */
ALTER TABLE local.analytics.orders SET TBLPROPERTIES (
    &apos;write.merge.mode&apos; = &apos;merge-on-read&apos;,
    &apos;write.update.mode&apos; = &apos;merge-on-read&apos;,
    &apos;write.delete.mode&apos; = &apos;merge-on-read&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By default, updates and deletes execute in Copy-on-Write mode, which rewrites complete Parquet files even for minor changes. Setting the write mode to Merge-on-Read directs Spark to append delete files instead.&lt;/p&gt;
&lt;p&gt;To optimize read speeds after high-frequency updates, you should run regular compaction routines to consolidate delete files and merge them back into the base Parquet format:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;/* Compact orders table partitions to merge delete files */
CALL local.system.rewrite_data_files(
    table =&amp;gt; &apos;analytics.orders&apos;,
    strategy =&amp;gt; &apos;sort&apos;,
    sort_order =&amp;gt; &apos;order_id ASCNullsLast&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This compaction call reads the data and delete files, applies all updates, and writes optimized Parquet files back to storage, restoring optimal read speeds for downstream query engines like Dremio.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;13. Troubleshooting Common Spark and Iceberg Errors&lt;/h2&gt;
&lt;p&gt;Integrating Spark and Iceberg can lead to specific configuration and runtime exceptions. Understanding these errors helps diagnose issues quickly.&lt;/p&gt;
&lt;h3&gt;1. ClassNotFoundException: SparkCatalog&lt;/h3&gt;
&lt;p&gt;If you receive a ClassNotFoundException for &lt;code&gt;org.apache.iceberg.spark.SparkCatalog&lt;/code&gt; when starting a SparkSession, Spark is unable to locate the runtime jar files on its execution classpath.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Root Cause&lt;/strong&gt;: The Iceberg runtime package is missing from Spark&apos;s executor or driver classpath libraries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Resolution&lt;/strong&gt;: Verify that the &lt;code&gt;--packages&lt;/code&gt; flag is correctly specified or that the jar file is present in Spark&apos;s default jar folder. Ensure that the package version matches the Spark version and Scala version exactly (for example, &lt;code&gt;iceberg-spark-runtime-3.5_2.12&lt;/code&gt; for Spark 3.5).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. CommitFailedException: Concurrent Modification&lt;/h3&gt;
&lt;p&gt;This error occurs when multiple Spark write tasks attempt to commit changes to the same Iceberg table snapshot simultaneously.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Root Cause&lt;/strong&gt;: Optimistic concurrency control validation failed because the catalog reference pointer has been modified by another process.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Resolution&lt;/strong&gt;: Increase the number of catalog commit retries by setting &lt;code&gt;&apos;commit.retry.num-retries&apos; = &apos;10&apos;&lt;/code&gt; in the table properties. Alternatively, structure orchestration pipelines to avoid concurrent write processes targeting the same table.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. AnalysisException: Cannot write incompatible data to table&lt;/h3&gt;
&lt;p&gt;This validation exception is raised when the schema of the incoming DataFrame does not match the target Iceberg table structure.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Root Cause&lt;/strong&gt;: Type promotion rules were violated or columns were missing in the DataFrame without enabling schema merging.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Resolution&lt;/strong&gt;: Cast columns explicitly to the target schema types before appending. If adding new fields, set &lt;code&gt;.option(&amp;quot;mergeSchema&amp;quot;, &amp;quot;true&amp;quot;)&lt;/code&gt; or execute an &lt;code&gt;ALTER TABLE&lt;/code&gt; query first to define the new fields in the metadata.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;14. Summary and Best Practices Checklist&lt;/h2&gt;
&lt;p&gt;Integrating Apache Spark with Apache Iceberg allows organizations to build reliable, scalable data platforms. By managing writes in metadata and tracking column references with immutable IDs, Iceberg prevents data corruption and simplifies schema management.&lt;/p&gt;
&lt;p&gt;To maintain performance, data engineers should follow this operational checklist:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Configure SQL Extensions&lt;/strong&gt;: Ensure &lt;code&gt;IcebergSparkSessionExtensions&lt;/code&gt; is loaded to enable commands like &lt;code&gt;MERGE INTO&lt;/code&gt; and &lt;code&gt;UPDATE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Align Partition Specifications&lt;/strong&gt;: Use logical partition transforms like &lt;code&gt;month()&lt;/code&gt; or &lt;code&gt;bucket()&lt;/code&gt; to optimize file layout and prune queries automatically.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Select the Right Write Mode&lt;/strong&gt;: Deploy Copy on Write (CoW) tables for read-heavy analytical workloads. Use Merge on Read (MoR) tables for high frequency streaming ingestion or live CDC pipelines.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manage File Fragmentation&lt;/strong&gt;: Set the target file size property (&lt;code&gt;write.target-file-size-bytes&lt;/code&gt;) to 256 MB or 512 MB, and set the write distribution mode to &lt;code&gt;hash&lt;/code&gt; or &lt;code&gt;sort&lt;/code&gt; to prevent small file generation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deploy Dremio for BI serving&lt;/strong&gt;: Run Dremio over Spark written Iceberg tables to accelerate query execution. Use Dremio&apos;s vectorized Arrow reader, metadata caching, and reflections to deliver sub second response times.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Building a Multicloud Agentic Lakehouse Reference Architecture</title><link>https://iceberglakehouse.com/posts/2026-05-22-multicloud-agentic-lakehouse-reference-architecture/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-22-multicloud-agentic-lakehouse-reference-architecture/</guid><description>
Artificial intelligence has evolved past static retrieval-augmented generation chatbots. Organizations are now deploying autonomous **AI Agents** tha...</description><pubDate>Fri, 22 May 2026 09:30:00 GMT</pubDate><content:encoded>&lt;p&gt;Artificial intelligence has evolved past static retrieval-augmented generation chatbots. Organizations are now deploying autonomous &lt;strong&gt;AI Agents&lt;/strong&gt; that can analyze requests, design plans, query data lakes, and execute downstream operations.&lt;/p&gt;
&lt;p&gt;However, when developers connect AI agents to traditional enterprise data platforms, they encounter critical barriers. Traditional data warehouses are built for human business intelligence analysts who write predictable SQL queries. AI agents generate queries dynamically, require sub-second response times for iterative reasoning loops, and must operate under strict security boundaries to prevent data exfiltration.&lt;/p&gt;
&lt;p&gt;To solve these challenges, organizations are adopting the &lt;strong&gt;Agentic Lakehouse&lt;/strong&gt; architecture. This reference architecture describes an open, multicloud data lakehouse specifically optimized for autonomous AI agents. The stack is anchored by &lt;strong&gt;Apache Iceberg&lt;/strong&gt; as the open storage standard, &lt;strong&gt;Apache Polaris&lt;/strong&gt; as the cross-cloud REST catalog, and &lt;strong&gt;Dremio&lt;/strong&gt; as the semantic and query acceleration layer.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;1. The AI Agent Data Bottleneck&lt;/h2&gt;
&lt;p&gt;To understand why a dedicated architecture is necessary, we must examine the specific issues that occur when an AI agent interacts with standard data infrastructure.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;┌────────────────────────────────────────────────────────┐
│                     AGENT WORKFLOW                     │
│               [ User: &amp;quot;Analyze Sales&amp;quot; ]                │
│                           │                            │
│                           ▼                            │
│                  [ Reason &amp;amp; Plan ]                     │
│                           │                            │
│              ┌────────────┴────────────┐               │
│              ▼                         ▼               │
│       [ Generate SQL ]          [ Execute Tool ]       │
│              │                         │               │
│              ▼                         ▼               │
│       [ Query Engine ]          [ Action Loop ]        │
└──────────────┬─────────────────────────┬───────────────┘
               │ (Wait for DB)           │ (Write back)
               ▼                         ▼
┌────────────────────────────────────────────────────────┐
│                TRADITIONAL DATA PLATFORM               │
│   - Low-context tables (tbl_sales_v2)                  │
│   - Slow JDBC/ODBC serialization                       │
│   - Coarse access controls (All or Nothing)            │
└────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Non-Deterministic Query Generation&lt;/h3&gt;
&lt;p&gt;When a human analyst writes a SQL query, they inspect the database schema, identify the foreign keys, and write a structured join. An AI agent uses a Large Language Model (LLM) to generate SQL queries on the fly based on text descriptions of the database.&lt;/p&gt;
&lt;p&gt;If the database schema is disorganized, uses cryptic column names (such as &amp;lt;code&amp;gt;c_adr_id_fk&amp;lt;/code&amp;gt;), or lacks rich metadata, the agent will generate incorrect joins or hallucinated column names, causing the query to fail. Agents require a structured semantic layer that translates raw database layouts into clean, documented business concepts.&lt;/p&gt;
&lt;h3&gt;Latency Accumulation in Reason-Action Loops&lt;/h3&gt;
&lt;p&gt;Autonomous agents use cognitive architectures like the ReAct (Reasoning and Action) pattern. Instead of running a single query, the agent may execute a multi-step loop:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Look up user information.&lt;/li&gt;
&lt;li&gt;Query purchase history.&lt;/li&gt;
&lt;li&gt;Compare purchases with regional trends.&lt;/li&gt;
&lt;li&gt;Calculate fraud risk scores.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If each query takes five to ten seconds to complete due to query planning or serialization delays, the end-to-end agent loop can take over thirty seconds, creating an unacceptable user experience. Agents require sub-second query response times to complete multi-step reasoning tasks.&lt;/p&gt;
&lt;h3&gt;Fine-Grained Security and Data Leakage&lt;/h3&gt;
&lt;p&gt;Traditional database security relies on granting broad permissions to service accounts. If you grant an AI agent access to a database via a general service account, the agent can potentially query any table, read sensitive columns, or scan the entire dataset.&lt;/p&gt;
&lt;p&gt;If the agent’s prompt is manipulated (prompt injection), the agent could be instructed to dump private customer data or overwrite table configurations. Agents require strict, granular access control down to the row and column level, enforced at the query engine level, to guarantee data security.&lt;/p&gt;
&lt;h3&gt;The Multicloud Reality&lt;/h3&gt;
&lt;p&gt;Modern enterprises do not keep all their data or AI tools in a single cloud. You may run machine learning pipelines on Google Cloud Platform (GCP) Vertex AI, query transaction records stored on Amazon Web Services (AWS) S3, and deploy customer-facing agents on Microsoft Azure.&lt;/p&gt;
&lt;p&gt;Moving hundreds of gigabytes of data between clouds to support local AI models is cost-prohibitive due to egress fees. The data must remain in place and be queried where it lies, using a federated, multicloud metadata catalog.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;2. The Storage and Metadata Foundation: Apache Iceberg and Apache Polaris&lt;/h2&gt;
&lt;p&gt;The physical storage and catalog layers of the reference architecture must support multi-engine access and cross-cloud query execution without creating data silos.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;┌────────────────────────────────────────────────────────┐
│                     COMPUTE ENGINES                    │
│    ┌──────────────┐ ┌───────────────┐ ┌────────────┐   │
│    │ Dremio (SQL) │ │ Apache Spark  │ │ Python/ML  │   │
│    └──────┬───────┘ └───────┬───────┘ └─────┬──────┘   │
└───────────┼─────────────────┼───────────────┼──────────┘
            │                 │               │
┌───────────▼─────────────────▼───────────────▼──────────┐
│                  REST CATALOG ROUTER                   │
│                   [ Apache Polaris ]                   │
│   - Validates engine identity and OAuth2 tokens        │
│   - Vends short-lived S3 access credentials            │
└─────────────────────────────┬──────────────────────────┘
                              │
┌─────────────────────────────▼──────────────────────────┐
│                     STORAGE LAYER                      │
│             [Cloud Object Storage (S3/ADLS)]           │
│             [Apache Iceberg Table Metadata]            │
└────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Unified Open Table Format: Apache Iceberg&lt;/h3&gt;
&lt;p&gt;To prevent vendor lock-in and support diverse engines, the data lakehouse stores all files as &lt;strong&gt;Apache Iceberg&lt;/strong&gt; tables. Iceberg provides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ACID Transactions:&lt;/strong&gt; Ensures that data written by real-time streaming pipelines (e.g., Flink) is committed atomically, making it instantly visible to analytical engines (e.g., Dremio) without read-write conflicts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hidden Partitioning:&lt;/strong&gt; Speeds up query planning by automatically translating natural queries (like timestamp ranges) into optimized partition filters, ensuring that agent-generated queries do not trigger full table scans.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema and Partition Evolution:&lt;/strong&gt; Allows the database schema and partitioning strategies to evolve over time without requiring table rewrites.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Cross-Cloud Routing: Apache Polaris&lt;/h3&gt;
&lt;p&gt;To coordinate table state across multiple clouds, we deploy &lt;strong&gt;Apache Polaris&lt;/strong&gt; as our open REST Catalog. Polaris operates as a lightweight, stateless catalog manager:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Single Catalog Registry:&lt;/strong&gt; Polaris manages pointers for all Iceberg tables across AWS S3, Azure Data Lake Storage, and GCP Cloud Storage. It allows query engines in any cloud to resolve table paths using a single API.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Credential Vending for Security:&lt;/strong&gt; When a query engine requests the location of an Iceberg table, it authenticates with Polaris using OAuth2 client credentials. Polaris validates the request and communicates with the cloud provider (e.g., AWS STS) to generate short-lived, read-only security credentials for the specific table path. The query engine never has permanent read or write access to the raw S3 bucket, preventing credentials from being leaked or abused.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ecosystem Interoperability:&lt;/strong&gt; Polaris supports the open-source Iceberg REST catalog specification. This ensures that Dremio, Snowflake, Spark, Flink, and Python engines can query the same metadata registry, preventing catalog fragmentation.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Credential Vending Protocols and AWS IAM Integration&lt;/h3&gt;
&lt;p&gt;To understand how Polaris secures object storage, we can trace the credential vending handshake. When Dremio attempts to plan a query over a table, it does not use a global AWS access key. Instead, the transaction follows a strict sequence:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Token Exchange:&lt;/strong&gt; The engine sends an OAuth2 token request to Polaris using client credentials configured for that engine.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Access Control Resolution:&lt;/strong&gt; Polaris verifies the client credentials and checks if the mapped principal has the &lt;code&gt;catalog_read&lt;/code&gt; privilege on the requested namespace.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AssumeRole Handshake:&lt;/strong&gt; Polaris contacts the AWS Security Token Service (STS) endpoint using an IAM AssumeRole API call. Polaris passes a session policy that restricts access exclusively to the table&apos;s S3 location (for example, &lt;code&gt;s3://lakehouse-warehouse/db/user_events/&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Credential Injection:&lt;/strong&gt; AWS STS returns a set of temporary, scoped security credentials (access key, secret key, and session token) that expire after a short duration (typically one hour).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Scan:&lt;/strong&gt; Polaris sends these credentials back to Dremio along with the table&apos;s metadata location. Dremio uses the temporary keys to stream the Parquet blocks directly from S3.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This process ensures that Dremio is never exposed to keys that could read other directories in the bucket.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;3. The Semantic Layer: Dremio&lt;/h2&gt;
&lt;p&gt;The semantic layer bridges the gap between raw database storage and the AI agent&apos;s reasoning engine. &lt;strong&gt;Dremio&lt;/strong&gt; serves as the unified semantic and query acceleration layer in this reference architecture.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;┌────────────────────────────────────────────────────────┐
│                     CLIENT SYSTEM                      │
│      ┌──────────────────────────────────────────┐      │
│      │  AI Agent (Python Framework/LlamaIndex)  │      │
│      └────────────────────┬─────────────────────┘      │
└───────────────────────────┼────────────────────────────┘
                            │ (Arrow Flight SQL / TCP Stream)
┌───────────────────────────▼────────────────────────────┐
│                    SEMANTIC LAYER                      │
│                  [ Dremio Platform ]                   │
│   - Semantic Mapping (Virtual Datasets)                │
│   - Dynamic SQL Reflections (Acceleration)             │
│   - Row/Column Masking Policies                        │
└────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Business Context Mapping (Virtual Datasets)&lt;/h3&gt;
&lt;p&gt;Dremio allows data architects to define &lt;strong&gt;Virtual Datasets&lt;/strong&gt;. These are clean logical abstractions of raw tables that do not duplicate the underlying data. Dremio’s semantic features include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Human-Readable Schemas:&lt;/strong&gt; Cryptic table layouts are mapped to intuitive business hierarchies (e.g., &amp;lt;code&amp;gt;Enterprise_Data.Customer_Success.Active_Subscribers&amp;lt;/code&amp;gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rich Documentation Caching:&lt;/strong&gt; Descriptions, tags, and data types are attached directly to columns in the semantic layer. When the AI agent scans the schema, it reads these descriptions as structured prompt context, ensuring it understands the meaning of each column.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pre-Joined Relationships:&lt;/strong&gt; Complex joins are defined as virtual views. The agent can query a single dataset without needing to reconstruct multi-table join syntax, reducing query errors.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Arrow Flight SQL for Sub-Second Latency&lt;/h3&gt;
&lt;p&gt;Traditional database connections use Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) protocols. These protocols serialize data into row-by-row representations, creating a network transfer bottleneck when moving large datasets.&lt;/p&gt;
&lt;p&gt;Dremio supports &lt;strong&gt;Apache Arrow Flight SQL&lt;/strong&gt;, an open-source protocol built for high-speed columnar data transfer:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Vectorized Data Streaming:&lt;/strong&gt; Flight SQL streams data directly from Dremio’s memory to the AI agent’s Python environment in columnar Arrow buffers. This eliminates the serialization and deserialization steps required by JDBC.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parallel TCP Streams:&lt;/strong&gt; Flight SQL can distribute the data transfer across multiple network streams, allowing large result sets to be loaded into Python in milliseconds, which accelerates the agent&apos;s internal analysis steps.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Query Acceleration: Reflections&lt;/h3&gt;
&lt;p&gt;To support real-time interactive BI and rapid agent loops, Dremio utilizes &lt;strong&gt;SQL Reflections&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Autonomous Query Acceleration:&lt;/strong&gt; Reflections are optimized materializations of data layouts (such as aggregations or sorted partitions) stored as Apache Iceberg tables in the warehouse.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost-Based Plan Rewriting:&lt;/strong&gt; When an agent submits a query, Dremio’s compiler evaluates the query and automatically rewrites the execution plan to scan the reflection instead of the raw table. The agent gets query responses in milliseconds without needing to modify its SQL syntax.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Background Synchronization:&lt;/strong&gt; Dremio coordinates the maintenance of reflections in the background, updating them incrementally as new data commits to the base Iceberg tables.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;SQL Reflection Mechanics and Arrow Flight SQL Serialization&lt;/h3&gt;
&lt;p&gt;Dremio accelerates agent loops using SQL Reflections, which represent pre-computed physical representations of logical query paths. There are two primary types of reflections:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Raw Reflections:&lt;/strong&gt; These reflections store a subset of table columns, sorted or partitioned by fields commonly used in filtering or joining. They behave like materialized index layouts but are stored as Iceberg tables on S3.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aggregation Reflections:&lt;/strong&gt; These reflections pre-calculate common roll-ups and grouping metrics, storing the aggregated measures along with the dimension dimensions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;During the query compilation phase, Dremio&apos;s cost-based optimizer performs reflection matching:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[ Incoming SQL Query ]
         │
         ▼
[ Cost-Based Optimizer ]
         │
  ┌──────┴────────────────────────────────────────┐
  │ (Checks available Reflections)                │
  ▼                                               ▼
[ Option A: Scan Raw S3 Table ]     [ Option B: Match Reflection Subtree ]
Cost: High I/O, slow scan           Cost: Low I/O, pre-aggregated scan
                                                  │
                                                  ▼
                                    [ Rewrite Query Plan to Reflection ]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the optimizer identifies that the query&apos;s projection and filtering criteria can be satisfied by an active reflection, it automatically swaps the execution plan subtree. The physical plan reads from the reflection&apos;s pre-computed Parquet files instead of scanning millions of raw rows, which reduces latency.&lt;/p&gt;
&lt;p&gt;Once the compute nodes process the data, it must be returned to the client. Flight SQL maximizes this transfer speed by using a vectorized stream layout:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Arrow IPC Format:&lt;/strong&gt; Unlike JDBC, which requires converting binary records to Java objects and then to client formats, Flight SQL keeps records in the Apache Arrow In-Memory format.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;gRPC Transportation:&lt;/strong&gt; Data is streamed in chunks over gRPC, bypassing traditional network serialization overhead. This allows the AI agent&apos;s Python process to receive millions of records directly into memory as a PyArrow buffer, accelerating downstream pandas or polars manipulations.&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h2&gt;4. Execution Flow: Step-by-Step Walkthrough&lt;/h2&gt;
&lt;p&gt;To see how the components interact in production, we trace a query from the initial user request to the final result delivery.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt; [User Prompt]
      │
      ▼
┌───────────┐
│ AI Agent  │ ◄─── (Retrieves Virtual Schema)
└─────┬─────┘
      │ (Submits SQL via Arrow Flight)
      ▼
┌───────────┐
│  Dremio   │ ◄─── (Requests Table Pointer &amp;amp; S3 Credentials) ───► ┌─────────┐
└─────┬─────┘                                                     │ Polaris │
      │                                                           └─────────┘
      │ (Applies Row Filters &amp;amp; SSN Masking)
      ▼
┌───────────┐
│ NVMe Cache│ ◄─── (Reads Cached Parquet Blocks or S3 Streams)
└─────┬─────┘
      │
      ▼ (Returns Vectorized Arrow Stream)
 [AI Agent]
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 1: User Request&lt;/h3&gt;
&lt;p&gt;A business manager inputs a query to the agent interface: &lt;em&gt;&amp;quot;Identify the total revenue generated by premium tier subscribers in the Northwest region during the first quarter of 2026.&amp;quot;&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;Step 2: Semantic Analysis and Schema Discovery&lt;/h3&gt;
&lt;p&gt;The AI Agent parses the request. It uses PyIceberg or a metadata utility to query Dremio&apos;s semantic schema. The agent retrieves the virtual dataset definition for &amp;lt;code&amp;gt;Corporate_Sales.Subscription_Details&amp;lt;/code&amp;gt;, reading column tags and descriptions:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;dataset&amp;quot;: &amp;quot;Corporate_Sales.Subscription_Details&amp;quot;,
  &amp;quot;columns&amp;quot;: [
    {
      &amp;quot;name&amp;quot;: &amp;quot;subscriber_id&amp;quot;,
      &amp;quot;type&amp;quot;: &amp;quot;STRING&amp;quot;,
      &amp;quot;description&amp;quot;: &amp;quot;Unique identifier for customer accounts&amp;quot;
    },
    {
      &amp;quot;name&amp;quot;: &amp;quot;tier&amp;quot;,
      &amp;quot;type&amp;quot;: &amp;quot;STRING&amp;quot;,
      &amp;quot;description&amp;quot;: &amp;quot;Subscription tier, values include &apos;Basic&apos;, &apos;Standard&apos;, &apos;Premium&apos;&amp;quot;
    },
    {
      &amp;quot;name&amp;quot;: &amp;quot;region&amp;quot;,
      &amp;quot;type&amp;quot;: &amp;quot;STRING&amp;quot;,
      &amp;quot;description&amp;quot;: &amp;quot;Geographical region code&amp;quot;
    },
    {
      &amp;quot;name&amp;quot;: &amp;quot;monthly_rate&amp;quot;,
      &amp;quot;type&amp;quot;: &amp;quot;DECIMAL&amp;quot;,
      &amp;quot;description&amp;quot;: &amp;quot;Billing rate per month&amp;quot;
    },
    {
      &amp;quot;name&amp;quot;: &amp;quot;signup_date&amp;quot;,
      &amp;quot;type&amp;quot;: &amp;quot;TIMESTAMP&amp;quot;,
      &amp;quot;description&amp;quot;: &amp;quot;Timestamp of account creation&amp;quot;
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 3: SQL Generation&lt;/h3&gt;
&lt;p&gt;The agent uses its internal LLM reasoning block to construct a SQL query based on the virtual dataset schema:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  region,
  SUM(monthly_rate * 3) as q1_revenue
FROM Corporate_Sales.Subscription_Details
WHERE tier = &apos;Premium&apos;
  AND region = &apos;Northwest&apos;
  AND signup_date BETWEEN &apos;2026-01-01 00:00:00&apos; AND &apos;2026-03-31 23:59:59&apos;
GROUP BY region;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 4: SQL Submission&lt;/h3&gt;
&lt;p&gt;The agent submits the generated SQL query to Dremio using Arrow Flight SQL.&lt;/p&gt;
&lt;h3&gt;Step 5: Catalog Authentication and Pointer Resolution&lt;/h3&gt;
&lt;p&gt;Dremio’s query planner receives the SQL. Before executing the plan, Dremio contacts the &lt;strong&gt;Apache Polaris&lt;/strong&gt; catalog:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Dremio authenticates with Polaris using its OAuth2 client credentials.&lt;/li&gt;
&lt;li&gt;Dremio requests the Iceberg metadata pointer for the physical tables referenced by the virtual dataset.&lt;/li&gt;
&lt;li&gt;Polaris validates Dremio&apos;s permissions, generates short-lived, read-only IAM access tokens for the specific S3 file paths, and returns the pointer to the latest Iceberg metadata JSON file.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 6: Plan Optimization and Security Enforcement&lt;/h3&gt;
&lt;p&gt;Dremio&apos;s optimizer applies fine-grained access control policies and plan rewrites:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Row Filtering:&lt;/strong&gt; Dremio checks the agent’s execution role. If the role restricts access to specific regions, Dremio automatically injects additional filters (e.g., &amp;lt;code&amp;gt;AND region = &apos;Northwest&apos;&amp;lt;/code&amp;gt;) into the query tree.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Column Masking:&lt;/strong&gt; If the query requested sensitive user fields, Dremio applies masking expressions to redact them.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reflection Matching:&lt;/strong&gt; Dremio checks if a matching reflection (such as an aggregation reflection on revenue columns) is available, rewriting the plan to scan the reflection.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 7: Execution and Data Ingestion&lt;/h3&gt;
&lt;p&gt;Dremio’s execution worker nodes process the query:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The nodes check Dremio&apos;s &lt;strong&gt;Columnar Cloud Cache (C3)&lt;/strong&gt;. If the required Parquet blocks are already cached on the workers&apos; local NVMe SSD drives, they read them instantly.&lt;/li&gt;
&lt;li&gt;Any missing data blocks are streamed directly from S3 using the temporary credentials vended by Polaris.&lt;/li&gt;
&lt;li&gt;The query engine performs the aggregation and filtering in memory using Arrow columnar structures.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 8: Vectorized Result Stream&lt;/h3&gt;
&lt;p&gt;Dremio streams the resulting dataset back to the AI Agent over Arrow Flight SQL. The agent receives the data directly into a local Python Polars dataframe without serialization delays.&lt;/p&gt;
&lt;h3&gt;Step 9: Response Generation&lt;/h3&gt;
&lt;p&gt;The agent analyzes the table data and outputs a natural-language response to the user: &lt;em&gt;&amp;quot;Premium tier subscribers in the Northwest region generated a total of 14,250,300 dollars in revenue during Q1 2026.&amp;quot;&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;5. Security and Governance Controls&lt;/h2&gt;
&lt;p&gt;To deploy this reference architecture in enterprise environments, you must implement strict safety boundaries at each layer of the stack.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;┌────────────────────────────────────────────────────────┐
│                   SECURITY BOUNDARIES                  │
│                                                        │
│   ┌──────────────────────────────────────────────┐     │
│   │  AI Agent Prompt Sanitization (LLM Guard)    │     │
│   └──────────────────────┬───────────────────────┘     │
│                          ▼                             │
│   ┌──────────────────────────────────────────────┐     │
│   │  Dremio Semantic Layer Row &amp;amp; Column RBAC     │     │
│   └──────────────────────┬───────────────────────┘     │
│                          ▼                             │
│   ┌──────────────────────────────────────────────┐     │
│   │  Apache Polaris REST Credential Vending      │     │
│   └──────────────────────┬───────────────────────┘     │
│                          ▼                             │
│   ┌──────────────────────────────────────────────┐     │
│   │  Cloud KMS Encryption &amp;amp; IAM Buckets Policies │     │
│   └──────────────────────────────────────────────┘     │
└────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Restricting Catalog Permissions in Polaris&lt;/h3&gt;
&lt;p&gt;The role-based access controls in Polaris should be configured to isolate engines based on their operational duties. The AI Agent’s query interface should connect to Dremio using a dedicated, read-only credential. Dremio’s catalog client in Polaris must only hold the &amp;lt;code&amp;gt;TABLE_READ&amp;lt;/code&amp;gt; role on specific namespaces, preventing the engine from executing DDL commands (like &amp;lt;code&amp;gt;DROP TABLE&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;ALTER TABLE&amp;lt;/code&amp;gt;) even if a malicious prompt injection occurs.&lt;/p&gt;
&lt;h3&gt;Centralized Data Masking in Dremio&lt;/h3&gt;
&lt;p&gt;Enforce data masking policies inside Dremio&apos;s semantic layer, rather than relying on application code. Masking policies must automatically replace sensitive identifiers (like credit cards, emails, or government IDs) with hashed strings or default masks unless the user role is authorized to view them. This ensures that raw personal data is never loaded into the agent&apos;s LLM context window.&lt;/p&gt;
&lt;h3&gt;S3 Object Storage Encryption&lt;/h3&gt;
&lt;p&gt;Ensure that all Parquet data files and metadata logs are encrypted at rest using server-side encryption with customer-managed keys (SSE-KMS) inside cloud object storage. When Polaris vends credentials, it should only vend read permissions for the specific keys corresponding to the active table paths, maintaining strict file-level isolation.&lt;/p&gt;
&lt;h3&gt;Custom IAM Policies for Polaris Credential Vending&lt;/h3&gt;
&lt;p&gt;To implement credential vending securely, the IAM role assumed by Polaris must have a policy that allows it to delegate access to S3. Below is an example of an AWS IAM policy attached to the Polaris catalog execution role:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;Version&amp;quot;: &amp;quot;2012-10-17&amp;quot;,
  &amp;quot;Statement&amp;quot;: [
    {
      &amp;quot;Sid&amp;quot;: &amp;quot;S3BucketList&amp;quot;,
      &amp;quot;Effect&amp;quot;: &amp;quot;Allow&amp;quot;,
      &amp;quot;Action&amp;quot;: &amp;quot;s3:ListBucket&amp;quot;,
      &amp;quot;Resource&amp;quot;: &amp;quot;arn:aws:s3:::lakehouse-warehouse&amp;quot;,
      &amp;quot;Condition&amp;quot;: {
        &amp;quot;StringLike&amp;quot;: {
          &amp;quot;s3:prefix&amp;quot;: [&amp;quot;db/*&amp;quot;]
        }
      }
    },
    {
      &amp;quot;Sid&amp;quot;: &amp;quot;S3ObjectReadWrite&amp;quot;,
      &amp;quot;Effect&amp;quot;: &amp;quot;Allow&amp;quot;,
      &amp;quot;Action&amp;quot;: [&amp;quot;s3:GetObject&amp;quot;, &amp;quot;s3:PutObject&amp;quot;, &amp;quot;s3:DeleteObject&amp;quot;],
      &amp;quot;Resource&amp;quot;: &amp;quot;arn:aws:s3:::lakehouse-warehouse/db/*&amp;quot;
    },
    {
      &amp;quot;Sid&amp;quot;: &amp;quot;KMSEncryption&amp;quot;,
      &amp;quot;Effect&amp;quot;: &amp;quot;Allow&amp;quot;,
      &amp;quot;Action&amp;quot;: [&amp;quot;kms:Decrypt&amp;quot;, &amp;quot;kms:GenerateDataKey&amp;quot;],
      &amp;quot;Resource&amp;quot;: &amp;quot;arn:aws:kms:us-east-1:123456789012:key/my-key-uuid&amp;quot;
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This policy grants Polaris the ability to list directories and read or write files within the &lt;code&gt;db/&lt;/code&gt; path, while also securing the data using KMS keys.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;6. Real-World Implementation Guide: Setting Up the Architecture&lt;/h2&gt;
&lt;p&gt;To deploy this reference architecture, follow these implementation steps.&lt;/p&gt;
&lt;h3&gt;Step 1: Configuring Apache Polaris REST Catalog&lt;/h3&gt;
&lt;p&gt;Start Polaris and create a new catalog instance pointing to your multicloud S3 storage warehouse.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Create a storage credential configuration in Polaris
curl -i -X POST http://polaris-service:8181/api/v1/catalog-roles \
  -H &amp;quot;Authorization: Bearer $ADMIN_TOKEN&amp;quot; \
  -H &amp;quot;Content-Type: application/json&amp;quot; \
  -d &apos;{
    &amp;quot;name&amp;quot;: &amp;quot;aws-storage-role&amp;quot;,
    &amp;quot;properties&amp;quot;: {
      &amp;quot;role-arn&amp;quot;: &amp;quot;arn:aws:iam::123456789012:role/PolarisS3Access&amp;quot;
    }
  }&apos;

# Create the Iceberg catalog
curl -i -X POST http://polaris-service:8181/api/v1/catalogs \
  -H &amp;quot;Authorization: Bearer $ADMIN_TOKEN&amp;quot; \
  -H &amp;quot;Content-Type: application/json&amp;quot; \
  -d &apos;{
    &amp;quot;name&amp;quot;: &amp;quot;enterprise_warehouse&amp;quot;,
    &amp;quot;type&amp;quot;: &amp;quot;INTERNAL&amp;quot;,
    &amp;quot;properties&amp;quot;: {
      &amp;quot;default-base-location&amp;quot;: &amp;quot;s3a://lakehouse-warehouse/&amp;quot;
    }
  }&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 2: Registering Polaris Catalog in Dremio&lt;/h3&gt;
&lt;p&gt;To connect Dremio to your Apache Polaris REST Catalog:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Open the Dremio administrator console.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Source&lt;/strong&gt; and select &lt;strong&gt;Apache Iceberg&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Set the Connection Type to &lt;strong&gt;REST&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Set the REST URI to &amp;lt;code&amp;gt;http://polaris-service:8181/api/v1&amp;lt;/code&amp;gt;.&lt;/li&gt;
&lt;li&gt;Set the Authentication method to &lt;strong&gt;OAuth2 Client Credentials&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Input the Client ID and Client Secret generated during your Polaris setup.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 3: Establishing Row-Level Security in Dremio&lt;/h3&gt;
&lt;p&gt;Create a row filter policy in Dremio to restrict database access based on user role assignments:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE ROW FILTER enterprise_warehouse.db.sales_data.region_filter
ON enterprise_warehouse.db.sales_data
USING (
  CASE
    WHEN IS_MEMBER(&apos;Executive&apos;) THEN TRUE
    WHEN IS_MEMBER(&apos;Regional_Sales_North&apos;) THEN region_code = &apos;US-NORTH&apos;
    WHEN IS_MEMBER(&apos;Regional_Sales_South&apos;) THEN region_code = &apos;US-SOUTH&apos;
    ELSE FALSE
  END
);

ALTER TABLE enterprise_warehouse.db.sales_data ADD ROW FILTER region_filter;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 4: Fetching Data via Arrow Flight SQL in Python&lt;/h3&gt;
&lt;p&gt;Use the &amp;lt;code&amp;gt;pyarrow.flight&amp;lt;/code&amp;gt; client library to establish a high-speed columnar connection from the AI Agent Python framework directly to Dremio:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pyarrow.flight as flight
from pyarrow.flight import FlightClient, Ticket

# Establish connection to Dremio coordinator node
client = FlightClient(&amp;quot;grpc+tcp://dremio-coordinator:32010&amp;quot;)

# Authenticate client credentials
auth_handler = flight.ClientAuthHandler()
# (Configure custom authentication handshake)

# Define query ticket representing the SQL execution command
sql_query = &amp;quot;SELECT * FROM enterprise_warehouse.db.sales_data&amp;quot;
ticket_bytes = Ticket(sql_query.encode(&apos;utf-8&apos;))

# Stream results vectorially into PyArrow table
reader = client.do_get(ticket_bytes)
arrow_table = reader.read_all()

# Convert Arrow table directly to Polars DataFrame for agent analysis
import polars as pl
df = pl.from_arrow(arrow_table)
print(df.head())
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 5: Implementing an Agentic Reasoning Loop&lt;/h3&gt;
&lt;p&gt;To build an agent that interacts with Dremio dynamically, you can construct a python execution class that receives natural language prompts, translates them to SQL queries, runs them over Arrow Flight, and returns a summarized answer:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import openai
import pyarrow.flight as flight
from pyarrow.flight import FlightClient, Ticket
import polars as pl

class DremioAgent:
    def __init__(self, dremio_host, dremio_port, openai_api_key):
        self.flight_client = FlightClient(f&amp;quot;grpc+tcp://{dremio_host}:{dremio_port}&amp;quot;)
        self.openai_client = openai.OpenAI(api_key=openai_api_key)
        self.schema_context = self.load_schema_context()

    def load_schema_context(self):
        # Queries Dremio metadata to load table descriptions
        query = &amp;quot;&amp;quot;&amp;quot;
        SELECT table_name, column_name, data_type
        FROM INFORMATION_SCHEMA.COLUMNS
        WHERE table_schema = &apos;enterprise_warehouse.db&apos;
        &amp;quot;&amp;quot;&amp;quot;
        ticket = Ticket(query.encode(&apos;utf-8&apos;))
        reader = self.flight_client.do_get(ticket)
        arrow_table = reader.read_all()
        df = pl.from_arrow(arrow_table)

        # Format the schema as prompt context
        context = &amp;quot;Available Tables:\n&amp;quot;
        for row in df.iter_rows(named=True):
            context += f&amp;quot;Table: {row[&apos;table_name&apos;]}, Column: {row[&apos;column_name&apos;]}, Type: {row[&apos;data_type&apos;]}\n&amp;quot;
        return context

    def execute_query(self, sql_query):
        try:
            ticket = Ticket(sql_query.encode(&apos;utf-8&apos;))
            reader = self.flight_client.do_get(ticket)
            arrow_table = reader.read_all()
            return pl.from_arrow(arrow_table)
        except Exception as e:
            return f&amp;quot;Query Execution Error: {str(e)}&amp;quot;

    def run(self, user_prompt):
        # Step 1: Generate SQL query using OpenAI LLM
        prompt = f&amp;quot;&amp;quot;&amp;quot;
        You are an AI Agent with read-only access to a data lakehouse.
        Using the following schema context, generate an ANSI SQL query to answer the user&apos;s request.
        Do not explain the query. Return only the raw SQL query.

        {self.schema_context}

        Request: {user_prompt}
        SQL Query:
        &amp;quot;&amp;quot;&amp;quot;

        response = self.openai_client.chat.completions.create(
            model=&amp;quot;gpt-4o&amp;quot;,
            messages=[{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: prompt}],
            temperature=0.0
        )
        generated_sql = response.choices[0].message.content.strip()
        print(f&amp;quot;Generated SQL: {generated_sql}&amp;quot;)

        # Step 2: Execute the query over Arrow Flight SQL
        result_df = self.execute_query(generated_sql)

        if isinstance(result_df, str):
            return f&amp;quot;Failed to retrieve data. {result_df}&amp;quot;

        # Step 3: Summarize results
        summary_prompt = f&amp;quot;&amp;quot;&amp;quot;
        Summarize the following data to answer the user&apos;s question: &apos;{user_prompt}&apos;

        Data:
        {result_df.head(10).to_string()}

        Summary:
        &amp;quot;&amp;quot;&amp;quot;

        summary_response = self.openai_client.chat.completions.create(
            model=&amp;quot;gpt-4o&amp;quot;,
            messages=[{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: summary_prompt}],
            temperature=0.2
        )
        return summary_response.choices[0].message.content.strip()

# Usage Example
# agent = DremioAgent(&amp;quot;dremio-coordinator&amp;quot;, 32010, &amp;quot;your-openai-api-key&amp;quot;)
# print(agent.run(&amp;quot;What are the top three customer segments by revenue?&amp;quot;))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This class demonstrates how agents can query a lakehouse platform dynamically while leveraging the performance benefits of Apache Arrow.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;7. Comparative Architecture Analysis&lt;/h2&gt;
&lt;p&gt;To evaluate how this reference architecture performs against legacy data warehouse models and basic RAG setups, refer to the analysis below:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&quot;text-align:left&quot;&gt;Feature&lt;/th&gt;
&lt;th style=&quot;text-align:left&quot;&gt;Legacy Warehouse (Redshift / Snowflake Native)&lt;/th&gt;
&lt;th style=&quot;text-align:left&quot;&gt;Basic RAG (Vector DB + File Search)&lt;/th&gt;
&lt;th style=&quot;text-align:left&quot;&gt;Agentic Lakehouse (Iceberg + Dremio + Polaris)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&quot;text-align:left&quot;&gt;&lt;strong&gt;Data Interoperability&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Tightly bound to proprietary storage formats and engines.&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Unstructured documents, no support for relational queries.&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Natively open (Iceberg); data is shared across Spark, Dremio, and Python.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;text-align:left&quot;&gt;&lt;strong&gt;Egress Fees and Cloud Costs&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;High cost to duplicate data across cloud environments.&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Duplicated text chunks stored in local vector index files.&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Zero data copying; Polaris routes queries to files stored in local clouds.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;text-align:left&quot;&gt;&lt;strong&gt;Query Latency (Agents)&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Moderate to slow due to driver serialization limits.&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Fast vector lookup, but slow for tabular analytics.&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Sub-second speeds via Dremio reflections, C3 NVMe cache, and Flight SQL.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;text-align:left&quot;&gt;&lt;strong&gt;Security Enforcements&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Hard-coded database schemas and service credentials.&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;No database-level governance; files are read fully by script.&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;REST catalog vends short-lived IAM credentials; Dremio enforces Row/Column RBAC.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Building a governed data lakehouse optimized for AI Agents requires a modern stack. Storing data in &lt;strong&gt;Apache Iceberg&lt;/strong&gt; tables on object storage ensures that multiple engines can access files concurrently. Managing table pointers via &lt;strong&gt;Apache Polaris&lt;/strong&gt; REST APIs coordinates secure, cross-cloud access. Deploying &lt;strong&gt;Dremio&lt;/strong&gt; as the semantic and query acceleration tier provides the necessary business metadata structure, row and column security boundaries, and Arrow Flight SQL execution speeds to support autonomous AI agent loops.&lt;/p&gt;
&lt;p&gt;By implementing this reference architecture, enterprise organizations can deploy secure, performant, and cost-effective AI agents that query multicloud datasets without data duplication or vendor lock-in.&lt;/p&gt;
&lt;p&gt;If you are ready to evaluate table format performance in detail, read our adjacent guide on &lt;a href=&quot;/benchmarks/open-table-formats/&quot;&gt;benchmarking open table formats&lt;/a&gt; or learn more about &lt;a href=&quot;/apache-iceberg/&quot;&gt;Apache Iceberg Architecture&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Common Misconceptions About Data Lakehouse and Apache Iceberg</title><link>https://iceberglakehouse.com/posts/2026-05-22-data-lakehouse-and-iceberg-misconceptions/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-05-22-data-lakehouse-and-iceberg-misconceptions/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2025-07-discovering-or-organizing-lakeho...</description><pubDate>Fri, 22 May 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2025-07-discovering-or-organizing-lakehouse-iceberg-meetups/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The open data lakehouse has emerged as the standard architecture for modern data platforms. By combining the governance and transactions of a data warehouse with the scale and cost efficiency of a data lake, the lakehouse allows organizations to run analytics, business intelligence, and machine learning on a single copy of data.&lt;/p&gt;
&lt;p&gt;However, as adoption has surged, so has architectural confusion. A significant amount of terminology overlap and vendor-specific marketing has led to misconceptions among data engineers and architects. In this guide, we address the most common misconceptions about data lakehouse architectures and Apache Iceberg. We analyze the underlying design principles and metadata behaviors to help you build a robust and highly performant data stack.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;1. Misconception: &amp;quot;A Data Lakehouse Is Just a Database on Object Storage&amp;quot;&lt;/h2&gt;
&lt;p&gt;A common initial reaction from database developers looking at the lakehouse pattern is: &lt;em&gt;&amp;quot;Is this not just a traditional database, but we are placing the files in S3 or ADLS instead of local disks?&amp;quot;&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;Decoupling Compute and Storage&lt;/h3&gt;
&lt;p&gt;A traditional relational database (RDBMS) or cloud data warehouse (like Snowflake or Google BigQuery in its traditional model) tightly couples its storage format with its query engine. Only the database&apos;s own query processor can read or write the internal files. These engines utilize proprietary, closed layouts that organize data into customized page structures. For example, PostgreSQL organizes records into 8 KB page blocks, while cloud warehouses use custom columnar formats optimized for their own execution networks.&lt;/p&gt;
&lt;p&gt;A data lakehouse separates the two layers entirely:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;┌────────────────────────────────────────────────────────┐
│                     COMPUTE LAYER                      │
│   ┌───────────────┐ ┌───────────────┐ ┌─────────────┐  │
│   │  Dremio (SQL) │ │ Apache Spark  │ │ Flink (CDC) │  │
│   └───────┬───────┘ └───────┬───────┘ └───────┬─────┘  │
└───────────┼─────────────────┼─────────────────┼────────┘
            │                 │                 │
            ▼                 ▼                 ▼
┌────────────────────────────────────────────────────────┐
│                     METADATA LAYER                     │
│                    [Apache Iceberg]                    │
└───────────────────────────┬────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│                     STORAGE LAYER                      │
│             [Cloud Object Storage (S3/GCS)]            │
└────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In a decoupled architecture, the storage tier consists of open file formats like Apache Parquet or ORC, and the metadata tier uses open table formats like Apache Iceberg. Because the metadata and data structures are open standards, multiple query engines can query and modify the same data files simultaneously.&lt;/p&gt;
&lt;p&gt;For instance, you can ingest high-velocity data using Apache Flink, run heavy batch transformation jobs using Apache Spark, and execute interactive SQL queries or BI dashboards using Dremio, all pointing to the exact same files on S3. No data duplication or ingestion pipelines are needed to sync data between different tools.&lt;/p&gt;
&lt;h3&gt;Query Engine Execution on Open Formats&lt;/h3&gt;
&lt;p&gt;When you submit a query to Dremio or Spark querying an Iceberg table, the engine does not request data through a proprietary storage manager. Instead, it reads the Iceberg metadata files directly from S3 to determine which specific Parquet files contain the matching rows. The engine then uses its own execution workers to fetch those Parquet files, process the columns in memory, and stream the results back.&lt;/p&gt;
&lt;p&gt;This model eliminates vendor lock-in. If a new, faster query engine or a cheaper processing tool is introduced, you can immediately point it at your existing Iceberg tables on S3 and begin querying. You do not need to execute expensive migrations or format conversions.&lt;/p&gt;
&lt;p&gt;Furthermore, decoupling compute from storage changes the economics of data platform scaling. In a traditional database, if you need more compute power to support concurrent queries during business hours, you must scale the entire database cluster. This scales storage and compute in lockstep, forcing you to pay for storage you do not need. In a lakehouse, storage capacity is billed at cheap object storage rates (approximately 23 dollars per terabyte per month on AWS S3), while compute resources can be scaled up or paused dynamically on an hourly basis.&lt;/p&gt;
&lt;h3&gt;Columnar Cloud Caching and Vectorized Execution&lt;/h3&gt;
&lt;p&gt;To achieve performance parity with coupled systems, modern lakehouse engines employ advanced caching and execution strategies. For example, Dremio implements a Columnar Cloud Cache (C3) that automatically caches Parquet blocks on local NVMe SSDs in compute nodes as queries run. Subsequent queries bypass the object store, fetching data directly from local NVMe, which reduces read latency. Dremio also processes data in-memory using Apache Arrow, which organizes records in columnar format, maximizing CPU cache locality and enabling SIMD hardware vectorization. To avoid traditional JDBC or ODBC serialization bottlenecks, clients fetch results via Arrow Flight SQL, which streams data over gRPC, ensuring high throughput for AI agents and analytical applications.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;2. Misconception: &amp;quot;Apache Iceberg Replaces My Catalog&amp;quot;&lt;/h2&gt;
&lt;p&gt;Another common point of confusion is the role of the metadata catalog. Because Apache Iceberg has configuration settings and classes called catalogs (such as &amp;lt;code&amp;gt;GlueCatalog&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;NessieCatalog&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;RestCatalog&amp;lt;/code&amp;gt;), many believe that adopting Iceberg eliminates the need for an external catalog.&lt;/p&gt;
&lt;h3&gt;The Metadata Hierarchy&lt;/h3&gt;
&lt;p&gt;To understand why this is a misconception, we must look at how Apache Iceberg coordinates a table change. Iceberg structures metadata hierarchically:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;                  ┌──────────────────────┐
                  │    Catalog Pointer   │
                  └──────────┬───────────┘
                             │ (Resolves to latest metadata JSON)
                  ┌──────────▼───────────┐
                  │  Table Metadata JSON │
                  └──────────┬───────────┘
                             │ (Tracks snapshots)
                  ┌──────────▼───────────┐
                  │     Manifest List    │
                  └──────────┬───────────┘
                             │ (Groups manifest files)
                  ┌──────────▼───────────┐
                  │     Manifest File    │
                  └──────────┬───────────┘
                             │ (Tracks individual Parquet files)
                  ┌──────────▼───────────┐
                  │   Data/Delete Files  │
                  └──────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Table Metadata File (JSON):&lt;/strong&gt; This file stores the table&apos;s schema, partition specifications, properties, and a history of previous snapshots.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manifest List (Avro):&lt;/strong&gt; Every commit or snapshot creates a manifest list. This file contains a list of manifest files that make up that specific snapshot, along with stats for each manifest (like partition ranges and file counts).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manifest File (Avro):&lt;/strong&gt; Manifests track individual data and delete files. They store column-level statistics (min/max values, null counts) for each Parquet file.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data and Delete Files:&lt;/strong&gt; The physical files (usually Parquet) that contain the actual records.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;An Anatomy of the Table Metadata JSON&lt;/h3&gt;
&lt;p&gt;To illustrate this, here is a simplified representation of what is tracked in the table metadata JSON file (e.g., &amp;lt;code&amp;gt;v1.metadata.json&amp;lt;/code&amp;gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;format-version&amp;quot;: 2,
  &amp;quot;table-uuid&amp;quot;: &amp;quot;d54d2452-f542-4f36-a192-3852086e3f28&amp;quot;,
  &amp;quot;location&amp;quot;: &amp;quot;s3a://lakehouse-warehouse/db/user_events&amp;quot;,
  &amp;quot;last-sequence-number&amp;quot;: 2,
  &amp;quot;last-updated-ms&amp;quot;: 1779494400000,
  &amp;quot;last-column-id&amp;quot;: 4,
  &amp;quot;schemas&amp;quot;: [
    {
      &amp;quot;type&amp;quot;: &amp;quot;struct&amp;quot;,
      &amp;quot;schema-id&amp;quot;: 0,
      &amp;quot;fields&amp;quot;: [
        { &amp;quot;id&amp;quot;: 1, &amp;quot;name&amp;quot;: &amp;quot;event_id&amp;quot;, &amp;quot;required&amp;quot;: true, &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot; },
        { &amp;quot;id&amp;quot;: 2, &amp;quot;name&amp;quot;: &amp;quot;user_id&amp;quot;, &amp;quot;required&amp;quot;: true, &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot; },
        {
          &amp;quot;id&amp;quot;: 3,
          &amp;quot;name&amp;quot;: &amp;quot;event_time&amp;quot;,
          &amp;quot;required&amp;quot;: true,
          &amp;quot;type&amp;quot;: &amp;quot;timestamp&amp;quot;
        },
        { &amp;quot;id&amp;quot;: 4, &amp;quot;name&amp;quot;: &amp;quot;payload&amp;quot;, &amp;quot;required&amp;quot;: false, &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot; }
      ]
    }
  ],
  &amp;quot;current-schema-id&amp;quot;: 0,
  &amp;quot;partition-specs&amp;quot;: [
    {
      &amp;quot;spec-id&amp;quot;: 0,
      &amp;quot;fields&amp;quot;: [
        {
          &amp;quot;source-id&amp;quot;: 3,
          &amp;quot;field-id&amp;quot;: 1000,
          &amp;quot;name&amp;quot;: &amp;quot;event_time_day&amp;quot;,
          &amp;quot;transform&amp;quot;: &amp;quot;day&amp;quot;
        }
      ]
    }
  ],
  &amp;quot;default-spec-id&amp;quot;: 0,
  &amp;quot;snapshots&amp;quot;: [
    {
      &amp;quot;snapshot-id&amp;quot;: 8374928172948293,
      &amp;quot;timestamp-ms&amp;quot;: 1779494400000,
      &amp;quot;manifest-list&amp;quot;: &amp;quot;s3a://lakehouse-warehouse/db/user_events/metadata/snap-8374928172948293.avro&amp;quot;,
      &amp;quot;summary&amp;quot;: {
        &amp;quot;operation&amp;quot;: &amp;quot;append&amp;quot;,
        &amp;quot;added-data-files&amp;quot;: &amp;quot;4&amp;quot;
      }
    }
  ],
  &amp;quot;current-snapshot-id&amp;quot;: 8374928172948293
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;The Role of the Catalog&lt;/h3&gt;
&lt;p&gt;The hierarchical metadata structure works beautifully, but it introduces a problem. Every write, update, or delete operation writes a new table metadata JSON file. How do query engines reading the table know which metadata JSON file represents the current state? And how do concurrent writers prevent overwriting each other&apos;s commits?&lt;/p&gt;
&lt;p&gt;This is where the catalog is required. The catalog serves as the single source of truth for the location of the latest table metadata JSON file. It coordinates transactions using a Compare-And-Swap (CAS) pattern:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;When a writer (like Spark) wants to update a table, it reads the current metadata JSON file path from the catalog (e.g., &amp;lt;code&amp;gt;v1.metadata.json&amp;lt;/code&amp;gt;).&lt;/li&gt;
&lt;li&gt;The writer writes new data files and compiles a new metadata JSON file (e.g., &amp;lt;code&amp;gt;v2.metadata.json&amp;lt;/code&amp;gt;).&lt;/li&gt;
&lt;li&gt;The writer requests the catalog to atomically swap the table pointer from &amp;lt;code&amp;gt;v1.metadata.json&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;v2.metadata.json&amp;lt;/code&amp;gt;.&lt;/li&gt;
&lt;li&gt;If another writer committed a change in the meantime, the catalog rejects the swap. The first writer must then read the new state, merge its changes, and try the commit swap again.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Without an Iceberg catalog (such as Apache Polaris, Nessie, AWS Glue, or a REST catalog implementation), multiple engines could write to the same table concurrently, leading to silent data corruption or overwritten snapshots. Iceberg defines the table format structure, while the catalog manages transaction coordination and pointer safety.&lt;/p&gt;
&lt;h3&gt;Nessie and Polaris: Distinct Architectural Choices&lt;/h3&gt;
&lt;p&gt;Architects selecting an Iceberg catalog often choose between stateless REST catalogs and transactional catalog databases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Apache Polaris:&lt;/strong&gt; A lightweight, open-source REST catalog that exposes standard Iceberg REST endpoints. It acts as a stateless service that enforces role-based access control and integrates with identity providers using OAuth2. Its primary mechanism is credential vending: Polaris generates scoped, short-lived AWS IAM, GCP IAM, or Azure SAS security tokens, allowing engines like Dremio or Spark to access only the specific cloud storage locations needed for a query, preventing broad storage access.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Project Nessie:&lt;/strong&gt; A transactional catalog database that brings Git-like version control to data lakehouses. It tracks commits in a database backend (such as PostgreSQL, DynamoDB, or Cassandra) and structures metadata references as a commit graph. This allows teams to create branches (for example, &lt;code&gt;CREATE BRANCH dev FROM main&lt;/code&gt;), perform multi-table ETL transformations in isolation, verify data quality, and merge the branch back to the production branch atomically, guaranteeing that queries running on production never see partial updates.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;3. Misconception: &amp;quot;Apache Iceberg Causes Partition Lock-In&amp;quot;&lt;/h2&gt;
&lt;p&gt;Under the legacy Hive table format, partition layouts are hard-coded into the directory structure of the object storage (for example: &amp;lt;code&amp;gt;s3://bucket/table/year=2026/month=05/day=22/data.parquet&amp;lt;/code&amp;gt;). This directory-based partitioning model creates two severe limitations:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Query Leakage:&lt;/strong&gt; Users must manually include partition columns in their SQL filters (e.g., &amp;lt;code&amp;gt;WHERE year = 2026 AND month = 5 AND day = 22&amp;lt;/code&amp;gt;). If a user queries the table using only a timestamp filter (e.g., &amp;lt;code&amp;gt;WHERE event_time &amp;gt;= &apos;2026-05-22 00:00:00&apos;&amp;lt;/code&amp;gt;), the engine must list every single directory in the S3 bucket to find the data, causing query times to spike.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition Lock-In:&lt;/strong&gt; If the table&apos;s query patterns change (for example, switching from partitioning by day to partitioning by hour because data volume tripled), you must execute an expensive job to rewrite the entire historical dataset into a new directory structure.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Because of this legacy behavior, some developers assume that Apache Iceberg also locks you into whatever partition strategy you define during table creation.&lt;/p&gt;
&lt;h3&gt;Hidden Partitioning&lt;/h3&gt;
&lt;p&gt;Apache Iceberg eliminates both issues through a capability called &lt;strong&gt;Hidden Partitioning&lt;/strong&gt;. When you create an Iceberg table, you define partitions based on transformations of existing columns, rather than creating new virtual partition columns.&lt;/p&gt;
&lt;p&gt;For example, consider the following table definition:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE demo.db.user_events (
  event_id STRING,
  user_id STRING,
  event_time TIMESTAMP,
  payload STRING
)
USING iceberg
PARTITIONED BY (days(event_time));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Iceberg tracks partition values internally within the metadata of each data file. The physical directory structure on S3 is completely hidden from the user and the query engine.&lt;/p&gt;
&lt;p&gt;When a query is submitted:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT COUNT(*)
FROM demo.db.user_events
WHERE event_time BETWEEN &apos;2026-05-01 00:00:00&apos; AND &apos;2026-05-05 23:59:59&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Iceberg inspects the query filter, recognizes that the filter targets &amp;lt;code&amp;gt;event_time&amp;lt;/code&amp;gt;, and applies the daily partition transformation internally. It prunes out all partition files that do not fall within the requested date range before the query reaches the execution stage. Users do not need to know how the table is partitioned to write fast queries.&lt;/p&gt;
&lt;h3&gt;Partition Evolution&lt;/h3&gt;
&lt;p&gt;If your data volume increases and you decide that partitioning by day is no longer sufficient, you can evolve the partitioning spec instantly using a metadata-only command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE demo.db.user_events REPLACE PARTITION FIELD event_time WITH hours(event_time);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once this command is executed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;All new writes to the table are automatically partitioned by hour.&lt;/li&gt;
&lt;li&gt;The old historical data remains partitioned by day.&lt;/li&gt;
&lt;li&gt;Iceberg updates the table metadata JSON file to track two distinct partition specifications (spec 0 for daily, and spec 1 for hourly).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;When a query is run, Iceberg uses &lt;strong&gt;split-planning&lt;/strong&gt; to query the daily partition files using spec 0 and the hourly partition files using spec 1, stitching the results together seamlessly. The user does not need to know that the partition layout changed, and the organization avoids the time and cost of rewriting historical data.&lt;/p&gt;
&lt;p&gt;Here is the internal mechanics of split-planning. During the planning phase, the engine reads the partition spec history from the table metadata. If the query covers a time range spanning both spec 0 and spec 1, the engine splits the plan into two scan tasks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Task A (Spec 0):&lt;/strong&gt; Evaluates partition daily buckets for dates before the evolution event.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Task B (Spec 1):&lt;/strong&gt; Evaluates partition hourly buckets for timestamps after the evolution event.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The engine executes these scan tasks in parallel, and unions the output blocks. This design prevents partition layout evolution from ever requiring historical data migrations.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;4. Misconception: &amp;quot;Time Travel Requires Storing Infinite Duplicate Data&amp;quot;&lt;/h2&gt;
&lt;p&gt;A major benefit of Apache Iceberg is the ability to query previous snapshots of a table. This is highly useful for debugging, auditing, or running reproducible machine learning models. A common concern among data engineers is that keeping months of historical snapshots will cause storage costs on S3 to grow exponentially as duplicate data files accumulate.&lt;/p&gt;
&lt;h3&gt;How Snapshot-Based Metadata Sharing Works&lt;/h3&gt;
&lt;p&gt;To understand why storage costs do not grow exponentially, we must look at how Iceberg manages commits. When you write data to an Iceberg table, you do not write a new copy of the entire table. Instead, Iceberg uses &lt;strong&gt;metadata sharing&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[Snapshot 1 Metadata] ──────► [Manifest List 1] ──────► [Manifest File A] ──────► [Data File 1, Data File 2]

[Snapshot 2 Metadata] ──────► [Manifest List 2] ──────► [Manifest File A] ──────► [Data File 1, Data File 2]
                                                └──────► [Manifest File B] ──────► [Data File 3 (New Data)]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you append 1 million rows of new data to a table containing 100 million rows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Spark writes the new rows to a new Parquet file (e.g., &amp;lt;code&amp;gt;data_3.parquet&amp;lt;/code&amp;gt;).&lt;/li&gt;
&lt;li&gt;Spark writes a new manifest file (e.g., &amp;lt;code&amp;gt;manifest_b.avro&amp;lt;/code&amp;gt;) to track &amp;lt;code&amp;gt;data_3.parquet&amp;lt;/code&amp;gt;.&lt;/li&gt;
&lt;li&gt;Spark writes a new manifest list (e.g., &amp;lt;code&amp;gt;manifest_list_2.avro&amp;lt;/code&amp;gt;) that references both &amp;lt;code&amp;gt;manifest_a.avro&amp;lt;/code&amp;gt; (the old files) and &amp;lt;code&amp;gt;manifest_b.avro&amp;lt;/code&amp;gt; (the new file).&lt;/li&gt;
&lt;li&gt;The table metadata JSON is updated to register Snapshot 2, while retaining the pointer to Snapshot 1.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Because Snapshot 2 shares references to the data files created in Snapshot 1, the only additional storage cost is the size of the new Parquet file (&amp;lt;code&amp;gt;data_3.parquet&amp;lt;/code&amp;gt;). There is zero replication of the existing 100 million rows.&lt;/p&gt;
&lt;h3&gt;The Cost of Updates and Deletes&lt;/h3&gt;
&lt;p&gt;While appends are storage-efficient, update and delete operations do increase storage overhead. Under a &lt;strong&gt;Copy-On-Write (COW)&lt;/strong&gt; model, if you update 1 row in a Parquet file containing 1 million rows, the engine must write a new Parquet file containing the 999,999 unchanged rows plus the 1 updated row.&lt;/p&gt;
&lt;p&gt;The old Parquet file cannot be deleted immediately because it is still needed by Snapshot 1. Until Snapshot 1 is expired, both Parquet files remain on S3, creating write amplification and storage overhead.&lt;/p&gt;
&lt;p&gt;Under a &lt;strong&gt;Merge-On-Read (MOR)&lt;/strong&gt; model, the engine does not rewrite the base Parquet file. Instead, it writes a small positional delete file or equality delete file indicating that the specific row was modified, along with a new Parquet file containing only the updated record. This limits write amplification but increases the number of small files that query engines must read and merge at runtime.&lt;/p&gt;
&lt;h3&gt;Mathematical Comparison of Copy-On-Write and Merge-On-Read&lt;/h3&gt;
&lt;p&gt;To understand the operational trade-offs, we can evaluate the write amplification mathematically. Consider a table where each data file is exactly 256 MB and contains approximately 1 million records. If an ETL job updates a single record:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Copy-On-Write (COW):&lt;/strong&gt; The engine must read the entire 256 MB file, modify the target record in memory, and write a new 256 MB Parquet file. This represents a write amplification factor of 1,000,000 to 1, consuming significant disk I/O and network bandwidth to object storage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Merge-On-Read (MOR):&lt;/strong&gt; The engine writes only a small delete file (approximately 10 KB) containing the file path and position index of the modified record, plus a small insert file (approximately 5 KB) containing the updated values. The write amplification is effectively zero.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;However, during a read query on this MOR table, the query engine must scan the 256 MB base file, read the 10 KB delete file, build an in-memory hash set of deleted row positions, and filter them out before joining or aggregating. If there are many delete files, query performance degrades significantly.&lt;/p&gt;
&lt;p&gt;To control these models, you configure Iceberg table properties:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;write.update.mode&lt;/code&gt;: Sets the update format to either &lt;code&gt;copy-on-write&lt;/code&gt; or &lt;code&gt;merge-on-read&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;write.delete.mode&lt;/code&gt;: Sets the delete format to &lt;code&gt;copy-on-write&lt;/code&gt; or &lt;code&gt;merge-on-read&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;write.merge.mode&lt;/code&gt;: Sets the merge format (for SQL &lt;code&gt;MERGE INTO&lt;/code&gt; statements) to &lt;code&gt;copy-on-write&lt;/code&gt; or &lt;code&gt;merge-on-read&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Regular compaction using &lt;code&gt;rewrite_data_files&lt;/code&gt; and &lt;code&gt;rewrite_position_deletes&lt;/code&gt; is necessary to merge MOR delete files back into clean data files, reconciling the storage footprint and query performance.&lt;/p&gt;
&lt;h3&gt;Managing Storage Lifecycle and Table Maintenance&lt;/h3&gt;
&lt;p&gt;To prevent historical snapshots from creating runaway storage costs, you must configure a snapshot retention policy and run regular maintenance procedures.&lt;/p&gt;
&lt;h4&gt;Snapshot Expiration&lt;/h4&gt;
&lt;p&gt;To clean up old, unneeded snapshots, run the &amp;lt;code&amp;gt;expire_snapshots&amp;lt;/code&amp;gt; procedure. This removes the references to old snapshots in the metadata JSON and physically deletes any orphaned Parquet files from S3 that are no longer referenced by any active snapshot.&lt;/p&gt;
&lt;p&gt;For example, in Spark SQL, you can run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CALL demo.system.expire_snapshots(
  table =&amp;gt; &apos;demo.db.user_events&apos;,
  older_than =&amp;gt; TIMESTAMP &apos;2026-05-15 00:00:00&apos;,
  retain_last =&amp;gt; 5
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Behind the scenes, the snapshot expiration algorithm executes these steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Identify all snapshots in the metadata JSON that are older than the specified timestamp.&lt;/li&gt;
&lt;li&gt;Filter this list to preserve the last N snapshots defined by the &amp;lt;code&amp;gt;retain_last&amp;lt;/code&amp;gt; parameter.&lt;/li&gt;
&lt;li&gt;Traverse the manifest lists of all surviving snapshots, compiling a set of all active data and delete files.&lt;/li&gt;
&lt;li&gt;Traverse the manifest lists of the snapshots being expired. Find any files that are referenced in the expired snapshots but are absent from the active file set.&lt;/li&gt;
&lt;li&gt;Physically delete these orphaned data and delete files from the object storage bucket.&lt;/li&gt;
&lt;li&gt;Write a new table metadata JSON file excluding the expired snapshot references, and commit the catalog pointer.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;Removing Orphan Files&lt;/h4&gt;
&lt;p&gt;Occasionally, failed write jobs or network dropouts can leave Parquet files on S3 that were never committed to any metadata file. These files are invisible to Iceberg but still incur storage costs. Run the &amp;lt;code&amp;gt;remove_orphan_files&amp;lt;/code&amp;gt; procedure to locate and delete these unreferenced files:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CALL demo.system.remove_orphan_files(
  table =&amp;gt; &apos;demo.db.user_events&apos;,
  older_than =&amp;gt; TIMESTAMP &apos;2026-05-20 00:00:00&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Compacting Data Files&lt;/h4&gt;
&lt;p&gt;Frequent small writes or Merge-On-Read updates create many small data and delete files, which degrades query performance. Run the compaction procedure (&amp;lt;code&amp;gt;rewrite_data_files&amp;lt;/code&amp;gt;) to merge small files into optimized 128 MB or 512 MB Parquet files:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CALL demo.system.rewrite_data_files(
  table =&amp;gt; &apos;demo.db.user_events&apos;,
  strategy =&amp;gt; &apos;sort&apos;,
  sort_order =&amp;gt; &apos;user_id ASC&apos;,
  options =&amp;gt; map(&apos;max-file-size-bytes&apos;, &apos;536870912&apos;)
);
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h2&gt;5. Misconception: &amp;quot;Data Lakehouses Lack Fine-Grained Security and Governance&amp;quot;&lt;/h2&gt;
&lt;p&gt;A persistent argument from traditional data warehouse advocates is that open data lakes lack the security controls required by enterprise organizations. They argue that because files are stored openly on S3, you cannot enforce role-based access control (RBAC), row-level filtering, or column-level masking without placing a proprietary database engine in front of the storage bucket.&lt;/p&gt;
&lt;h3&gt;Credential Vending and Access Control&lt;/h3&gt;
&lt;p&gt;This is a misconception that ignores the capabilities of modern open REST Catalogs and semantic layers.&lt;/p&gt;
&lt;p&gt;In an open lakehouse architecture, security is enforced at two distinct levels:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Metadata and Pointer Security (The Catalog):&lt;/strong&gt; Open REST Catalogs like &lt;strong&gt;Apache Polaris&lt;/strong&gt; implement &lt;strong&gt;credential vending&lt;/strong&gt;. When a query engine requests the location of an Iceberg table, it must authenticate with Polaris using OAuth2 tokens. Polaris checks the engine&apos;s role-based access permissions. If authorized, Polaris contacts S3 to generate short-lived, read-only security credentials (like AWS IAM session tokens) for the specific paths containing the table&apos;s data files. The query engine never has permanent read or write access to the raw S3 bucket.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Access Control Policies in Polaris:&lt;/strong&gt; Polaris allows you to define granular access policies on namespaces and tables, ensuring that different compute engines or tenant groups only see the metadata they are authorized to access.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Fine-Grained Security in the Semantic Layer&lt;/h3&gt;
&lt;p&gt;While the catalog secures the files on object storage, a semantic layer like &lt;strong&gt;Dremio&lt;/strong&gt; enforces fine-grained role-based access control (RBAC), row-level filtering, and column-level masking before results are returned to users or AI agents.&lt;/p&gt;
&lt;h4&gt;Enforcing Row-Level Security&lt;/h4&gt;
&lt;p&gt;If you want sales representatives to only see customer data from their own region, you can define a row-level security policy directly in Dremio. The engine automatically appends filtering conditions to the generated query plan before scanning the Iceberg tables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE ROW FILTER demo.db.customers.region_filter
ON demo.db.customers
USING (
  CASE
    WHEN IS_MEMBER(&apos;Admins&apos;) THEN TRUE
    ELSE region = CURRENT_USER_REGION()
  END
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Enforcing Column-Masking&lt;/h4&gt;
&lt;p&gt;Similarly, you can mask sensitive columns (like social security numbers or email addresses) based on the user&apos;s role:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE COLUMN MASKING POLICY demo.db.customers.ssn_mask
ON demo.db.customers (ssn)
USING (
  CASE
    WHEN IS_MEMBER(&apos;HR_Compliance&apos;) THEN ssn
    ELSE &apos;XXX-XX-XXXX&apos;
  END
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because these security policies are defined at the engine and semantic tier, they are applied dynamically at query execution time. The underlying Parquet files remain unchanged on S3, allowing you to maintain a single copy of data while securing it for different user groups.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;6. Misconception: &amp;quot;Iceberg Tables Are Tightly Bound to Apache Spark&amp;quot;&lt;/h2&gt;
&lt;p&gt;Because Apache Iceberg was originally designed by Netflix using Java libraries, and early adoptions were heavily focused on Spark pipelines, a lingering misconception is that Iceberg requires a Spark cluster or Java-based environment to operate.&lt;/p&gt;
&lt;h3&gt;Multi-Engine and Language Interoperability&lt;/h3&gt;
&lt;p&gt;Today, Apache Iceberg is supported by almost every major data platform and query engine. You can read and write Iceberg tables using a wide variety of tools, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio:&lt;/strong&gt; An Iceberg-native SQL engine optimized for high-performance BI and interactive queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Flink:&lt;/strong&gt; Optimized for low-latency streaming write pipelines and Change Data Capture (CDC).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trino:&lt;/strong&gt; Optimized for high-throughput ad-hoc SQL querying across diverse sources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snowflake and Google BigQuery:&lt;/strong&gt; Both platforms support Iceberg tables as first-class storage targets, allowing you to query Iceberg tables directly on your own S3 or GCS buckets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DuckDB:&lt;/strong&gt; A local, single-node SQL engine that can read Iceberg metadata and query Parquet files directly on your laptop.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Non-Java SDKs: PyIceberg and Rust&lt;/h3&gt;
&lt;p&gt;The introduction of &lt;strong&gt;PyIceberg&lt;/strong&gt; (a pure Python implementation of the Iceberg specification) and the &lt;strong&gt;Iceberg-Rust&lt;/strong&gt; libraries has decoupled the format from the Java Virtual Machine (JVM).&lt;/p&gt;
&lt;p&gt;Data scientists and machine learning engineers can now read Iceberg tables directly into Python dataframes (like Pandas or Polars) without running a Java gateway or Spark session:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from pyiceberg.catalog import load_catalog

# Connect to the Apache Polaris REST Catalog
catalog = load_catalog(
    &amp;quot;polaris&amp;quot;,
    **{
        &amp;quot;uri&amp;quot;: &amp;quot;http://polaris-service:8181/api/v1&amp;quot;,
        &amp;quot;token&amp;quot;: &amp;quot;my-oauth-token&amp;quot;,
        &amp;quot;warehouse&amp;quot;: &amp;quot;demo&amp;quot;
    }
)

# Load the Iceberg table metadata
table = catalog.load_table(&amp;quot;db.user_events&amp;quot;)

# Query and load data directly into a Polars DataFrame
df = table.scan(
    row_filter=&amp;quot;event_time &amp;gt;= &apos;2026-05-01T00:00:00Z&apos;&amp;quot;,
    selected_fields=(&amp;quot;user_id&amp;quot;, &amp;quot;event_time&amp;quot;)
).to_arrow()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This flexibility allows organizations to build unified data pipelines where data engineers use Java/Scala in Spark for heavy transformations, while data scientists use Python/PyIceberg on their local workstations to train models on the same datasets.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;7. Comparative Summary of Misconceptions&lt;/h2&gt;
&lt;p&gt;To clarify these architectural truths, refer to the summary reference table below:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&quot;text-align:left&quot;&gt;Feature / Topic&lt;/th&gt;
&lt;th style=&quot;text-align:left&quot;&gt;Legacy Misconception&lt;/th&gt;
&lt;th style=&quot;text-align:left&quot;&gt;Modern Architectural Truth&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&quot;text-align:left&quot;&gt;&lt;strong&gt;Compute Engines&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Locked into a single database runtime or storage vendor.&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Decoupled; Spark, Dremio, Flink, and Snowflake query the same files.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;text-align:left&quot;&gt;&lt;strong&gt;Catalogs&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Iceberg is a standalone catalog that replaces Glue, Polaris, or Nessie.&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Iceberg is the table metadata format; catalogs coordinate atomic pointer commits.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;text-align:left&quot;&gt;&lt;strong&gt;Partitioning&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Partitioning layout is rigid, leaks into SQL, and requires full table rewrites.&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Hidden Partitioning resolves query leakage; partition specs evolve instantly without rewrites.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;text-align:left&quot;&gt;&lt;strong&gt;Storage Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Retaining snapshots for time travel duplicates data and inflates S3 bills.&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Data files are shared across snapshots; only updates/deletes write new files.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;text-align:left&quot;&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Open object storage cannot support fine-grained RBAC, masking, or row filters.&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;REST Catalogs vend temporary credentials; Dremio enforces Row/Column RBAC.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;text-align:left&quot;&gt;&lt;strong&gt;Language Lock-in&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Iceberg requires Spark, Java, or JVM-based environments.&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Decoupled; PyIceberg and Iceberg-Rust support Python, DuckDB, and Rust natively.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;By understanding the underlying mechanics of Apache Iceberg and the open lakehouse architecture, you can avoid common design mistakes. Decoupling storage from compute, utilizing REST catalogs for security, leveraging hidden partitioning for schema evolution, and running regular snapshot expiration procedures ensures that your data platform remains performant, secure, and adaptable as your workloads scale.&lt;/p&gt;
&lt;p&gt;If you are ready to evaluate format performance under real workloads, check out our guide on &lt;a href=&quot;/benchmarks/open-table-formats/&quot;&gt;benchmarking open table formats&lt;/a&gt; or learn more about &lt;a href=&quot;/apache-iceberg/&quot;&gt;Apache Iceberg Architecture&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Migrating to Apache Iceberg: Strategies for Every Source System</title><link>https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-15/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-15/</guid><description>
## Apache Iceberg Masterclass - Table of Contents

1. [What Are Table Formats and Why Were They Needed?](/posts/2026-04-29-iceberg-masterclass-01/)
2...</description><pubDate>Wed, 29 Apr 2026 12:14:00 GMT</pubDate><content:encoded>&lt;h2&gt;Apache Iceberg Masterclass - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-02/&quot;&gt;The Metadata Structure of Modern Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-03/&quot;&gt;Performance and Apache Iceberg&apos;s Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-04/&quot;&gt;Partition Evolution: Change Your Partitioning Without Rewriting Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-05/&quot;&gt;Hidden Partitioning: How Iceberg Eliminates Accidental Full Table Scans&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-06/&quot;&gt;Writing to an Apache Iceberg Table: How Commits and ACID Actually Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-07/&quot;&gt;What Are Lakehouse Catalogs? The Role of Catalogs in Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-08/&quot;&gt;When Catalogs Are Embedded in Storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-09/&quot;&gt;How Data Lake Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;Maintaining Apache Iceberg Tables: Compaction, Expiry, and Cleanup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-11/&quot;&gt;Apache Iceberg Metadata Tables: Querying the Internals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-12/&quot;&gt;Using Apache Iceberg with Python and MPP Query Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-13/&quot;&gt;Approaches to Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-14/&quot;&gt;Hands-On with Apache Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-15/&quot;&gt;Migrating to Apache Iceberg: Strategies for Every Source System&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 15, the final article of a 15-part &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-14/&quot;&gt;Part 14&lt;/a&gt; covered hands-on Dremio Cloud. This article covers the three migration strategies and how to execute a zero-downtime migration using the view swap pattern.&lt;/p&gt;
&lt;p&gt;Most organizations do not start with Iceberg. They have years of data in Hive tables, data warehouses, CSV files, databases, and Parquet directories. Moving this data to Iceberg is not an all-or-nothing project. The best migrations happen incrementally, one dataset at a time, with no disruption to existing consumers.&lt;/p&gt;
&lt;h2&gt;Three Migration Strategies&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/migration-strategies.png&quot; alt=&quot;Three paths to Iceberg: in-place migration, full rewrite, and shadow migration&quot;&gt;&lt;/p&gt;
&lt;h3&gt;1. In-Place Migration (Metadata Only)&lt;/h3&gt;
&lt;p&gt;In-place migration creates Iceberg metadata over existing Parquet or ORC files without copying or moving them. The data files stay exactly where they are; only new Iceberg metadata is created to track them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Spark example:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CALL system.migrate(&apos;db.existing_hive_table&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This converts a Hive table to Iceberg by scanning its files and creating the Iceberg metadata tree (metadata.json, manifest list, manifest files) that references them. The Parquet files are untouched.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Fast. No data movement. The table becomes queryable as Iceberg immediately.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt; The existing file layout (sizes, partitioning, sort order) is inherited. If the original files are poorly organized, you inherit those problems. Requires the original files to be in Parquet or ORC format.&lt;/p&gt;
&lt;h3&gt;2. Full Rewrite (CTAS)&lt;/h3&gt;
&lt;p&gt;A full rewrite reads data from any source and writes it as a new Iceberg table with optimal partitioning and file sizes:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Spark
CREATE TABLE iceberg_catalog.analytics.orders
USING iceberg
PARTITIONED BY (day(order_date))
AS SELECT * FROM hive_catalog.legacy.orders

-- Dremio
CREATE TABLE analytics.orders
PARTITION BY (day(order_date))
AS SELECT * FROM legacy_source.public.orders
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Best result. Optimal file sizes, correct sort order, proper partitioning. The table is perfectly organized from day one.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt; Requires reading and writing all data, which takes time and compute resources. The source system must be available during the migration.&lt;/p&gt;
&lt;h3&gt;3. Shadow Migration (Build and Swap)&lt;/h3&gt;
&lt;p&gt;Shadow migration builds the Iceberg table alongside the existing source, then swaps consumers from old to new when ready:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a new Iceberg table with the desired schema and partitioning&lt;/li&gt;
&lt;li&gt;Backfill historical data from the legacy source&lt;/li&gt;
&lt;li&gt;Set up incremental sync to keep the Iceberg table current&lt;/li&gt;
&lt;li&gt;Validate data quality between old and new&lt;/li&gt;
&lt;li&gt;Swap consumer views from legacy to Iceberg&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Zero downtime. Consumers never see a disruption. You can validate the migration before committing to it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt; Temporarily doubles storage costs. Requires maintaining two copies during the transition.&lt;/p&gt;
&lt;h2&gt;Choosing the Right Strategy&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/migration-decision-tree.png&quot; alt=&quot;Decision tree for selecting the right migration strategy based on downtime tolerance and layout changes&quot;&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Recommended Strategy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hive table (Parquet files)&lt;/td&gt;
&lt;td&gt;In-place migration, then compact&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data warehouse (Snowflake, Redshift)&lt;/td&gt;
&lt;td&gt;Full rewrite via &lt;a href=&quot;https://www.dremio.com/platform/federation/&quot;&gt;Dremio federation&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CSV/JSON files in S3&lt;/td&gt;
&lt;td&gt;Full rewrite with &lt;a href=&quot;https://www.dremio.com/blog/ingesting-data-into-apache-iceberg-tables-with-dremio/&quot;&gt;COPY INTO&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL/MySQL&lt;/td&gt;
&lt;td&gt;Full rewrite or shadow migration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Delta Lake tables&lt;/td&gt;
&lt;td&gt;In-place conversion or rewrite&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production system (no downtime)&lt;/td&gt;
&lt;td&gt;Shadow migration with view swap&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;The View Swap Pattern&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/view-swap-pattern.png&quot; alt=&quot;The zero-downtime view swap pattern: views point to legacy first, then switch to Iceberg&quot;&gt;&lt;/p&gt;
&lt;p&gt;The view swap pattern is the recommended approach for production migrations. It uses &lt;a href=&quot;https://www.dremio.com/platform/semantic-layer/&quot;&gt;Dremio&apos;s semantic layer&lt;/a&gt; to create an abstraction between consumers and the underlying data:&lt;/p&gt;
&lt;h3&gt;Phase 1: Federation&lt;/h3&gt;
&lt;p&gt;Create views in Dremio that point to the legacy data source:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.orders AS
SELECT order_id, customer_id, order_date, amount, status, region
FROM postgres_source.public.orders
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All consumers (dashboards, reports, notebooks) query through these views. They do not know or care where the data physically lives.&lt;/p&gt;
&lt;h3&gt;Phase 2: Build Iceberg&lt;/h3&gt;
&lt;p&gt;Create and populate the Iceberg table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Create the Iceberg table
CREATE TABLE iceberg_data.analytics.orders (
    order_id BIGINT, customer_id BIGINT,
    order_date DATE, amount DECIMAL(10,2),
    status VARCHAR, region VARCHAR
) PARTITION BY (day(order_date))

-- Backfill from the legacy source
INSERT INTO iceberg_data.analytics.orders
SELECT * FROM postgres_source.public.orders
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Phase 3: Validate&lt;/h3&gt;
&lt;p&gt;Compare the two datasets to confirm data integrity:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  (SELECT COUNT(*) FROM postgres_source.public.orders) AS legacy_count,
  (SELECT COUNT(*) FROM iceberg_data.analytics.orders) AS iceberg_count
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Beyond row counts, validate aggregates (total amounts, distinct customer counts) and spot-check individual records. A comprehensive validation script should compare:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Total row count&lt;/li&gt;
&lt;li&gt;Column-level checksums or hash aggregates&lt;/li&gt;
&lt;li&gt;Distinct value counts for key columns&lt;/li&gt;
&lt;li&gt;Boundary values (MIN/MAX) for numeric and date columns&lt;/li&gt;
&lt;li&gt;Sample of specific records matched by primary key&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Only proceed to the swap after all validation checks pass.&lt;/p&gt;
&lt;h3&gt;Phase 4: Swap&lt;/h3&gt;
&lt;p&gt;Update the view to point to the Iceberg table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW analytics.orders AS
SELECT order_id, customer_id, order_date, amount, status, region
FROM iceberg_data.analytics.orders
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Consumers notice nothing. The view name is the same. The query interface is the same. But now the data is served from Iceberg with all of its advantages: &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-11/&quot;&gt;time travel&lt;/a&gt;, &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-05/&quot;&gt;hidden partitioning&lt;/a&gt;, &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-03/&quot;&gt;metadata-driven pruning&lt;/a&gt;, and &lt;a href=&quot;https://www.dremio.com/blog/table-optimization-in-dremio/&quot;&gt;automatic optimization&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Migrating One Table at a Time&lt;/h2&gt;
&lt;p&gt;The view swap pattern enables incremental migration. You do not need to migrate everything at once:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Week 1:&lt;/strong&gt; Migrate the highest-value table (e.g., orders)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Week 2:&lt;/strong&gt; Migrate the next table (e.g., customers)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Continue&lt;/strong&gt; until all critical tables are on Iceberg&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;During the transition, &lt;a href=&quot;https://www.dremio.com/platform/federation/&quot;&gt;Dremio&apos;s federation&lt;/a&gt; queries legacy and Iceberg tables together. A join between a PostgreSQL table and an Iceberg table works the same as a join between two Iceberg tables. The migration is invisible to consumers.&lt;/p&gt;
&lt;h2&gt;Post-Migration Checklist&lt;/h2&gt;
&lt;p&gt;After migrating each table:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;OPTIMIZE TABLE&lt;/a&gt; to ensure optimal file sizes&lt;/li&gt;
&lt;li&gt;Set up automatic optimization through &lt;a href=&quot;https://www.dremio.com/platform/open-catalog/&quot;&gt;Dremio Open Catalog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Add wikis and tags for the &lt;a href=&quot;https://www.dremio.com/platform/ai/&quot;&gt;AI agent&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Verify &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-11/&quot;&gt;metadata table&lt;/a&gt; health checks&lt;/li&gt;
&lt;li&gt;Decommission the legacy source after the retention period&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Common Migration Pitfalls&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Migrating without testing query performance:&lt;/strong&gt; Always benchmark critical queries against the new Iceberg table before switching production traffic. Iceberg&apos;s partition layout and file organization affect performance, and a migration can make some queries faster but others slower if the partition strategy is wrong.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Skipping the validation phase:&lt;/strong&gt; Data discrepancies between the old and new systems are more common than expected. Schema differences, timezone handling, null semantics, and data type precision can all cause subtle mismatches. Validate thoroughly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrating everything at once:&lt;/strong&gt; Large &amp;quot;big bang&amp;quot; migrations carry high risk. If something goes wrong, rolling back is complex and time-consuming. Migrate one table at a time, validate each one, and build confidence incrementally.&lt;/p&gt;
&lt;p&gt;This completes the Apache Iceberg Masterclass. The series covered table formats, metadata, performance, partitioning, writes, catalogs, maintenance, tooling, and migration. For hands-on practice, start a &lt;a href=&quot;https://www.dremio.com/get-started/&quot;&gt;Dremio Cloud trial&lt;/a&gt; and follow the workflow in &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-14/&quot;&gt;Part 14&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Free Resources&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpageiceberg&quot;&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpagepolaris&quot;&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://forms.gle/xdsun6JiRvFY9rB36&quot;&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Hands-On with Apache Iceberg Using Dremio Cloud</title><link>https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-14/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-14/</guid><description>
## Apache Iceberg Masterclass - Table of Contents

1. [What Are Table Formats and Why Were They Needed?](/posts/2026-04-29-iceberg-masterclass-01/)
2...</description><pubDate>Wed, 29 Apr 2026 12:13:00 GMT</pubDate><content:encoded>&lt;h2&gt;Apache Iceberg Masterclass - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-02/&quot;&gt;The Metadata Structure of Modern Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-03/&quot;&gt;Performance and Apache Iceberg&apos;s Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-04/&quot;&gt;Partition Evolution: Change Your Partitioning Without Rewriting Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-05/&quot;&gt;Hidden Partitioning: How Iceberg Eliminates Accidental Full Table Scans&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-06/&quot;&gt;Writing to an Apache Iceberg Table: How Commits and ACID Actually Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-07/&quot;&gt;What Are Lakehouse Catalogs? The Role of Catalogs in Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-08/&quot;&gt;When Catalogs Are Embedded in Storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-09/&quot;&gt;How Data Lake Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;Maintaining Apache Iceberg Tables: Compaction, Expiry, and Cleanup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-11/&quot;&gt;Apache Iceberg Metadata Tables: Querying the Internals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-12/&quot;&gt;Using Apache Iceberg with Python and MPP Query Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-13/&quot;&gt;Approaches to Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-14/&quot;&gt;Hands-On with Apache Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-15/&quot;&gt;Migrating to Apache Iceberg: Strategies for Every Source System&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 14 of a 15-part &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-13/&quot;&gt;Part 13&lt;/a&gt; covered streaming approaches. This article is a practical walkthrough of working with Iceberg on &lt;a href=&quot;https://www.dremio.com/get-started/&quot;&gt;Dremio Cloud&lt;/a&gt;, covering table creation, data ingestion, optimization, semantic layer construction, and AI-powered analytics.&lt;/p&gt;
&lt;h2&gt;Getting Started&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/dremio-iceberg-workflow.png&quot; alt=&quot;From zero to Iceberg in six steps on Dremio Cloud&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Step 1: Sign Up and Connect Storage&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started/&quot;&gt;Create a Dremio Cloud account&lt;/a&gt; (free trial available)&lt;/li&gt;
&lt;li&gt;Add a cloud storage source (S3, ADLS, or GCS) through the Sources panel&lt;/li&gt;
&lt;li&gt;Configure credentials and target bucket&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio creates an &lt;a href=&quot;https://www.dremio.com/platform/open-catalog/&quot;&gt;Open Catalog&lt;/a&gt; for your Iceberg tables automatically. This Polaris-based catalog handles metadata management, access control, and automatic optimization.&lt;/p&gt;
&lt;h3&gt;Step 2: Create Iceberg Tables&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE analytics.orders (
    order_id BIGINT,
    customer_id BIGINT,
    order_date DATE,
    amount DECIMAL(10,2),
    status VARCHAR,
    region VARCHAR
)
PARTITION BY (day(order_date))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates a table with &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-05/&quot;&gt;hidden partitioning&lt;/a&gt; by day. Users query on &lt;code&gt;order_date&lt;/code&gt; naturally; the engine handles partition pruning automatically.&lt;/p&gt;
&lt;h3&gt;Step 3: Ingest Data&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;From files in object storage:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;COPY INTO analytics.orders
FROM &apos;@my_s3_source/raw/orders/&apos;
FILE_FORMAT &apos;parquet&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;From another table or source:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO analytics.orders
SELECT * FROM postgres_source.public.orders
WHERE order_date &amp;gt;= &apos;2024-01-01&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/platform/federation/&quot;&gt;Dremio&apos;s federation&lt;/a&gt; can query data in PostgreSQL, MySQL, Oracle, MongoDB, S3 files, and other sources directly. You can migrate data into Iceberg tables with a single INSERT...SELECT statement.&lt;/p&gt;
&lt;h2&gt;The Dremio Platform&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/dremio-platform-features.png&quot; alt=&quot;Dremio Cloud features for Iceberg including Open Catalog, federation, semantic layer, and AI&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Columnar Cloud Cache&lt;/h3&gt;
&lt;p&gt;Dremio&apos;s &lt;a href=&quot;https://www.dremio.com/blog/dremios-columnar-cloud-cache-c3/&quot;&gt;Columnar Cloud Cache (C3)&lt;/a&gt; stores frequently accessed Iceberg data on local NVMe SSDs attached to the query engine nodes. When a query accesses data for the first time, Dremio caches the relevant columns locally. Subsequent queries against the same data read from local SSD instead of remote object storage, reducing latency from hundreds of milliseconds to single-digit milliseconds.&lt;/p&gt;
&lt;p&gt;C3 operates transparently. You do not need to configure which data to cache. Dremio tracks access patterns and caches the most-queried data automatically.&lt;/p&gt;
&lt;h3&gt;Connecting BI Tools&lt;/h3&gt;
&lt;p&gt;Dremio exposes Iceberg data through ODBC, JDBC, and Arrow Flight endpoints. Any BI tool (Tableau, Power BI, Looker, Superset) can connect to Dremio and query Iceberg tables as if they were a traditional database. The semantic layer ensures consistent governance and naming across all connected tools.&lt;/p&gt;
&lt;h3&gt;Semantic Layer&lt;/h3&gt;
&lt;p&gt;Dremio&apos;s &lt;a href=&quot;https://www.dremio.com/platform/semantic-layer/&quot;&gt;semantic layer&lt;/a&gt; lets you create governed SQL views that serve as the interface between raw data and consumers:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.customer_orders AS
SELECT
    o.customer_id,
    c.customer_name,
    c.region,
    SUM(o.amount) AS total_spend,
    COUNT(*) AS order_count
FROM analytics.orders o
JOIN analytics.customers c ON o.customer_id = c.customer_id
GROUP BY o.customer_id, c.customer_name, c.region
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Add wikis and tags to views and tables through the Dremio UI. These descriptions help other users find and understand data, and they power the &lt;a href=&quot;https://www.dremio.com/platform/ai/&quot;&gt;AI agent&apos;s&lt;/a&gt; ability to generate accurate SQL from natural language.&lt;/p&gt;
&lt;h3&gt;Reflections (Query Acceleration)&lt;/h3&gt;
&lt;p&gt;Dremio Reflections are precomputed materializations that automatically accelerate queries without requiring changes to your SQL. When you create a reflection on a view or table, Dremio precomputes the results and stores them as optimized Iceberg tables on fast storage:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Create an aggregation reflection for fast dashboard queries
ALTER TABLE analytics.customer_orders
  CREATE AGGREGATE REFLECTION customer_orders_agg
  USING DIMENSIONS (region, order_date)
  MEASURES (total_spend SUM, order_count SUM)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When a query matches the reflection&apos;s definition, Dremio serves it from the precomputed data instead of scanning the full table. Queries that take 30 seconds against raw data can complete in under 1 second with reflections. The query optimizer chooses the reflection transparently, so users and applications do not need to know reflections exist.&lt;/p&gt;
&lt;h3&gt;Data Governance&lt;/h3&gt;
&lt;p&gt;Dremio provides column-level access control and row-level filtering directly in the &lt;a href=&quot;https://www.dremio.com/platform/semantic-layer/&quot;&gt;semantic layer&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Create a view that masks PII for non-privileged users
CREATE VIEW analytics.orders_masked AS
SELECT
    order_id,
    CASE WHEN is_member(&apos;finance_team&apos;) THEN customer_name
         ELSE &apos;***MASKED***&apos; END AS customer_name,
    order_date,
    amount
FROM analytics.orders
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Governance policies defined in the semantic layer apply consistently regardless of which tool (BI dashboard, Python notebook, AI agent) queries the data. This approach is more maintainable than duplicating access policies in every consuming application.&lt;/p&gt;
&lt;h3&gt;Query Federation&lt;/h3&gt;
&lt;p&gt;One of Dremio&apos;s unique capabilities is querying Iceberg tables alongside data in other systems:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join Iceberg table with a PostgreSQL table
SELECT i.order_id, i.amount, p.payment_status
FROM analytics.orders i
JOIN postgres_source.public.payments p
ON i.order_id = p.order_id
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This eliminates the need to move all data into Iceberg before you can query it. You can &lt;a href=&quot;https://www.dremio.com/blog/the-journey-from-scattered-data-to-an-apache-iceberg-lakehouse-with-governed-agentic-analytics/&quot;&gt;start with federation and migrate incrementally&lt;/a&gt;. Federation is especially useful during &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-15/&quot;&gt;migration&lt;/a&gt;: query legacy systems and Iceberg tables side by side, then swap the underlying source when you are ready.&lt;/p&gt;
&lt;h2&gt;Essential SQL Operations&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/dremio-sql-examples.png&quot; alt=&quot;Four essential Iceberg SQL operations on Dremio: CREATE, COPY INTO, OPTIMIZE, and time travel&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Table Optimization&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Compact small files
OPTIMIZE TABLE analytics.orders REWRITE DATA USING BIN_PACK

-- Compact with sorting for better file skipping
OPTIMIZE TABLE analytics.orders REWRITE DATA USING SORT (order_date, customer_id)

-- Expire old snapshots
ALTER TABLE analytics.orders EXPIRE SNAPSHOTS OLDER_THAN = &apos;2024-04-01 00:00:00&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For tables managed by &lt;a href=&quot;https://www.dremio.com/platform/open-catalog/&quot;&gt;Open Catalog&lt;/a&gt;, Dremio runs &lt;a href=&quot;https://www.dremio.com/blog/table-optimization-in-dremio/&quot;&gt;automatic table optimization&lt;/a&gt; in the background, handling compaction, expiry, and orphan cleanup without user intervention.&lt;/p&gt;
&lt;h3&gt;Time Travel&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query the table as of a specific timestamp
SELECT * FROM analytics.orders
AT TIMESTAMP &apos;2024-03-01 00:00:00&apos;

-- Compare current data to a previous snapshot
SELECT
    current_data.region,
    current_data.total - old_data.total AS growth
FROM (SELECT region, SUM(amount) AS total FROM analytics.orders GROUP BY region) current_data
JOIN (
    SELECT region, SUM(amount) AS total
    FROM analytics.orders AT TIMESTAMP &apos;2024-01-01&apos;
    GROUP BY region
) old_data ON current_data.region = old_data.region
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Metadata Inspection&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Check table health
SELECT AVG(file_size_in_bytes)/1048576 AS avg_mb, COUNT(*) AS files
FROM TABLE(table_files(&apos;analytics.orders&apos;))

-- Review recent snapshots
SELECT committed_at, operation, summary
FROM TABLE(table_snapshot(&apos;analytics.orders&apos;))
ORDER BY committed_at DESC LIMIT 5
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;AI-Powered Analytics&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s built-in &lt;a href=&quot;https://www.dremio.com/platform/ai/&quot;&gt;AI agent&lt;/a&gt; converts natural language questions into SQL queries using the semantic layer&apos;s wikis and tags as context:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;quot;Show me the top 10 customers by total spend this quarter&amp;quot;&lt;/li&gt;
&lt;li&gt;&amp;quot;What was the month-over-month revenue growth by region?&amp;quot;&lt;/li&gt;
&lt;li&gt;&amp;quot;Which products had the highest return rate last month?&amp;quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The AI agent generates standard SQL, meaning the results are transparent and auditable. Users can see exactly what SQL was generated, verify it, and refine it. This is different from black-box AI analytics tools that hide the underlying logic.&lt;/p&gt;
&lt;h3&gt;MCP Server for External AI Agents&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://www.dremio.com/blog/getting-started-with-the-dremio-mcp-server/&quot;&gt;MCP Server&lt;/a&gt; extends Dremio&apos;s data access to external AI agents and tools through the Model Context Protocol. LLMs running in Claude, ChatGPT, or custom agent frameworks can query your Iceberg lakehouse through MCP, inheriting all the governance, semantic context, and optimization that Dremio provides.&lt;/p&gt;
&lt;p&gt;This positions Dremio as the data layer for &lt;a href=&quot;https://www.dremio.com/platform/ai/&quot;&gt;agentic AI&lt;/a&gt; workflows: the AI agent asks questions in natural language, MCP translates them into governed SQL, and Dremio returns the results from optimized Iceberg tables.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-15/&quot;&gt;Part 15&lt;/a&gt; covers strategies for migrating existing data into Iceberg.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Free Resources&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpageiceberg&quot;&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpagepolaris&quot;&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://forms.gle/xdsun6JiRvFY9rB36&quot;&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Approaches to Streaming Data into Apache Iceberg Tables</title><link>https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-13/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-13/</guid><description>
## Apache Iceberg Masterclass - Table of Contents

1. [What Are Table Formats and Why Were They Needed?](/posts/2026-04-29-iceberg-masterclass-01/)
2...</description><pubDate>Wed, 29 Apr 2026 12:12:00 GMT</pubDate><content:encoded>&lt;h2&gt;Apache Iceberg Masterclass - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-02/&quot;&gt;The Metadata Structure of Modern Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-03/&quot;&gt;Performance and Apache Iceberg&apos;s Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-04/&quot;&gt;Partition Evolution: Change Your Partitioning Without Rewriting Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-05/&quot;&gt;Hidden Partitioning: How Iceberg Eliminates Accidental Full Table Scans&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-06/&quot;&gt;Writing to an Apache Iceberg Table: How Commits and ACID Actually Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-07/&quot;&gt;What Are Lakehouse Catalogs? The Role of Catalogs in Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-08/&quot;&gt;When Catalogs Are Embedded in Storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-09/&quot;&gt;How Data Lake Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;Maintaining Apache Iceberg Tables: Compaction, Expiry, and Cleanup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-11/&quot;&gt;Apache Iceberg Metadata Tables: Querying the Internals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-12/&quot;&gt;Using Apache Iceberg with Python and MPP Query Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-13/&quot;&gt;Approaches to Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-14/&quot;&gt;Hands-On with Apache Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-15/&quot;&gt;Migrating to Apache Iceberg: Strategies for Every Source System&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 13 of a 15-part &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-12/&quot;&gt;Part 12&lt;/a&gt; covered Python and MPP engines. This article covers the three primary approaches to streaming data into Iceberg tables and the operational trade-offs each creates.&lt;/p&gt;
&lt;p&gt;Iceberg was designed for batch analytics, but most production data arrives continuously. Streaming ingestion bridges this gap by committing data to Iceberg tables at regular intervals. The challenge is that frequent commits create the &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-09/&quot;&gt;small file problem&lt;/a&gt;, and managing that trade-off between data freshness and table health is the central concern of streaming to Iceberg.&lt;/p&gt;
&lt;h2&gt;Three Streaming Architectures&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/streaming-approaches.png&quot; alt=&quot;Three approaches to streaming data into Iceberg: Spark, Flink, and Kafka Connect&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Spark Structured Streaming&lt;/h3&gt;
&lt;p&gt;Spark Structured Streaming processes data in micro-batches and commits to Iceberg at configurable intervals:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;df = spark.readStream.format(&amp;quot;kafka&amp;quot;) \
    .option(&amp;quot;subscribe&amp;quot;, &amp;quot;events&amp;quot;) \
    .load()

df.writeStream.format(&amp;quot;iceberg&amp;quot;) \
    .outputMode(&amp;quot;append&amp;quot;) \
    .option(&amp;quot;checkpointLocation&amp;quot;, &amp;quot;s3://checkpoint/events&amp;quot;) \
    .trigger(processingTime=&amp;quot;60 seconds&amp;quot;) \
    .toTable(&amp;quot;analytics.events&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each trigger creates a new Iceberg commit with the accumulated data. A 60-second trigger produces 1,440 commits per day, each adding a small number of files.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Seconds to minutes (configurable via trigger interval).
&lt;strong&gt;Small file impact:&lt;/strong&gt; Moderate. Longer trigger intervals produce fewer, larger files.
&lt;strong&gt;Best for:&lt;/strong&gt; Teams already using Spark for batch processing who want to add near-real-time ingestion.&lt;/p&gt;
&lt;h3&gt;Apache Flink Iceberg Sink&lt;/h3&gt;
&lt;p&gt;Flink processes events continuously and commits to Iceberg at checkpoint intervals:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Flink SQL
INSERT INTO iceberg_catalog.analytics.events
SELECT event_id, event_time, payload
FROM kafka_source
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Flink&apos;s checkpointing mechanism determines commit frequency. A 30-second checkpoint interval produces commits every 30 seconds with whatever data has accumulated.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Exactly-once semantics:&lt;/strong&gt; Flink&apos;s checkpoint mechanism provides exactly-once delivery guarantees to Iceberg. If a Flink job crashes, it recovers from its last checkpoint and replays any data that was not yet committed to Iceberg. This means no duplicate records and no data loss, which is critical for financial and transactional data pipelines.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Partitioned writes:&lt;/strong&gt; Flink can route events to partitions dynamically based on partition transforms. Combined with Iceberg&apos;s &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-05/&quot;&gt;hidden partitioning&lt;/a&gt;, this means streaming data lands in the correct partition directory automatically without any special logic in the streaming application.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Upserts and CDC:&lt;/strong&gt; Flink supports changelog streams (insert, update, delete operations) and can write them to Iceberg as equality deletes and data files. This enables CDC (change data capture) patterns where a database&apos;s transaction log is streamed directly into an Iceberg table, maintaining a near-real-time copy.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Seconds (tied to checkpoint interval).
&lt;strong&gt;Small file impact:&lt;/strong&gt; High. Frequent checkpoints produce many small files.
&lt;strong&gt;Best for:&lt;/strong&gt; Teams needing the lowest-latency streaming with exactly-once semantics and CDC support.&lt;/p&gt;
&lt;h3&gt;Kafka Connect Iceberg Sink&lt;/h3&gt;
&lt;p&gt;The Iceberg Sink Connector reads directly from Kafka topics and writes to Iceberg tables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;name&amp;quot;: &amp;quot;iceberg-sink&amp;quot;,
  &amp;quot;config&amp;quot;: {
    &amp;quot;connector.class&amp;quot;: &amp;quot;org.apache.iceberg.connect.IcebergSinkConnector&amp;quot;,
    &amp;quot;topics&amp;quot;: &amp;quot;events&amp;quot;,
    &amp;quot;iceberg.catalog.type&amp;quot;: &amp;quot;rest&amp;quot;,
    &amp;quot;iceberg.catalog.uri&amp;quot;: &amp;quot;https://catalog.example.com&amp;quot;,
    &amp;quot;iceberg.tables&amp;quot;: &amp;quot;analytics.events&amp;quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Minutes (Kafka Connect batches records before committing).
&lt;strong&gt;Small file impact:&lt;/strong&gt; Lower than Spark/Flink because commits are less frequent.
&lt;strong&gt;Best for:&lt;/strong&gt; Organizations with existing Kafka infrastructure that want a managed connector approach.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Apache Iceberg Sink Connector:&lt;/strong&gt; The community-maintained Iceberg Sink Connector for Kafka Connect supports schema evolution from Kafka&apos;s Schema Registry, automatic table creation, and partition routing. It reads records from Kafka topics, buffers them in memory, and commits to Iceberg in configurable batch intervals.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Operational simplicity:&lt;/strong&gt; Kafka Connect is a managed framework. You deploy the connector configuration, and Kafka Connect handles scaling, offset management, and fault recovery. There is no custom application code to write or maintain. For organizations that already run Kafka Connect for other sinks (databases, search indexes), adding an Iceberg sink is straightforward.&lt;/p&gt;
&lt;h2&gt;The Streaming + Compaction Cycle&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/streaming-compaction-cycle.png&quot; alt=&quot;Why streaming creates small files and how compaction fixes them in a continuous cycle&quot;&gt;&lt;/p&gt;
&lt;p&gt;Every streaming approach shares the same fundamental problem: frequent commits produce small files. The solution is to pair streaming ingestion with aggressive &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;compaction&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A typical production pattern:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Stream data in&lt;/strong&gt; via Flink or Spark with 60-second commit intervals&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Run compaction&lt;/strong&gt; every hour to merge small files from the last hour into optimally-sized files&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Expire snapshots&lt;/strong&gt; daily to clean up the accumulated snapshot metadata&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/table-optimization-in-dremio/&quot;&gt;Dremio&apos;s automatic table optimization&lt;/a&gt; handles this compaction automatically for tables managed by Open Catalog. AWS &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-08/&quot;&gt;S3 Tables&lt;/a&gt; also provides built-in compaction for streaming workloads.&lt;/p&gt;
&lt;h2&gt;The Latency vs. Maintenance Trade-off&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/latency-vs-maintenance.png&quot; alt=&quot;The spectrum from real-time to batch showing how latency affects small file production&quot;&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Commit Frequency&lt;/th&gt;
&lt;th&gt;Files/Day&lt;/th&gt;
&lt;th&gt;Compaction Need&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Flink (30s checkpoint)&lt;/td&gt;
&lt;td&gt;Every 30 seconds&lt;/td&gt;
&lt;td&gt;5,000+&lt;/td&gt;
&lt;td&gt;Very high&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spark (60s trigger)&lt;/td&gt;
&lt;td&gt;Every 60 seconds&lt;/td&gt;
&lt;td&gt;2,500+&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spark (5min trigger)&lt;/td&gt;
&lt;td&gt;Every 5 minutes&lt;/td&gt;
&lt;td&gt;300+&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kafka Connect&lt;/td&gt;
&lt;td&gt;Every few minutes&lt;/td&gt;
&lt;td&gt;500+&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch (hourly)&lt;/td&gt;
&lt;td&gt;Every hour&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The key insight: you do not always need sub-second latency. Most dashboards refresh every 5-15 minutes. If your consumers can tolerate 5-minute data freshness, using a 5-minute trigger interval produces 90% fewer small files and dramatically reduces compaction overhead.&lt;/p&gt;
&lt;h2&gt;Production Streaming Architecture&lt;/h2&gt;
&lt;p&gt;A production streaming-to-Iceberg pipeline typically includes four components:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Message queue&lt;/strong&gt; (Kafka, Kinesis, Pulsar): Buffers events from source systems&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stream processor&lt;/strong&gt; (Flink, Spark Streaming): Transforms and writes to Iceberg&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compaction service&lt;/strong&gt; (&lt;a href=&quot;https://www.dremio.com/blog/table-optimization-in-dremio/&quot;&gt;Dremio auto-optimization&lt;/a&gt;, Spark scheduled jobs): Merges small files on a recurring schedule&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitoring&lt;/strong&gt; (&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-11/&quot;&gt;metadata tables&lt;/a&gt;): Tracks file counts, sizes, and commit frequency&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The most common mistake in streaming Iceberg architectures is deploying the stream processor without the compaction service. Without compaction, query performance degrades within days. Always deploy both together.&lt;/p&gt;
&lt;h2&gt;Choosing the Right Approach&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sub-second latency&lt;/td&gt;
&lt;td&gt;Flink + aggressive compaction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1-5 minute latency&lt;/td&gt;
&lt;td&gt;Spark Structured Streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Existing Kafka infrastructure&lt;/td&gt;
&lt;td&gt;Kafka Connect sink&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Minimal ops overhead&lt;/td&gt;
&lt;td&gt;Batch ingestion with &lt;a href=&quot;https://www.dremio.com/blog/ingesting-data-into-apache-iceberg-tables-with-dremio/&quot;&gt;Dremio COPY INTO&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multiple downstream engines&lt;/td&gt;
&lt;td&gt;Any approach + REST catalog (&lt;a href=&quot;https://www.dremio.com/platform/open-catalog/&quot;&gt;Dremio Open Catalog&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Monitoring Streaming Health&lt;/h3&gt;
&lt;p&gt;After deploying a streaming pipeline, monitor these metrics daily using &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-11/&quot;&gt;metadata tables&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Commit frequency:&lt;/strong&gt; How many snapshots are being created per hour?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Average file size:&lt;/strong&gt; Is the small file problem growing?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compaction lag:&lt;/strong&gt; Are compaction jobs keeping up with the write rate?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;End-to-end latency:&lt;/strong&gt; How long between an event occurring and it being queryable in Iceberg?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A well-tuned streaming pipeline commits every 1-5 minutes, produces files of 32-128 MB per commit, and has compaction running every 30-60 minutes to consolidate the small files into 256 MB targets.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-14/&quot;&gt;Part 14&lt;/a&gt; provides a hands-on walkthrough of Iceberg on Dremio Cloud.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Free Resources&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpageiceberg&quot;&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpagepolaris&quot;&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://forms.gle/xdsun6JiRvFY9rB36&quot;&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Using Apache Iceberg with Python and MPP Query Engines</title><link>https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-12/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-12/</guid><description>
## Apache Iceberg Masterclass - Table of Contents

1. [What Are Table Formats and Why Were They Needed?](/posts/2026-04-29-iceberg-masterclass-01/)
2...</description><pubDate>Wed, 29 Apr 2026 12:11:00 GMT</pubDate><content:encoded>&lt;h2&gt;Apache Iceberg Masterclass - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-02/&quot;&gt;The Metadata Structure of Modern Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-03/&quot;&gt;Performance and Apache Iceberg&apos;s Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-04/&quot;&gt;Partition Evolution: Change Your Partitioning Without Rewriting Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-05/&quot;&gt;Hidden Partitioning: How Iceberg Eliminates Accidental Full Table Scans&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-06/&quot;&gt;Writing to an Apache Iceberg Table: How Commits and ACID Actually Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-07/&quot;&gt;What Are Lakehouse Catalogs? The Role of Catalogs in Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-08/&quot;&gt;When Catalogs Are Embedded in Storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-09/&quot;&gt;How Data Lake Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;Maintaining Apache Iceberg Tables: Compaction, Expiry, and Cleanup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-11/&quot;&gt;Apache Iceberg Metadata Tables: Querying the Internals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-12/&quot;&gt;Using Apache Iceberg with Python and MPP Query Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-13/&quot;&gt;Approaches to Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-14/&quot;&gt;Hands-On with Apache Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-15/&quot;&gt;Migrating to Apache Iceberg: Strategies for Every Source System&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 12 of a 15-part &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-11/&quot;&gt;Part 11&lt;/a&gt; covered metadata tables. This article covers the two main ways to access Iceberg data: directly from Python libraries and through MPP (massively parallel processing) query engines.&lt;/p&gt;
&lt;h2&gt;The Python Ecosystem for Iceberg&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/python-iceberg-stack.png&quot; alt=&quot;How Python libraries and MPP engines connect to Iceberg tables&quot;&gt;&lt;/p&gt;
&lt;h3&gt;PyIceberg: Native Python Access&lt;/h3&gt;
&lt;p&gt;PyIceberg is the official Python library for Apache Iceberg. It reads Iceberg metadata directly and can scan data files without an external query engine.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from pyiceberg.catalog import load_catalog

# Connect to a REST catalog
catalog = load_catalog(&amp;quot;my_catalog&amp;quot;, **{
    &amp;quot;type&amp;quot;: &amp;quot;rest&amp;quot;,
    &amp;quot;uri&amp;quot;: &amp;quot;https://catalog.example.com&amp;quot;,
})

# Load and scan a table
table = catalog.load_table(&amp;quot;analytics.orders&amp;quot;)
scan = table.scan(row_filter=&amp;quot;amount &amp;gt; 100&amp;quot;)
df = scan.to_pandas()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/pyiceberg-workflow.png&quot; alt=&quot;The five-step PyIceberg workflow from catalog connection to analysis&quot;&gt;&lt;/p&gt;
&lt;p&gt;PyIceberg leverages Iceberg&apos;s &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-03/&quot;&gt;metadata-driven pruning&lt;/a&gt;: the &lt;code&gt;row_filter&lt;/code&gt; is pushed down to manifest evaluation, so only relevant data files are read. For reading subsets of large tables into Python for analysis or ML training, this is remarkably efficient.&lt;/p&gt;
&lt;p&gt;PyIceberg also supports writes (appending data from Arrow tables), schema evolution, and table management operations. It connects to any catalog that implements the REST protocol, including &lt;a href=&quot;https://www.dremio.com/platform/open-catalog/&quot;&gt;Dremio Open Catalog&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;DuckDB: SQL-Based Python Analysis&lt;/h3&gt;
&lt;p&gt;DuckDB can read Iceberg tables through its Iceberg extension:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import duckdb

conn = duckdb.connect()
conn.execute(&amp;quot;INSTALL iceberg; LOAD iceberg;&amp;quot;)

df = conn.execute(&amp;quot;&amp;quot;&amp;quot;
    SELECT customer_id, SUM(amount) as total
    FROM iceberg_scan(&apos;s3://warehouse/orders&apos;)
    GROUP BY customer_id
&amp;quot;&amp;quot;&amp;quot;).fetchdf()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;DuckDB processes the query locally using its columnar execution engine, which is significantly faster than pandas for analytical queries. It supports Iceberg&apos;s partition pruning and column statistics for file skipping. DuckDB runs entirely in-process, so there is no separate server to manage. This makes it a strong choice for local analysis, CI/CD data validation, and notebooks where starting a Spark cluster would be overkill.&lt;/p&gt;
&lt;p&gt;DuckDB also supports reading Iceberg metadata tables, which means you can use it for &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-09/&quot;&gt;table health diagnostics&lt;/a&gt; without standing up a full query engine.&lt;/p&gt;
&lt;h3&gt;Polars: High-Performance DataFrames&lt;/h3&gt;
&lt;p&gt;Polars can read Iceberg tables through its &lt;code&gt;scan_iceberg&lt;/code&gt; method, providing lazy evaluation and parallel processing:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import polars as pl

df = pl.scan_iceberg(&amp;quot;s3://warehouse/orders&amp;quot;).filter(
    pl.col(&amp;quot;amount&amp;quot;) &amp;gt; 100
).collect()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Polars uses a lazy evaluation model: the &lt;code&gt;scan_iceberg&lt;/code&gt; call does not read data immediately. Instead, it builds an execution plan. When &lt;code&gt;collect()&lt;/code&gt; is called, Polars optimizes the plan (predicate pushdown, column pruning, parallel reads) and executes it. For large Iceberg tables, Polars can scan data several times faster than pandas because it uses all available CPU cores and processes data in Apache Arrow columnar format.&lt;/p&gt;
&lt;h3&gt;Writing from Python&lt;/h3&gt;
&lt;p&gt;PyIceberg supports writes through Apache Arrow tables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pyarrow as pa

# Create an Arrow table with new data
new_data = pa.table({
    &amp;quot;order_id&amp;quot;: [1001, 1002, 1003],
    &amp;quot;amount&amp;quot;: [150.00, 275.50, 89.99],
    &amp;quot;order_date&amp;quot;: [&amp;quot;2024-03-15&amp;quot;, &amp;quot;2024-03-15&amp;quot;, &amp;quot;2024-03-16&amp;quot;],
})

# Append to the Iceberg table
table.append(new_data)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates a new Iceberg &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-06/&quot;&gt;commit&lt;/a&gt; with the data files, manifests, and metadata. PyIceberg handles the entire write lifecycle, including partition assignment based on the table&apos;s &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-05/&quot;&gt;partition spec&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For bulk writes from Python, using PyIceberg with Arrow is often simpler than setting up Spark. However, PyIceberg runs on a single machine, so it is not suitable for writing terabyte-scale datasets. For that, use an MPP engine.&lt;/p&gt;
&lt;h2&gt;MPP Query Engines&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/mpp-engine-comparison.png&quot; alt=&quot;Comparison of MPP engines for Iceberg workloads showing read, write, and maintenance capabilities&quot;&gt;&lt;/p&gt;
&lt;p&gt;For production workloads at scale, Python libraries running on a single machine are not sufficient. MPP engines distribute query execution across multiple nodes, handling petabyte-scale tables with sub-minute response times.&lt;/p&gt;
&lt;h3&gt;Dremio&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/&quot;&gt;Dremio&lt;/a&gt; provides full Iceberg support with several unique capabilities: &lt;a href=&quot;https://www.dremio.com/platform/federation/&quot;&gt;query federation&lt;/a&gt; across Iceberg and non-Iceberg sources, &lt;a href=&quot;https://www.dremio.com/blog/table-optimization-in-dremio/&quot;&gt;automatic table optimization&lt;/a&gt; through Open Catalog, a &lt;a href=&quot;https://www.dremio.com/platform/semantic-layer/&quot;&gt;semantic layer&lt;/a&gt; for governed access, and &lt;a href=&quot;https://www.dremio.com/platform/ai/&quot;&gt;AI-powered analytics&lt;/a&gt; through its built-in agent and MCP server.&lt;/p&gt;
&lt;p&gt;For Python users, Dremio exposes data through Apache Arrow Flight, which is a high-performance data transfer protocol. Arrow Flight sends data in columnar Arrow format directly to the client, avoiding the serialization overhead of JDBC/ODBC. This makes it 10-100x faster than traditional database connectors for large result sets:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from dremio_simple_query import DremioConnection

conn = DremioConnection(&amp;quot;https://your-dremio.cloud&amp;quot;, token=&amp;quot;...&amp;quot;)
df = conn.query(&amp;quot;SELECT * FROM analytics.orders WHERE amount &amp;gt; 100&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result is a pandas DataFrame populated via Arrow Flight. Because the data stays in Arrow format end-to-end (Iceberg Parquet to Dremio to Arrow Flight to pandas), there are no format conversion bottlenecks.&lt;/p&gt;
&lt;p&gt;Dremio also provides a &lt;a href=&quot;https://www.dremio.com/blog/dremios-columnar-cloud-cache-c3/&quot;&gt;Columnar Cloud Cache&lt;/a&gt; that stores frequently accessed data on local NVMe drives, making subsequent queries against the same Iceberg data dramatically faster without requiring reflections or materialized views.&lt;/p&gt;
&lt;h3&gt;Spark&lt;/h3&gt;
&lt;p&gt;Apache Spark is the most mature Iceberg engine for both reads and writes. It handles batch ETL, streaming ingestion (&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-13/&quot;&gt;Part 13&lt;/a&gt;), and all &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;maintenance operations&lt;/a&gt;. Most Iceberg production pipelines use Spark for data ingestion because of its extensive connector ecosystem (Kafka, JDBC, file formats) and its ability to process large volumes across a distributed cluster.&lt;/p&gt;
&lt;p&gt;Spark supports all Iceberg operations: CREATE, INSERT, MERGE, DELETE, UPDATE, schema evolution, partition evolution, and every maintenance procedure (compaction, snapshot expiry, orphan cleanup).&lt;/p&gt;
&lt;h3&gt;Trino&lt;/h3&gt;
&lt;p&gt;Trino (formerly PrestoSQL) is optimized for interactive, ad-hoc queries with low latency. It reads and writes Iceberg tables and supports the REST catalog protocol. Trino is popular for exploration and dashboarding workloads where sub-second response times matter and data is being read rather than written. Its architecture keeps no persistent state, making it easy to scale up and down based on query demand.&lt;/p&gt;
&lt;h3&gt;Other Engines&lt;/h3&gt;
&lt;p&gt;Several other engines provide Iceberg support: AWS Athena (serverless, AWS-native), Snowflake (read-only for external Iceberg tables), StarRocks (sub-second analytics), and Doris (real-time analytics). The Iceberg community maintains a &lt;a href=&quot;https://iceberg.apache.org/multi-engine-support/&quot;&gt;compatibility matrix&lt;/a&gt; showing which engines support which operations.&lt;/p&gt;
&lt;h3&gt;Choosing the Right Approach&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Recommended&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Quick analysis of a table subset&lt;/td&gt;
&lt;td&gt;PyIceberg or DuckDB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production dashboards and reports&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://www.dremio.com/platform/&quot;&gt;Dremio&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch ETL pipelines&lt;/td&gt;
&lt;td&gt;Spark&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interactive data exploration&lt;/td&gt;
&lt;td&gt;Trino or Dremio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ML feature extraction&lt;/td&gt;
&lt;td&gt;PyIceberg + pandas&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-source analytics&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://www.dremio.com/platform/federation/&quot;&gt;Dremio federation&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Serverless AWS queries&lt;/td&gt;
&lt;td&gt;Athena&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The key takeaway: Python libraries (PyIceberg, DuckDB, Polars) are best for local analysis and development. MPP engines (Dremio, Spark, Trino) are necessary for production-scale analytics. Many teams use both: PyIceberg for data science experimentation, and Dremio for production dashboards and governed access.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-13/&quot;&gt;Part 13&lt;/a&gt; covers how to stream data into Iceberg tables.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Free Resources&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpageiceberg&quot;&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpagepolaris&quot;&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://forms.gle/xdsun6JiRvFY9rB36&quot;&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Apache Iceberg Metadata Tables: Querying the Internals</title><link>https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-11/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-11/</guid><description>
## Apache Iceberg Masterclass - Table of Contents

1. [What Are Table Formats and Why Were They Needed?](/posts/2026-04-29-iceberg-masterclass-01/)
2...</description><pubDate>Wed, 29 Apr 2026 12:10:00 GMT</pubDate><content:encoded>&lt;h2&gt;Apache Iceberg Masterclass - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-02/&quot;&gt;The Metadata Structure of Modern Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-03/&quot;&gt;Performance and Apache Iceberg&apos;s Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-04/&quot;&gt;Partition Evolution: Change Your Partitioning Without Rewriting Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-05/&quot;&gt;Hidden Partitioning: How Iceberg Eliminates Accidental Full Table Scans&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-06/&quot;&gt;Writing to an Apache Iceberg Table: How Commits and ACID Actually Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-07/&quot;&gt;What Are Lakehouse Catalogs? The Role of Catalogs in Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-08/&quot;&gt;When Catalogs Are Embedded in Storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-09/&quot;&gt;How Data Lake Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;Maintaining Apache Iceberg Tables: Compaction, Expiry, and Cleanup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-11/&quot;&gt;Apache Iceberg Metadata Tables: Querying the Internals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-12/&quot;&gt;Using Apache Iceberg with Python and MPP Query Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-13/&quot;&gt;Approaches to Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-14/&quot;&gt;Hands-On with Apache Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-15/&quot;&gt;Migrating to Apache Iceberg: Strategies for Every Source System&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 11 of a 15-part &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;Part 10&lt;/a&gt; covered maintenance operations. This article covers the metadata tables that let you inspect Iceberg table internals using standard SQL.&lt;/p&gt;
&lt;p&gt;Iceberg exposes its internal metadata as queryable virtual tables. You can use them to check table health, debug performance issues, audit changes, and build monitoring dashboards. No special tools required, just SQL.&lt;/p&gt;
&lt;h2&gt;The Seven Metadata Tables&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/metadata-tables-overview.png&quot; alt=&quot;The seven Iceberg metadata tables and what each reveals about your table&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Snapshots&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;$snapshots&lt;/code&gt; table lists every snapshot in the table&apos;s history. Each row represents a committed transaction.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Dremio syntax
SELECT * FROM TABLE(table_snapshot(&apos;analytics.orders&apos;))

-- Spark syntax
SELECT * FROM analytics.orders.snapshots
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key columns: &lt;code&gt;snapshot_id&lt;/code&gt;, &lt;code&gt;committed_at&lt;/code&gt;, &lt;code&gt;operation&lt;/code&gt; (append, overwrite, delete), &lt;code&gt;summary&lt;/code&gt; (files added/removed counts).&lt;/p&gt;
&lt;h3&gt;History&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;$history&lt;/code&gt; table shows the timeline of which snapshot was current at each point in time.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM TABLE(table_history(&apos;analytics.orders&apos;))
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Files&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;$files&lt;/code&gt; table lists every data file in the current snapshot with detailed statistics.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT file_path, file_size_in_bytes, record_count, partition
FROM TABLE(table_files(&apos;analytics.orders&apos;))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the primary diagnostic table for checking &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-09/&quot;&gt;file sizes&lt;/a&gt; and identifying the small file problem.&lt;/p&gt;
&lt;h3&gt;Manifests&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;$manifests&lt;/code&gt; table lists the manifest files for the current snapshot.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT path, length, added_data_files_count, existing_data_files_count
FROM TABLE(table_manifests(&apos;analytics.orders&apos;))
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Partitions&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;$partitions&lt;/code&gt; table provides statistics per partition: row counts, file counts, and size.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT partition, record_count, file_count
FROM TABLE(table_partitions(&apos;analytics.orders&apos;))
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Practical Use Cases&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/metadata-use-cases.png&quot; alt=&quot;Three categories of metadata table use cases: monitoring, debugging, and auditing&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Monitoring: Average File Size&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  AVG(file_size_in_bytes) / 1048576 AS avg_file_mb,
  MIN(file_size_in_bytes) / 1048576 AS min_file_mb,
  COUNT(*) AS total_files
FROM TABLE(table_files(&apos;analytics.orders&apos;))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If &lt;code&gt;avg_file_mb&lt;/code&gt; drops below 64, schedule compaction.&lt;/p&gt;
&lt;h3&gt;Debugging: Files Per Partition&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT partition, COUNT(*) AS files, SUM(record_count) AS rows
FROM TABLE(table_files(&apos;analytics.orders&apos;))
GROUP BY partition
ORDER BY files DESC
LIMIT 20
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Partitions with hundreds of files are compaction candidates. Use this query as a daily health check and pipe the results into your monitoring system.&lt;/p&gt;
&lt;h3&gt;Debugging: Sort Order Effectiveness&lt;/h3&gt;
&lt;p&gt;Column statistics in the files table reveal whether your sort order is effective:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  file_path,
  lower_bounds[&apos;customer_id&apos;] AS min_customer_id,
  upper_bounds[&apos;customer_id&apos;] AS max_customer_id
FROM TABLE(table_files(&apos;analytics.orders&apos;))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the min/max ranges overlap heavily across files, the sort order has decayed and compaction with sorting (&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;Part 10&lt;/a&gt;) will restore effectiveness.&lt;/p&gt;
&lt;h3&gt;Monitoring: Commit Velocity&lt;/h3&gt;
&lt;p&gt;Track how frequently the table is being written to:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  DATE_TRUNC(&apos;hour&apos;, committed_at) AS hour,
  COUNT(*) AS commits,
  SUM(CAST(summary[&apos;added-data-files&apos;] AS INT)) AS files_added
FROM TABLE(table_snapshot(&apos;analytics.orders&apos;))
WHERE committed_at &amp;gt; CURRENT_TIMESTAMP - INTERVAL &apos;24&apos; HOUR
GROUP BY DATE_TRUNC(&apos;hour&apos;, committed_at)
ORDER BY hour
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;High commit velocity (hundreds of commits per hour) indicates a &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-13/&quot;&gt;streaming workload&lt;/a&gt; that needs aggressive compaction.&lt;/p&gt;
&lt;h3&gt;Auditing: Recent Changes&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT committed_at, operation, summary
FROM TABLE(table_snapshot(&apos;analytics.orders&apos;))
ORDER BY committed_at DESC
LIMIT 10
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This shows the last 10 operations: how many files were added or removed per commit.&lt;/p&gt;
&lt;h2&gt;Time Travel&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/time-travel-metadata.png&quot; alt=&quot;How snapshots enable querying the table at any point in its history&quot;&gt;&lt;/p&gt;
&lt;p&gt;Metadata tables enable time travel queries. Use the snapshot list to find the snapshot ID for a specific point in time, then query the table at that snapshot:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query the table as it existed on February 15
SELECT * FROM analytics.orders
AT SNAPSHOT &apos;1234567890123456789&apos;

-- Or by timestamp
SELECT * FROM analytics.orders
AT TIMESTAMP &apos;2024-02-15 00:00:00&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Time travel is useful for debugging data issues (&amp;quot;what did this table look like before yesterday&apos;s pipeline ran?&amp;quot;), auditing (&amp;quot;what was the account balance at end-of-quarter?&amp;quot;), and reproducible analysis (&amp;quot;run this report against last month&apos;s data&amp;quot;).&lt;/p&gt;
&lt;h3&gt;Incremental Reads&lt;/h3&gt;
&lt;p&gt;Metadata tables also enable incremental processing. By comparing two snapshots, you can identify which files were added between them and process only the new data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Find files added in the last snapshot
SELECT file_path, record_count
FROM TABLE(table_files(&apos;analytics.orders&apos;))
WHERE file_path NOT IN (
  SELECT file_path FROM TABLE(table_files(&apos;analytics.orders&apos;))
  AT SNAPSHOT &apos;1234567890&apos;
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This pattern is the foundation for CDC (Change Data Capture) on Iceberg tables: read only what changed since the last processing run, rather than re-scanning the entire table.&lt;/p&gt;
&lt;h3&gt;Rollback&lt;/h3&gt;
&lt;p&gt;If a bad write corrupts your table, use the snapshot list to rollback:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Find the last good snapshot
SELECT snapshot_id, committed_at, operation
FROM TABLE(table_snapshot(&apos;analytics.orders&apos;))
ORDER BY committed_at DESC

-- Rollback to it (Spark)
CALL system.rollback_to_snapshot(&apos;analytics.orders&apos;, 1234567890)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Rollback does not delete data. It simply changes the current snapshot pointer to an earlier snapshot, making the table appear as it was at that point. The rolled-back data files remain in storage for potential recovery.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://docs.dremio.com/cloud/sonar/query-manage/querying-metadata/&quot;&gt;Dremio&lt;/a&gt; supports all Iceberg metadata table queries through its TABLE() function syntax and provides time travel in both SQL and its semantic layer.&lt;/p&gt;
&lt;h2&gt;Building a Health Dashboard&lt;/h2&gt;
&lt;p&gt;Combine metadata table queries into a scheduled monitoring job:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Table health summary
SELECT
  (SELECT COUNT(*) FROM TABLE(table_snapshot(&apos;analytics.orders&apos;))) AS snapshots,
  (SELECT COUNT(*) FROM TABLE(table_files(&apos;analytics.orders&apos;))) AS files,
  (SELECT AVG(file_size_in_bytes)/1048576 FROM TABLE(table_files(&apos;analytics.orders&apos;))) AS avg_mb,
  (SELECT COUNT(*) FROM TABLE(table_manifests(&apos;analytics.orders&apos;))) AS manifests
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Set alerts when snapshots exceed 1,000, average file size drops below 64 MB, or manifest count exceeds 500.&lt;/p&gt;
&lt;h3&gt;Engine Syntax Variations&lt;/h3&gt;
&lt;p&gt;Different engines use different syntax for metadata tables:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;Syntax&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dremio&lt;/td&gt;
&lt;td&gt;&lt;code&gt;TABLE(table_files(&apos;db.table&apos;))&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spark&lt;/td&gt;
&lt;td&gt;&lt;code&gt;db.table.files&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trino&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;quot;db&amp;quot;.&amp;quot;table$files&amp;quot;&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flink&lt;/td&gt;
&lt;td&gt;&lt;code&gt;table$files&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The underlying data is identical; only the SQL syntax differs. Regardless of which engine you use, these metadata tables are the key diagnostic tool for understanding and maintaining Iceberg table health.&lt;/p&gt;
&lt;h3&gt;Automating Decisions with Metadata&lt;/h3&gt;
&lt;p&gt;You can use metadata table queries to drive automated maintenance decisions. For example, a scheduler can check whether compaction is needed before running it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Only compact if average file size is below threshold
SELECT CASE
  WHEN AVG(file_size_in_bytes) / 1048576 &amp;lt; 64 THEN &apos;COMPACT_NEEDED&apos;
  ELSE &apos;HEALTHY&apos;
END AS table_status
FROM TABLE(table_files(&apos;analytics.orders&apos;))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This avoids running compaction on tables that are already well-organized, saving compute costs and preventing unnecessary data rewrites.&lt;/p&gt;
&lt;p&gt;For production environments, integrate these checks into your orchestration tool (Airflow, Dagster, Prefect). Schedule a daily metadata scan across all tables, collect the health metrics, and trigger maintenance jobs only for tables that need them. This approach scales to hundreds of tables without manual oversight. &lt;a href=&quot;https://www.dremio.com/blog/table-optimization-in-dremio/&quot;&gt;Dremio&apos;s autonomous optimization&lt;/a&gt; automates this entire workflow for tables managed by Open Catalog.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-12/&quot;&gt;Part 12&lt;/a&gt; covers using Iceberg from Python and MPP query engines.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Free Resources&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpageiceberg&quot;&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpagepolaris&quot;&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://forms.gle/xdsun6JiRvFY9rB36&quot;&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Maintaining Apache Iceberg Tables: Compaction, Expiry, and Cleanup</title><link>https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-10/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-10/</guid><description>
## Apache Iceberg Masterclass - Table of Contents

1. [What Are Table Formats and Why Were They Needed?](/posts/2026-04-29-iceberg-masterclass-01/)
2...</description><pubDate>Wed, 29 Apr 2026 12:09:00 GMT</pubDate><content:encoded>&lt;h2&gt;Apache Iceberg Masterclass - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-02/&quot;&gt;The Metadata Structure of Modern Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-03/&quot;&gt;Performance and Apache Iceberg&apos;s Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-04/&quot;&gt;Partition Evolution: Change Your Partitioning Without Rewriting Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-05/&quot;&gt;Hidden Partitioning: How Iceberg Eliminates Accidental Full Table Scans&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-06/&quot;&gt;Writing to an Apache Iceberg Table: How Commits and ACID Actually Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-07/&quot;&gt;What Are Lakehouse Catalogs? The Role of Catalogs in Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-08/&quot;&gt;When Catalogs Are Embedded in Storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-09/&quot;&gt;How Data Lake Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;Maintaining Apache Iceberg Tables: Compaction, Expiry, and Cleanup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-11/&quot;&gt;Apache Iceberg Metadata Tables: Querying the Internals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-12/&quot;&gt;Using Apache Iceberg with Python and MPP Query Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-13/&quot;&gt;Approaches to Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-14/&quot;&gt;Hands-On with Apache Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-15/&quot;&gt;Migrating to Apache Iceberg: Strategies for Every Source System&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 10 of a 15-part &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-09/&quot;&gt;Part 9&lt;/a&gt; covered how tables degrade. This article covers the four maintenance operations that keep Iceberg tables healthy and the three approaches to running them.&lt;/p&gt;
&lt;h2&gt;The Four Maintenance Operations&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/maintenance-operations.png&quot; alt=&quot;The four Iceberg maintenance operations: compaction, snapshot expiry, orphan cleanup, and manifest rewriting&quot;&gt;&lt;/p&gt;
&lt;h3&gt;1. Compaction (File Rewriting)&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/compaction-before-after.png&quot; alt=&quot;Compaction merging 500 small files into 2 large files with identical data&quot;&gt;&lt;/p&gt;
&lt;p&gt;Compaction reads small files, merges them into optimally-sized files (128-512 MB), and optionally re-sorts the data. It is the most impactful maintenance operation because it directly addresses the &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-09/&quot;&gt;small file problem&lt;/a&gt; and restores sort order effectiveness.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;In Spark:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CALL system.rewrite_data_files(&apos;analytics.orders&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;In &lt;a href=&quot;https://www.dremio.com/blog/compaction-in-apache-iceberg-fine-tuning-your-iceberg-tables-data-files/&quot;&gt;Dremio&lt;/a&gt;:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;OPTIMIZE TABLE analytics.orders REWRITE DATA USING BIN_PACK
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Compaction with sorting rewrites files so that column values are ordered, tightening the min/max statistics and making &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-03/&quot;&gt;file skipping&lt;/a&gt; far more effective:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;OPTIMIZE TABLE analytics.orders REWRITE DATA USING SORT (order_date, customer_id)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Snapshot Expiry&lt;/h3&gt;
&lt;p&gt;Snapshot expiry removes old snapshots from the metadata. After expiry, the snapshot and its exclusive data files are eligible for cleanup. You typically retain snapshots for a window (e.g., 7 days) to support time travel, then expire everything older.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Spark
CALL system.expire_snapshots(&apos;analytics.orders&apos;, TIMESTAMP &apos;2024-04-22 00:00:00&apos;)

-- Dremio
ALTER TABLE analytics.orders EXPIRE SNAPSHOTS OLDER_THAN = &apos;2024-04-22 00:00:00&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Orphan File Cleanup&lt;/h3&gt;
&lt;p&gt;After snapshots are expired, the data files they exclusively referenced become orphans. Orphan cleanup scans the storage directory, compares files against the current metadata, and deletes files that are not referenced by any snapshot.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Spark
CALL system.remove_orphan_files(&apos;analytics.orders&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This operation should run after snapshot expiry and with a safety delay (e.g., files older than 3 days) to avoid deleting files from in-progress writes.&lt;/p&gt;
&lt;p&gt;Running orphan cleanup too aggressively can delete files from long-running write operations. A 3-day safety window ensures that any write operation has had time to complete before its files are considered orphans.&lt;/p&gt;
&lt;h3&gt;4. Manifest Rewriting&lt;/h3&gt;
&lt;p&gt;Over many commits, manifests accumulate. A single snapshot&apos;s manifest list might reference hundreds of small manifests from individual commits. Manifest rewriting consolidates them into fewer, larger manifests.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Spark
CALL system.rewrite_manifests(&apos;analytics.orders&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This speeds up scan planning because the engine reads fewer manifest files. Each manifest file requires a separate I/O operation to read, so reducing the count from 500 to 20 eliminates 480 I/O round trips during query planning.&lt;/p&gt;
&lt;h3&gt;Sort-Order Compaction&lt;/h3&gt;
&lt;p&gt;Standard compaction (BIN_PACK) merges small files without changing the data order. Sort-order compaction rewrites files with data sorted by specified columns, which tightens the min/max statistics and makes &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-03/&quot;&gt;file skipping&lt;/a&gt; more effective:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Dremio sort-order compaction
OPTIMIZE TABLE analytics.orders REWRITE DATA USING SORT (order_date, customer_id)

-- Spark sort-order compaction
CALL system.rewrite_data_files(
  table =&amp;gt; &apos;analytics.orders&apos;,
  strategy =&amp;gt; &apos;sort&apos;,
  sort_order =&amp;gt; &apos;order_date ASC NULLS LAST, customer_id ASC NULLS LAST&apos;
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Sort-order compaction is more expensive than BIN_PACK because it reads, sorts, and rewrites all data. However, the performance improvement for queries that filter on the sorted columns is substantial: file skipping can eliminate 90%+ of data files when the sort columns match common query filters.&lt;/p&gt;
&lt;h3&gt;Data Retention Policies&lt;/h3&gt;
&lt;p&gt;Decide how long to keep historical data accessible through time travel:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Retention Need&lt;/th&gt;
&lt;th&gt;Recommended Snapshot Retention&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Debugging recent issues&lt;/td&gt;
&lt;td&gt;7 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly reporting compliance&lt;/td&gt;
&lt;td&gt;30 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regulatory audit requirements&lt;/td&gt;
&lt;td&gt;90+ days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage cost optimization&lt;/td&gt;
&lt;td&gt;3-5 days&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Longer retention means more snapshots, more metadata, and more storage consumed by old data files. Shorter retention reduces costs but limits time travel capabilities.&lt;/p&gt;
&lt;h2&gt;Three Approaches to Maintenance&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/auto-vs-manual-maintenance.png&quot; alt=&quot;Comparison of automated versus manual maintenance approaches&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Manual (Scheduled Jobs)&lt;/h3&gt;
&lt;p&gt;Run maintenance operations on a schedule using Spark, Trino, or Dremio. A typical pattern:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Run compaction daily for heavily-written tables&lt;/li&gt;
&lt;li&gt;Expire snapshots older than 7 days&lt;/li&gt;
&lt;li&gt;Remove orphan files older than 3 days&lt;/li&gt;
&lt;li&gt;Rewrite manifests monthly&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Full control over timing and configuration. &lt;strong&gt;Cons:&lt;/strong&gt; Requires operational effort; forgotten or broken jobs lead to degradation.&lt;/p&gt;
&lt;h3&gt;Semi-Automated (Scheduled with Monitoring)&lt;/h3&gt;
&lt;p&gt;Build a monitoring layer that checks table health metrics (&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-09/&quot;&gt;Part 9&lt;/a&gt; diagnostics) and triggers maintenance only when thresholds are exceeded (e.g., average file size drops below 64 MB).&lt;/p&gt;
&lt;h3&gt;Fully Automated&lt;/h3&gt;
&lt;p&gt;Use a platform that handles maintenance autonomously. &lt;a href=&quot;https://www.dremio.com/blog/table-optimization-in-dremio/&quot;&gt;Dremio&apos;s automatic table optimization&lt;/a&gt; runs compaction, expiry, and cleanup for tables managed by Open Catalog without any user configuration. AWS &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-08/&quot;&gt;S3 Tables&lt;/a&gt; provides built-in compaction.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Effort&lt;/th&gt;
&lt;th&gt;Risk&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;High (can forget)&lt;/td&gt;
&lt;td&gt;Full control needs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semi-Automated&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Custom thresholds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fully Automated&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Most production tables&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Recommended Maintenance Schedule&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Frequency&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Compaction&lt;/td&gt;
&lt;td&gt;Daily (heavy tables), weekly (light)&lt;/td&gt;
&lt;td&gt;Trigger when avg file size &amp;lt; 64 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snapshot expiry&lt;/td&gt;
&lt;td&gt;Daily&lt;/td&gt;
&lt;td&gt;Retain 7-30 days for time travel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orphan cleanup&lt;/td&gt;
&lt;td&gt;Weekly&lt;/td&gt;
&lt;td&gt;Safety delay of 3+ days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manifest rewrite&lt;/td&gt;
&lt;td&gt;Monthly&lt;/td&gt;
&lt;td&gt;When manifest count &amp;gt; 500&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For most teams, starting with &lt;a href=&quot;https://www.dremio.com/platform/reflections/&quot;&gt;Dremio&apos;s autonomous optimization&lt;/a&gt; and only adding manual jobs for tables with unusual requirements is the most practical approach.&lt;/p&gt;
&lt;h2&gt;Common Maintenance Pitfalls&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Running compaction during peak query hours:&lt;/strong&gt; Compaction reads and rewrites data files, which competes with analytical queries for I/O bandwidth. Schedule compaction during off-peak hours, or use a separate compute cluster (Spark on EMR) that does not share resources with your query engine.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Expiring snapshots too aggressively:&lt;/strong&gt; If you expire snapshots while a long-running query is using one of them, the query can fail because the data files it needs might be cleaned up. Always keep snapshots for at least as long as your longest-running query.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Forgetting orphan cleanup:&lt;/strong&gt; Many teams run compaction and snapshot expiry but forget orphan cleanup. Without it, compacted and expired data files accumulate indefinitely. Set up orphan cleanup as a weekly job with a 3-day safety window.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Not monitoring after migration:&lt;/strong&gt; Tables migrated from Hive or other formats (&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-15/&quot;&gt;Part 15&lt;/a&gt;) often inherit poor file layouts. Run an immediate compaction pass after any in-place migration.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-11/&quot;&gt;Part 11&lt;/a&gt; covers how to query the metadata tables that power diagnostics.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Free Resources&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpageiceberg&quot;&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpagepolaris&quot;&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://forms.gle/xdsun6JiRvFY9rB36&quot;&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Concurrency, Isolation, and MVCC: How Engines Handle Contention</title><link>https://iceberglakehouse.com/posts/2026-04-29-query-engine-10/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-query-engine-10/</guid><description>
## Query Engine Optimization - Table of Contents

1. [How Query Engines Think: The Tradeoffs Behind Every Data System](/posts/2026-04-29-query-engine...</description><pubDate>Wed, 29 Apr 2026 12:09:00 GMT</pubDate><content:encoded>&lt;h2&gt;Query Engine Optimization - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-01/&quot;&gt;How Query Engines Think: The Tradeoffs Behind Every Data System&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-02/&quot;&gt;Row vs. Column: How Storage Layout Shapes Everything&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-03/&quot;&gt;How Databases Organize Data on Disk: Pages, Blocks, and File Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-04/&quot;&gt;B-Trees, LSM Trees, and the Indexing Tradeoff Spectrum&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-05/&quot;&gt;Inside the Query Optimizer: How Engines Pick a Plan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-06/&quot;&gt;Volcano, Vectorized, Compiled: How Engines Execute Your Query&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-07/&quot;&gt;Buffer Pools, Caches, and the Memory Hierarchy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-08/&quot;&gt;Partitioning, Sharding, and Data Distribution Strategies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-09/&quot;&gt;Hash, Sort-Merge, Broadcast: How Distributed Joins Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-10/&quot;&gt;Concurrency, Isolation, and MVCC: How Engines Handle Contention&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 10 of a 10-part series on query engine design. &lt;a href=&quot;/posts/2026-04-29-query-engine-09/&quot;&gt;Part 9&lt;/a&gt; covered distributed joins. This final article covers how engines handle the inevitable conflict when multiple users read and write the same data simultaneously.&lt;/p&gt;
&lt;p&gt;Every production database serves multiple concurrent users. Without concurrency control, simultaneous reads and writes produce corrupted data, inconsistent query results, or both. The question is not whether to control concurrency, but how much control to impose and what performance to sacrifice for it.&lt;/p&gt;
&lt;h2&gt;The Core Problem&lt;/h2&gt;
&lt;p&gt;Consider two transactions running simultaneously:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Transaction A&lt;/strong&gt; reads a customer&apos;s balance (currently $500) and subtracts $100.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Transaction B&lt;/strong&gt; reads the same balance ($500) and subtracts $200.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Without concurrency control, both transactions read $500, compute their results independently, and write back. Transaction A writes $400. Transaction B overwrites it with $300. The correct result ($200) is never produced. This is a lost update, and it destroys data integrity.&lt;/p&gt;
&lt;h2&gt;Two-Phase Locking (2PL)&lt;/h2&gt;
&lt;p&gt;The oldest approach is locking. Two-Phase Locking enforces a simple rule: a transaction must acquire all the locks it needs before releasing any of them.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/locking-vs-mvcc.png&quot; alt=&quot;Two-Phase Locking versus MVCC showing how locking blocks readers while MVCC allows concurrent access&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Shared locks&lt;/strong&gt; allow multiple readers but block writers. &lt;strong&gt;Exclusive locks&lt;/strong&gt; block both readers and writers. When Transaction B tries to write a row that Transaction A holds a shared lock on, Transaction B waits until A releases the lock.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;: Correctness is straightforward. If you hold the lock, no one else can interfere.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;: Readers block writers. Writers block readers. Under high concurrency, transactions spend more time waiting for locks than doing useful work. &lt;strong&gt;Deadlocks&lt;/strong&gt; arise when two transactions each hold a lock the other needs. The engine must detect the cycle and abort one transaction.&lt;/p&gt;
&lt;p&gt;MySQL/InnoDB uses row-level locking for write operations. SQL Server uses lock escalation (row to page to table) when too many individual locks are held. Both systems also implement MVCC to reduce reader-writer conflicts.&lt;/p&gt;
&lt;h2&gt;MVCC: Readers Never Block&lt;/h2&gt;
&lt;p&gt;Multi-Version Concurrency Control solves the reader-writer conflict by keeping multiple versions of each row. Writers create new versions instead of overwriting the current one. Readers see the version that was current when their transaction started.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/mvcc-version-chain.png&quot; alt=&quot;MVCC version chain showing three versions of the same row with different transactions seeing different versions&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Each transaction gets a snapshot identifier when it starts (typically a transaction ID or timestamp).&lt;/li&gt;
&lt;li&gt;When a transaction reads a row, the engine walks the version chain and returns the most recent version that was committed before the transaction&apos;s snapshot.&lt;/li&gt;
&lt;li&gt;When a transaction writes a row, it creates a new version. The old version remains available for transactions that started earlier.&lt;/li&gt;
&lt;li&gt;A background garbage collection process (PostgreSQL calls it VACUUM) removes old versions that no transaction can see anymore.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;The key property&lt;/strong&gt;: Readers never block and are never blocked. A long-running analytical query sees a consistent snapshot of the entire database as it existed at the moment the query started, even if other transactions commit changes during execution.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;: Version storage consumes space. PostgreSQL stores old versions in the heap table itself, requiring VACUUM to reclaim space. If VACUUM falls behind, the table bloats and performance degrades. Oracle and MySQL/InnoDB store old versions in a separate undo log, which is cleaner but adds complexity.&lt;/p&gt;
&lt;p&gt;PostgreSQL, Oracle, MySQL/InnoDB, SQL Server, CockroachDB, DuckDB, Snowflake, and Dremio all use MVCC. It is the dominant concurrency control mechanism in modern databases.&lt;/p&gt;
&lt;h2&gt;Isolation Levels&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/isolation-levels.png&quot; alt=&quot;Isolation level spectrum from Read Uncommitted (weakest, fastest) to Serializable (strongest, slowest)&quot;&gt;&lt;/p&gt;
&lt;p&gt;The SQL standard defines four isolation levels that control what anomalies a transaction can observe:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Prevents&lt;/th&gt;
&lt;th&gt;Allows&lt;/th&gt;
&lt;th&gt;Performance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Read Uncommitted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Nothing&lt;/td&gt;
&lt;td&gt;Dirty reads, non-repeatable reads, phantoms&lt;/td&gt;
&lt;td&gt;Fastest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Read Committed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dirty reads&lt;/td&gt;
&lt;td&gt;Non-repeatable reads, phantoms&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Repeatable Read&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dirty reads, non-repeatable reads&lt;/td&gt;
&lt;td&gt;Phantoms (in some systems)&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Serializable&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;All anomalies&lt;/td&gt;
&lt;td&gt;Nothing&lt;/td&gt;
&lt;td&gt;Slowest&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Dirty read&lt;/strong&gt;: Transaction A sees uncommitted changes from Transaction B. If B rolls back, A has acted on data that never existed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Non-repeatable read&lt;/strong&gt;: Transaction A reads a row, Transaction B modifies and commits it, Transaction A reads the same row and gets a different value.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Phantom&lt;/strong&gt;: Transaction A runs a query with a range condition, Transaction B inserts a new row matching that condition and commits, Transaction A re-runs the query and gets an extra row.&lt;/p&gt;
&lt;p&gt;Most production systems default to &lt;strong&gt;Read Committed&lt;/strong&gt; (PostgreSQL, Oracle, SQL Server) or &lt;strong&gt;Repeatable Read&lt;/strong&gt; (MySQL/InnoDB). &lt;strong&gt;Serializable&lt;/strong&gt; provides the strongest guarantees but at the highest cost: either through strict two-phase locking (which reduces concurrency) or serializable snapshot isolation (which detects conflicts and aborts transactions).&lt;/p&gt;
&lt;h2&gt;Optimistic Concurrency Control (OCC)&lt;/h2&gt;
&lt;p&gt;OCC takes the opposite approach from locking: assume conflicts are rare and do not acquire locks during the transaction. Instead, the transaction reads and writes freely, then checks for conflicts at commit time.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Read phase&lt;/strong&gt;: The transaction executes all reads and writes in a local workspace.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Validation phase&lt;/strong&gt;: At commit time, the engine checks whether any data the transaction read was modified by another committed transaction since it started.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write phase&lt;/strong&gt;: If validation passes, the changes are written permanently. If not, the transaction is aborted and must retry.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;: No lock contention during execution. If conflicts are truly rare, OCC achieves high throughput because transactions never wait.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;: If conflicts are frequent, transactions are repeatedly aborted and retried, wasting all the work done before validation. OCC works well when contention is low and transactions are short.&lt;/p&gt;
&lt;p&gt;CockroachDB and TiDB use forms of optimistic concurrency control. Google Spanner uses a hybrid approach.&lt;/p&gt;
&lt;h2&gt;How Lakehouse Table Formats Handle Concurrency&lt;/h2&gt;
&lt;p&gt;Apache Iceberg, Delta Lake, and Apache Hudi take a fundamentally different approach to concurrency because they operate on immutable files in object storage rather than mutable database pages.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Iceberg&apos;s approach&lt;/strong&gt;: Writes produce new data files and new metadata files (manifests, snapshot). The commit is an atomic pointer swap of the metadata file. Concurrent writers that do not conflict (e.g., inserting into different partitions) both succeed via optimistic concurrency with retry. Conflicting writes (e.g., both deleting the same rows) are detected at commit time and one writer retries.&lt;/p&gt;
&lt;p&gt;Readers always see a consistent snapshot because they read from a fixed snapshot pointer. There is no locking, no blocking, and no VACUUM needed. Old snapshots and their data files are cleaned up by an explicit expire_snapshots operation.&lt;/p&gt;
&lt;p&gt;This model is why lakehouse engines like Dremio, Spark, and Trino can run long analytical queries concurrently with ongoing data ingestion without any interference. The reader sees the snapshot that existed when the query started; the writer creates a new snapshot that future queries will see.&lt;/p&gt;
&lt;h2&gt;Where Real Systems Land&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Primary Mechanism&lt;/th&gt;
&lt;th&gt;Default Isolation&lt;/th&gt;
&lt;th&gt;Write Conflicts&lt;/th&gt;
&lt;th&gt;Garbage Collection&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;MVCC (heap-stored versions)&lt;/td&gt;
&lt;td&gt;Read Committed&lt;/td&gt;
&lt;td&gt;Row-level locking&lt;/td&gt;
&lt;td&gt;VACUUM (autovacuum)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MySQL/InnoDB&lt;/td&gt;
&lt;td&gt;MVCC (undo log) + row locks&lt;/td&gt;
&lt;td&gt;Repeatable Read&lt;/td&gt;
&lt;td&gt;Row-level locking&lt;/td&gt;
&lt;td&gt;Purge thread&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Oracle&lt;/td&gt;
&lt;td&gt;MVCC (undo tablespace)&lt;/td&gt;
&lt;td&gt;Read Committed&lt;/td&gt;
&lt;td&gt;Row-level locking&lt;/td&gt;
&lt;td&gt;Automatic undo management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CockroachDB&lt;/td&gt;
&lt;td&gt;MVCC + OCC&lt;/td&gt;
&lt;td&gt;Serializable&lt;/td&gt;
&lt;td&gt;Optimistic with retry&lt;/td&gt;
&lt;td&gt;GC job&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DuckDB&lt;/td&gt;
&lt;td&gt;MVCC&lt;/td&gt;
&lt;td&gt;Snapshot&lt;/td&gt;
&lt;td&gt;Single-writer lock&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snowflake&lt;/td&gt;
&lt;td&gt;MVCC (micro-partition versioning)&lt;/td&gt;
&lt;td&gt;Read Committed&lt;/td&gt;
&lt;td&gt;Automatic conflict detection&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dremio + Iceberg&lt;/td&gt;
&lt;td&gt;Snapshot isolation (immutable files)&lt;/td&gt;
&lt;td&gt;Snapshot&lt;/td&gt;
&lt;td&gt;Optimistic commit with retry&lt;/td&gt;
&lt;td&gt;expire_snapshots&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spark + Delta Lake&lt;/td&gt;
&lt;td&gt;Optimistic concurrency (transaction log)&lt;/td&gt;
&lt;td&gt;Snapshot / Serializable&lt;/td&gt;
&lt;td&gt;Conflict detection at commit&lt;/td&gt;
&lt;td&gt;VACUUM&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;The Fundamental Tradeoff&lt;/h2&gt;
&lt;p&gt;Every concurrency control mechanism trades throughput for correctness guarantees:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Stronger isolation&lt;/strong&gt; (Serializable, strict locking) prevents more anomalies but reduces the number of transactions that can run concurrently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Weaker isolation&lt;/strong&gt; (Read Committed, optimistic) allows more concurrent transactions but permits anomalies that application code must handle.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MVCC with snapshot isolation&lt;/strong&gt; provides a pragmatic middle ground: readers never block, writers are serialized on conflicting rows, and the only anomaly permitted (write skew) is rare in most applications.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Most analytical engines (Dremio, Snowflake, BigQuery, DuckDB) default to snapshot isolation because analytical workloads are read-heavy with infrequent writes. The readers-never-block property of MVCC is exactly what long-running analytical queries need.&lt;/p&gt;
&lt;p&gt;There is no single best concurrency control strategy. The right choice depends on your ratio of reads to writes, the frequency of conflicts, and how much application complexity you are willing to accept.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How Data Lake Table Storage Degrades Over Time</title><link>https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-09/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-09/</guid><description>
## Apache Iceberg Masterclass - Table of Contents

1. [What Are Table Formats and Why Were They Needed?](/posts/2026-04-29-iceberg-masterclass-01/)
2...</description><pubDate>Wed, 29 Apr 2026 12:08:00 GMT</pubDate><content:encoded>&lt;h2&gt;Apache Iceberg Masterclass - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-02/&quot;&gt;The Metadata Structure of Modern Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-03/&quot;&gt;Performance and Apache Iceberg&apos;s Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-04/&quot;&gt;Partition Evolution: Change Your Partitioning Without Rewriting Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-05/&quot;&gt;Hidden Partitioning: How Iceberg Eliminates Accidental Full Table Scans&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-06/&quot;&gt;Writing to an Apache Iceberg Table: How Commits and ACID Actually Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-07/&quot;&gt;What Are Lakehouse Catalogs? The Role of Catalogs in Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-08/&quot;&gt;When Catalogs Are Embedded in Storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-09/&quot;&gt;How Data Lake Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;Maintaining Apache Iceberg Tables: Compaction, Expiry, and Cleanup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-11/&quot;&gt;Apache Iceberg Metadata Tables: Querying the Internals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-12/&quot;&gt;Using Apache Iceberg with Python and MPP Query Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-13/&quot;&gt;Approaches to Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-14/&quot;&gt;Hands-On with Apache Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-15/&quot;&gt;Migrating to Apache Iceberg: Strategies for Every Source System&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 9 of a 15-part &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-08/&quot;&gt;Part 8&lt;/a&gt; covered embedded catalogs. This article explains the five ways Iceberg table storage degrades and how to detect each problem before it impacts query performance.&lt;/p&gt;
&lt;p&gt;An Iceberg table that works well on day one will not work well on day 365 without maintenance. Every append, update, and delete operation adds files and metadata. Without periodic cleanup and reorganization, query performance gradually deteriorates until someone notices that a dashboard that used to load in 2 seconds now takes 30.&lt;/p&gt;
&lt;h2&gt;Five Types of Degradation&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/storage-degradation-timeline.png&quot; alt=&quot;The five ways Iceberg table storage degrades over time, from small files to partition skew&quot;&gt;&lt;/p&gt;
&lt;h3&gt;1. The Small File Problem&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/small-file-problem.png&quot; alt=&quot;The small file problem comparing a healthy table with large files to a degraded table with thousands of tiny files&quot;&gt;&lt;/p&gt;
&lt;p&gt;This is the most common and most impactful degradation. Streaming ingestion, micro-batch pipelines, and frequent INSERT operations each create new data files. If these operations produce many small files (under 32 MB), the table accumulates thousands of files where dozens would suffice.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; Each file becomes a manifest entry. A table with 10,000 small files has 10,000 entries that the query planner must evaluate, compared to 40 entries for the same data in properly-sized 256 MB files. Planning time increases linearly with file count.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cause:&lt;/strong&gt; Frequent commits with small amounts of data. A streaming pipeline committing every 30 seconds might add 2-3 files per commit, producing 5,000+ files per day.&lt;/p&gt;
&lt;h3&gt;2. Orphan Files&lt;/h3&gt;
&lt;p&gt;Orphan files are data files that exist in storage but are not referenced by any current or retained snapshot. They accumulate from:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Failed writes:&lt;/strong&gt; A write that crashes after creating data files but before committing (&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-06/&quot;&gt;Part 6&lt;/a&gt;) leaves orphan files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Expired snapshots:&lt;/strong&gt; When snapshots are expired, the metadata references are removed, but the underlying data files remain in storage until explicitly cleaned up.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compaction:&lt;/strong&gt; When &lt;a href=&quot;https://www.dremio.com/blog/compaction-in-apache-iceberg-fine-tuning-your-iceberg-tables-data-files/&quot;&gt;compaction&lt;/a&gt; merges files, the old files become orphans after their snapshots are expired.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; Orphan files waste storage space and money. A heavily-written table can accumulate terabytes of orphan files over months. In one common scenario, a daily batch pipeline writing 50 GB per day with weekly compaction can produce 350 GB of orphan files every week. Without cleanup, this costs thousands of dollars annually in storage fees alone.&lt;/p&gt;
&lt;h3&gt;3. Metadata Bloat&lt;/h3&gt;
&lt;p&gt;Every commit creates a new snapshot in &lt;code&gt;metadata.json&lt;/code&gt;. Over time, the metadata file grows as the snapshot list lengthens. The manifest list for each snapshot may also reference many manifest files, especially if the table has been modified in many different partitions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; The &lt;code&gt;metadata.json&lt;/code&gt; file becomes large, taking longer to download from object storage. At 10,000+ snapshots, the metadata file itself can exceed 100 MB, adding seconds to every query&apos;s planning phase. The manifest list grows, making scan planning slower because there are more manifests to evaluate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How to detect it:&lt;/strong&gt; Check the snapshot count using &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-11/&quot;&gt;metadata tables&lt;/a&gt;. If it exceeds 1,000, configure snapshot expiry to keep the count manageable.&lt;/p&gt;
&lt;h3&gt;4. Sort Order Decay&lt;/h3&gt;
&lt;p&gt;If a table has a declared sort order (e.g., sorted by &lt;code&gt;customer_id&lt;/code&gt; for efficient lookups), new data written by different engines or pipelines may not respect this sort order. Over time, the min/max statistics per file widen as new unsorted data is mixed with sorted data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; File skipping becomes less effective. As described in &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-03/&quot;&gt;Part 3&lt;/a&gt;, tight min/max ranges enable file pruning. Wide ranges mean no files can be skipped. A well-sorted table might skip 95% of files for a filtered query, while the same table with decayed sort order might skip only 10%.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How to fix it:&lt;/strong&gt; Run &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;compaction with sorting&lt;/a&gt; to rewrite files in the correct order and restore tight min/max ranges.&lt;/p&gt;
&lt;h3&gt;5. Partition Skew&lt;/h3&gt;
&lt;p&gt;Some partitions grow much larger than others. An event table partitioned by &lt;code&gt;day(event_time)&lt;/code&gt; might have 10 GB on a normal day but 500 GB during a promotional event. The oversized partition contains files that are too large or too numerous for efficient processing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; Queries against skewed partitions are slower because they must process disproportionately more data. Parallel execution becomes unbalanced when one partition&apos;s task takes 50x longer than the others.&lt;/p&gt;
&lt;h2&gt;Real-World Degradation Timeline&lt;/h2&gt;
&lt;p&gt;Consider a table receiving 100 small appends per day from a streaming pipeline:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Day 1:&lt;/strong&gt; 100 small files (3 MB each), 300 MB total. Queries are fast.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Day 30:&lt;/strong&gt; 3,000 small files, 9 GB total. Query planning starts to slow noticeably.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Day 90:&lt;/strong&gt; 9,000 small files, 27 GB total. Every query scans all 9,000 manifest entries. Dashboard queries that took 2 seconds now take 15 seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Day 180:&lt;/strong&gt; 18,000 small files plus thousands of orphan files from expired snapshots. Metadata file is 50+ MB. Planning alone takes 10 seconds before any data is read.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Without compaction, the table becomes nearly unusable for interactive analytics within 6 months. With daily compaction, the same table stays at 40-50 well-sized files regardless of how many commits happen each day.&lt;/p&gt;
&lt;h2&gt;How to Diagnose Table Health&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/health-diagnosis-checklist.png&quot; alt=&quot;Checklist for diagnosing Iceberg table health using metadata table queries&quot;&gt;&lt;/p&gt;
&lt;p&gt;Iceberg provides &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-11/&quot;&gt;metadata tables&lt;/a&gt; that let you inspect table health. Here are the key diagnostic queries:&lt;/p&gt;
&lt;h3&gt;Check File Sizes (Dremio / Spark)&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Average file size
SELECT AVG(file_size_in_bytes) / 1024 / 1024 AS avg_mb
FROM TABLE(table_files(&apos;analytics.orders&apos;))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If average file size is below 32 MB, you have a small file problem. Target: 128-512 MB.&lt;/p&gt;
&lt;h3&gt;Check Snapshot Count&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- How many snapshots exist?
SELECT COUNT(*) AS snapshot_count
FROM TABLE(table_snapshot(&apos;analytics.orders&apos;))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If snapshot count exceeds 1,000, you should expire older snapshots.&lt;/p&gt;
&lt;h3&gt;Check File Count Growth&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Files per partition
SELECT partition, COUNT(*) AS file_count
FROM TABLE(table_files(&apos;analytics.orders&apos;))
GROUP BY partition
ORDER BY file_count DESC
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Partitions with hundreds of files are candidates for compaction.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://docs.dremio.com/cloud/sonar/query-manage/querying-metadata/&quot;&gt;Dremio&lt;/a&gt; supports all Iceberg metadata table queries and provides a SQL interface for monitoring table health.&lt;/p&gt;
&lt;h2&gt;The Maintenance Imperative&lt;/h2&gt;
&lt;p&gt;Every Iceberg table in production needs maintenance. The question is not whether to maintain tables but how: manually, through scheduled jobs, or through automated services. &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;Part 10&lt;/a&gt; covers all three approaches in detail.&lt;/p&gt;
&lt;p&gt;The cost of not maintaining Iceberg tables is both direct (wasted storage from orphan files) and indirect (slow queries leading to poor user experience, excessive cloud compute costs from reading unnecessary data). Organizations with hundreds of Iceberg tables often find that a single data engineer dedicated to table maintenance saves more in compute and storage costs than their salary. Automated maintenance through &lt;a href=&quot;https://www.dremio.com/blog/table-optimization-in-dremio/&quot;&gt;Dremio&lt;/a&gt; or S3 Tables removes this operational burden entirely.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Free Resources&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpageiceberg&quot;&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpagepolaris&quot;&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://forms.gle/xdsun6JiRvFY9rB36&quot;&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Hash, Sort-Merge, Broadcast: How Distributed Joins Work</title><link>https://iceberglakehouse.com/posts/2026-04-29-query-engine-09/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-query-engine-09/</guid><description>
## Query Engine Optimization - Table of Contents

1. [How Query Engines Think: The Tradeoffs Behind Every Data System](/posts/2026-04-29-query-engine...</description><pubDate>Wed, 29 Apr 2026 12:08:00 GMT</pubDate><content:encoded>&lt;h2&gt;Query Engine Optimization - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-01/&quot;&gt;How Query Engines Think: The Tradeoffs Behind Every Data System&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-02/&quot;&gt;Row vs. Column: How Storage Layout Shapes Everything&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-03/&quot;&gt;How Databases Organize Data on Disk: Pages, Blocks, and File Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-04/&quot;&gt;B-Trees, LSM Trees, and the Indexing Tradeoff Spectrum&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-05/&quot;&gt;Inside the Query Optimizer: How Engines Pick a Plan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-06/&quot;&gt;Volcano, Vectorized, Compiled: How Engines Execute Your Query&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-07/&quot;&gt;Buffer Pools, Caches, and the Memory Hierarchy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-08/&quot;&gt;Partitioning, Sharding, and Data Distribution Strategies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-09/&quot;&gt;Hash, Sort-Merge, Broadcast: How Distributed Joins Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-10/&quot;&gt;Concurrency, Isolation, and MVCC: How Engines Handle Contention&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 9 of a 10-part series on query engine design. &lt;a href=&quot;/posts/2026-04-29-query-engine-08/&quot;&gt;Part 8&lt;/a&gt; covered partitioning. This article covers the most expensive operation in distributed query processing: joining two tables whose data lives on different nodes.&lt;/p&gt;
&lt;p&gt;In a single-node database, a join is a CPU-bound operation. In a distributed engine, it becomes a network-bound operation because the data for matching rows may live on different machines. The choice of join strategy determines how much data moves across the network, and network I/O is typically the bottleneck in distributed query execution.&lt;/p&gt;
&lt;h2&gt;The Fundamental Problem&lt;/h2&gt;
&lt;p&gt;To join two tables on a key, matching rows must end up on the same compute node. If Table A&apos;s row with &lt;code&gt;customer_id = 42&lt;/code&gt; is on Node 1 and Table B&apos;s row with &lt;code&gt;customer_id = 42&lt;/code&gt; is on Node 3, one of those rows must move before the join can happen.&lt;/p&gt;
&lt;p&gt;Distributed engines have three strategies for solving this: shuffle both tables, broadcast the small one, or pre-arrange the data so matching keys are already co-located.&lt;/p&gt;
&lt;h2&gt;Shuffle Join: Redistribute Both Sides&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/shuffle-join.png&quot; alt=&quot;Shuffle join redistributing both tables by join key hash so matching keys land on the same node&quot;&gt;&lt;/p&gt;
&lt;p&gt;The shuffle join (also called repartition join or hash-exchange join) is the default strategy for joining two large tables. Both tables are re-hashed by the join key and redistributed across the cluster so that all rows with the same join key value land on the same destination node.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Each node reads its local portion of Table A, hashes each row&apos;s join key, and sends it to the appropriate destination node.&lt;/li&gt;
&lt;li&gt;Each node does the same for Table B.&lt;/li&gt;
&lt;li&gt;Each destination node now has all matching rows from both tables and performs a local join.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;When it is used&lt;/strong&gt;: Two large tables where neither is small enough to broadcast. This is the most common scenario for analytical joins (fact-to-fact joins, large table self-joins).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The cost&lt;/strong&gt;: Every row of both tables is sent over the network. For two 100 GB tables, the shuffle moves up to 200 GB of data across the cluster. Network bandwidth becomes the bottleneck.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Used by&lt;/strong&gt;: Spark (default shuffle join), Dremio, Snowflake, Trino, BigQuery, Redshift. Every distributed analytical engine implements shuffle joins.&lt;/p&gt;
&lt;h2&gt;Broadcast Join: Copy the Small Side&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/broadcast-join.png&quot; alt=&quot;Broadcast join copying the small dimension table to every compute node while the large fact table stays in place&quot;&gt;&lt;/p&gt;
&lt;p&gt;When one side of the join is small enough to fit in memory on each node, broadcasting it is far cheaper than shuffling both sides.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The small table (the &amp;quot;build side&amp;quot;) is read in full and sent to every node in the cluster.&lt;/li&gt;
&lt;li&gt;Each node builds an in-memory hash table from the broadcast data.&lt;/li&gt;
&lt;li&gt;Each node scans its local portion of the large table and probes the hash table for matches.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;When it is used&lt;/strong&gt;: Fact-to-dimension joins where the dimension table is small (typically under a few hundred MB). A 10 MB dimension table broadcast to 100 nodes costs 1 GB of network transfer. Shuffling both a 10 MB table and a 100 GB table would cost over 100 GB of transfer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The risk&lt;/strong&gt;: If the optimizer incorrectly estimates the small table&apos;s size and broadcasts a table that is actually large, every node receives a huge copy. This can cause out-of-memory errors and is one of the most common performance disasters in distributed engines.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Used by&lt;/strong&gt;: Spark (broadcast hint or auto-broadcast threshold), Dremio (automatic broadcast decisions), Snowflake (automatic), Trino (broadcast join), BigQuery (automatic).&lt;/p&gt;
&lt;h2&gt;Co-Located Join: No Data Movement&lt;/h2&gt;
&lt;p&gt;The fastest distributed join is one where no data moves at all. If both tables are already partitioned by the join key with the same number of partitions, matching keys are guaranteed to be on the same node. Each node performs a local join independently.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When it is used&lt;/strong&gt;: Both tables were deliberately bucketed/partitioned by the same key. This requires planning at data load time. In Spark, this means both tables are bucketed by the join key into the same number of buckets. In Iceberg, this means both tables are partitioned with matching partition transforms.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The tradeoff&lt;/strong&gt;: You are locking in a specific physical layout at write time to benefit one join pattern. Queries that join on a different key do not benefit and still require shuffles. This is a deliberate investment in one access pattern at the cost of flexibility.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Used by&lt;/strong&gt;: Spark (bucket joins), Hive (sort-merge bucket joins), Dremio (co-located joins on matching Iceberg partitions).&lt;/p&gt;
&lt;h2&gt;Local Join Algorithms: Hash vs. Sort-Merge&lt;/h2&gt;
&lt;p&gt;Once matching rows are on the same node (via shuffle, broadcast, or co-location), the engine performs a local join using one of two algorithms.&lt;/p&gt;
&lt;h3&gt;Hash Join&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/hash-join-mechanics.png&quot; alt=&quot;Hash join build and probe phases showing hash table construction from the smaller table and probing from the larger table&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Build phase&lt;/strong&gt;: Read the smaller table and insert each row into an in-memory hash table, keyed by the join column.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Probe phase&lt;/strong&gt;: Read the larger table row by row. For each row, hash the join key and look up the matching bucket in the hash table. Emit matching pairs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Complexity&lt;/strong&gt;: O(N + M) where N and M are the sizes of the two tables. Linear time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;: Fast when the build side fits in memory. No sorting required.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;: If the build side is too large for memory, the engine must use a grace hash join or hybrid hash join that spills partitions to disk, significantly increasing I/O.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Used by&lt;/strong&gt;: Every analytical engine (Dremio, Spark, Snowflake, DuckDB, ClickHouse, Trino, Redshift, BigQuery) defaults to hash joins for equi-joins.&lt;/p&gt;
&lt;h3&gt;Sort-Merge Join&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Sort phase&lt;/strong&gt;: Sort both tables by the join key. If either table is already sorted (from an index or previous sort operation), this phase is skipped.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Merge phase&lt;/strong&gt;: Walk through both sorted tables simultaneously. When join keys match, emit the pair. When one side is ahead, advance the other.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Complexity&lt;/strong&gt;: O(N log N + M log M) for sorting, plus O(N + M) for merging.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;: Handles very large datasets gracefully because sorting can use external sort (spill to disk in sorted runs). Does not require the build side to fit in memory. If the data is already sorted, the merge phase alone is O(N + M) with no memory pressure.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;: Sorting is expensive. For unsorted data, hash join is almost always faster.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Used by&lt;/strong&gt;: PostgreSQL (merge join), Spark (sort-merge join, the default when tables are large), Oracle (sort-merge join option). Dremio and Snowflake generally prefer hash joins and fall back to sort-merge when memory is constrained.&lt;/p&gt;
&lt;h2&gt;How Optimizers Choose&lt;/h2&gt;
&lt;p&gt;The optimizer selects a join strategy based on table sizes, data distribution, and available resources:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;One side is small (&amp;lt; broadcast threshold)&lt;/td&gt;
&lt;td&gt;Dimension table &amp;lt; 100 MB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Broadcast join&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Both sides are large, not co-located&lt;/td&gt;
&lt;td&gt;Fact-to-fact join&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Shuffle + hash join&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Both sides bucketed by join key&lt;/td&gt;
&lt;td&gt;Pre-planned layout&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Co-located join&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory constrained, large tables&lt;/td&gt;
&lt;td&gt;Hash table spills&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Shuffle + sort-merge join&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data is skewed&lt;/td&gt;
&lt;td&gt;One join key dominates&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Skew-aware shuffle&lt;/strong&gt; (split hot keys)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Spark&apos;s Adaptive Query Execution can change this decision at runtime. If a shuffle reveals that one side is actually small, AQE converts the shuffle join to a broadcast join mid-flight. Dremio makes similar adaptive decisions based on runtime statistics.&lt;/p&gt;
&lt;h2&gt;The Network Bottleneck&lt;/h2&gt;
&lt;p&gt;In distributed analytics, the network is usually the bottleneck for join-heavy queries. A rough hierarchy of join costs:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Co-located join&lt;/strong&gt;: 0 bytes transferred. Free.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Broadcast join (small side)&lt;/strong&gt;: Small table size x number of nodes. Cheap.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shuffle join&lt;/strong&gt;: Both table sizes transferred. Expensive.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Broadcast join (large side, mistaken)&lt;/strong&gt;: Large table size x number of nodes. Catastrophic.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is why data engineers spend time on partitioning and bucketing strategies: every byte that does not need to move across the network is a byte that does not cost time.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>When Catalogs Are Embedded in Storage</title><link>https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-08/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-08/</guid><description>
## Apache Iceberg Masterclass - Table of Contents

1. [What Are Table Formats and Why Were They Needed?](/posts/2026-04-29-iceberg-masterclass-01/)
2...</description><pubDate>Wed, 29 Apr 2026 12:07:00 GMT</pubDate><content:encoded>&lt;h2&gt;Apache Iceberg Masterclass - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-02/&quot;&gt;The Metadata Structure of Modern Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-03/&quot;&gt;Performance and Apache Iceberg&apos;s Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-04/&quot;&gt;Partition Evolution: Change Your Partitioning Without Rewriting Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-05/&quot;&gt;Hidden Partitioning: How Iceberg Eliminates Accidental Full Table Scans&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-06/&quot;&gt;Writing to an Apache Iceberg Table: How Commits and ACID Actually Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-07/&quot;&gt;What Are Lakehouse Catalogs? The Role of Catalogs in Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-08/&quot;&gt;When Catalogs Are Embedded in Storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-09/&quot;&gt;How Data Lake Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;Maintaining Apache Iceberg Tables: Compaction, Expiry, and Cleanup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-11/&quot;&gt;Apache Iceberg Metadata Tables: Querying the Internals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-12/&quot;&gt;Using Apache Iceberg with Python and MPP Query Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-13/&quot;&gt;Approaches to Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-14/&quot;&gt;Hands-On with Apache Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-15/&quot;&gt;Migrating to Apache Iceberg: Strategies for Every Source System&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 8 of a 15-part &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-07/&quot;&gt;Part 7&lt;/a&gt; covered the traditional catalog landscape. This article examines a newer approach: embedding the catalog directly inside the storage layer.&lt;/p&gt;
&lt;p&gt;Traditional Iceberg architectures have three components: the query engine, a standalone catalog, and object storage. Embedded catalogs collapse the catalog into the storage layer itself, reducing the number of services to manage while providing built-in table maintenance.&lt;/p&gt;
&lt;h2&gt;The Embedded Catalog Model&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/embedded-vs-standalone.png&quot; alt=&quot;Standalone catalogs versus embedded catalogs showing how the architecture simplifies&quot;&gt;&lt;/p&gt;
&lt;p&gt;In a traditional setup, a separate catalog service (Polaris, Glue, Nessie) runs alongside object storage. The engine talks to the catalog to get metadata pointers, then reads data from storage. Two services, two sets of credentials, two operational concerns.&lt;/p&gt;
&lt;p&gt;In an embedded model, the storage service itself manages Iceberg metadata. When you create a table, the storage system creates the metadata files internally and handles atomic commits, compaction, and snapshot management. The engine interacts with a single endpoint that serves both catalog operations and data access.&lt;/p&gt;
&lt;h2&gt;AWS S3 Tables&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/s3-tables-architecture.png&quot; alt=&quot;S3 Tables architecture showing the built-in Iceberg catalog with automatic compaction&quot;&gt;&lt;/p&gt;
&lt;p&gt;AWS launched S3 Tables in late 2024 as a new S3 bucket type designed specifically for Iceberg tables. When you create an S3 table bucket, AWS manages the Iceberg catalog internally.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; You create tables through the S3 Tables API or through engines like Athena and EMR. S3 Tables stores the Iceberg metadata alongside the data in the same bucket, handling the catalog pointer, manifest management, and atomic commits behind the scenes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Built-in maintenance:&lt;/strong&gt; S3 Tables runs automatic compaction in the background, merging small files into optimally-sized ones without any user configuration. It also handles snapshot expiry and orphan file cleanup. This eliminates one of the biggest operational burdens of Iceberg (covered in &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;Part 10&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Access via REST API:&lt;/strong&gt; S3 Tables exposes tables through a REST-catalog-compatible interface. &lt;a href=&quot;https://www.dremio.com/blog/getting-hands-on-with-s3-tables-from-dremio/&quot;&gt;Dremio&lt;/a&gt;, Spark, Trino, and other engines that support the Iceberg REST catalog can connect to S3 Tables directly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Built-in lifecycle management:&lt;/strong&gt; Beyond compaction, S3 Tables handles the entire table maintenance lifecycle. Snapshot expiry happens automatically based on configurable retention policies. Orphan files are cleaned up without user intervention. For teams that do not want to manage &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;maintenance schedules&lt;/a&gt;, this is a significant operational advantage.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Limitations:&lt;/strong&gt; S3 Tables is AWS-only. Tables are stored exclusively in S3 and cannot be moved to other cloud providers without migration. Cross-engine governance is limited to what AWS IAM provides. If you need fine-grained access control beyond IAM policies (column-level masking, row-level filters), you need a standalone catalog layer on top.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cost model:&lt;/strong&gt; S3 Tables uses a different pricing model than standard S3. Storage and request costs are similar, but the built-in maintenance operations (compaction, expiry) are included in the service price. Compare this to running Spark compaction jobs on EMR, which adds compute costs on top of storage.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table bucket vs. general-purpose bucket:&lt;/strong&gt; S3 Tables uses a new &amp;quot;table bucket&amp;quot; type, separate from standard S3 buckets. You cannot mix table data with other objects in a table bucket, and standard S3 operations (ls, cp, rm) do not work on table bucket contents. All interaction goes through the S3 Tables API or through Iceberg-compatible engines.&lt;/p&gt;
&lt;h2&gt;MinIO AI Stor&lt;/h2&gt;
&lt;p&gt;MinIO AI Stor takes a similar approach for on-premises and private cloud deployments. MinIO, the leading S3-compatible object storage system, embeds Iceberg catalog functionality directly into the storage layer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; MinIO manages Iceberg table metadata as part of its storage operations. When data is written, MinIO handles the catalog updates, file tracking, and maintenance internally.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key differentiator:&lt;/strong&gt; MinIO is designed for on-premises deployments and private clouds, making it the embedded catalog option for organizations that cannot use public cloud services. It also integrates vector storage capabilities for AI workloads alongside Iceberg tables.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;S3 compatibility:&lt;/strong&gt; Because MinIO implements the S3 API, engines that work with S3 (Spark, Trino, &lt;a href=&quot;https://www.dremio.com/platform/&quot;&gt;Dremio&lt;/a&gt;) can interact with MinIO-managed Iceberg tables with minimal configuration changes. This makes it a drop-in replacement for S3 in on-premises environments.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;GPU-accelerated analytics:&lt;/strong&gt; MinIO AI Stor integrates with GPU-aware processing frameworks, enabling direct analytics on Iceberg data without moving it to a separate compute layer. This is relevant for organizations running AI/ML workloads alongside traditional analytics.&lt;/p&gt;
&lt;h2&gt;When Embedded Catalogs Make Sense&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/embedded-decision-tree.png&quot; alt=&quot;Decision tree for choosing between embedded and standalone catalogs&quot;&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS-only, want minimal ops&lt;/td&gt;
&lt;td&gt;S3 Tables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;On-premises, private cloud&lt;/td&gt;
&lt;td&gt;MinIO AI Stor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-cloud portability needed&lt;/td&gt;
&lt;td&gt;Standalone catalog (&lt;a href=&quot;https://www.dremio.com/platform/open-catalog/&quot;&gt;Dremio Open Catalog&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-engine governance needed&lt;/td&gt;
&lt;td&gt;Standalone catalog (&lt;a href=&quot;https://www.dremio.com/blog/the-polaris-catalog-what-it-is-and-getting-started/&quot;&gt;Polaris&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multiple storage systems&lt;/td&gt;
&lt;td&gt;Standalone catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single storage, simple setup&lt;/td&gt;
&lt;td&gt;Embedded catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Embedded catalogs are the right choice when you have a single storage system and want to minimize operational complexity. They trade flexibility for simplicity.&lt;/p&gt;
&lt;p&gt;Standalone catalogs remain the better choice when you need multi-cloud support, cross-engine governance, or the ability to query data across multiple storage systems through &lt;a href=&quot;https://www.dremio.com/platform/federation/&quot;&gt;federation&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;The Hybrid Approach&lt;/h2&gt;
&lt;p&gt;Many organizations use both. An embedded catalog handles the storage-managed tables (S3 Tables for their AWS data), while a standalone catalog like &lt;a href=&quot;https://www.dremio.com/platform/open-catalog/&quot;&gt;Dremio Open Catalog&lt;/a&gt; provides a unified view across all data sources. Dremio can connect to S3 Tables, AWS Glue tables, and standalone catalog tables simultaneously, presenting them all through a single semantic layer.&lt;/p&gt;
&lt;p&gt;This hybrid approach lets you pick the simplest catalog for each use case while maintaining a unified analytics experience.&lt;/p&gt;
&lt;h2&gt;Operational Planning for Embedded Catalogs&lt;/h2&gt;
&lt;p&gt;When adopting an embedded catalog, plan for these considerations:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Vendor dependency:&lt;/strong&gt; An embedded catalog ties your tables to the storage vendor&apos;s lifecycle. If the vendor changes pricing, deprecates features, or discontinues the product, migrating away requires converting all tables to a different catalog. With a standalone catalog, switching storage providers only requires changing the storage configuration.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Monitoring limitations:&lt;/strong&gt; Embedded catalogs provide limited visibility into their internal maintenance operations. You cannot inspect the compaction schedule, tune the target file size, or monitor orphan cleanup progress as precisely as you can with manual maintenance via Spark procedures.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cross-region access:&lt;/strong&gt; Embedded catalogs are scoped to a storage region. If your analytics workloads run in a different region than your storage, the embedded catalog adds cross-region latency. A standalone catalog can be deployed in the same region as your compute for lower latency.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Integration testing:&lt;/strong&gt; Before committing to an embedded catalog for production, test your full query stack (dashboards, notebooks, scheduled pipelines) against the embedded catalog endpoint. Verify that your engines handle the catalog&apos;s REST API implementation correctly, as there can be subtle differences between implementations.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-09/&quot;&gt;Part 9&lt;/a&gt; covers how table storage degrades over time and why maintenance matters regardless of which catalog you use.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Free Resources&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpageiceberg&quot;&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpagepolaris&quot;&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://forms.gle/xdsun6JiRvFY9rB36&quot;&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Partitioning, Sharding, and Data Distribution Strategies</title><link>https://iceberglakehouse.com/posts/2026-04-29-query-engine-08/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-query-engine-08/</guid><description>
## Query Engine Optimization - Table of Contents

1. [How Query Engines Think: The Tradeoffs Behind Every Data System](/posts/2026-04-29-query-engine...</description><pubDate>Wed, 29 Apr 2026 12:07:00 GMT</pubDate><content:encoded>&lt;h2&gt;Query Engine Optimization - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-01/&quot;&gt;How Query Engines Think: The Tradeoffs Behind Every Data System&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-02/&quot;&gt;Row vs. Column: How Storage Layout Shapes Everything&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-03/&quot;&gt;How Databases Organize Data on Disk: Pages, Blocks, and File Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-04/&quot;&gt;B-Trees, LSM Trees, and the Indexing Tradeoff Spectrum&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-05/&quot;&gt;Inside the Query Optimizer: How Engines Pick a Plan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-06/&quot;&gt;Volcano, Vectorized, Compiled: How Engines Execute Your Query&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-07/&quot;&gt;Buffer Pools, Caches, and the Memory Hierarchy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-08/&quot;&gt;Partitioning, Sharding, and Data Distribution Strategies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-09/&quot;&gt;Hash, Sort-Merge, Broadcast: How Distributed Joins Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-10/&quot;&gt;Concurrency, Isolation, and MVCC: How Engines Handle Contention&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 8 of a 10-part series on query engine design. &lt;a href=&quot;/posts/2026-04-29-query-engine-07/&quot;&gt;Part 7&lt;/a&gt; covered memory management. This article covers how engines divide data across files, disks, or cluster nodes to enable parallel processing and reduce the amount of data each query must touch.&lt;/p&gt;
&lt;p&gt;Partitioning answers a simple question: when a table has billions of rows, how do you avoid scanning all of them for every query? The answer is to divide the data into smaller, independent chunks and skip the chunks that cannot contain relevant data.&lt;/p&gt;
&lt;h2&gt;Hash Partitioning&lt;/h2&gt;
&lt;p&gt;Hash partitioning applies a hash function to a partition key and assigns each row to a bucket based on the hash value. With 4 partitions and a hash of &lt;code&gt;customer_id&lt;/code&gt;, rows are distributed as &lt;code&gt;hash(customer_id) % 4&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/partition-types.png&quot; alt=&quot;Hash, range, and list partitioning strategies showing how the same data is distributed differently&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;: Even data distribution regardless of key distribution. No hotspots unless the hash function is poorly chosen. Good for point lookups on the partition key (the engine hashes the lookup value and checks only the matching partition).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;: Range scans are expensive. A query like &lt;code&gt;WHERE customer_id BETWEEN 1000 AND 2000&lt;/code&gt; must check all partitions because the hash function scatters sequential keys across buckets. Adding or removing partitions requires re-hashing and redistributing most of the data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Used by&lt;/strong&gt;: CockroachDB (hash-based ranges), Cassandra (consistent hashing), DynamoDB (hash partitions), Spark (default shuffle partitioning), Dremio (hash distribution for distributed execution).&lt;/p&gt;
&lt;h2&gt;Range Partitioning&lt;/h2&gt;
&lt;p&gt;Range partitioning divides the key space into contiguous ranges. Each partition owns a specific range of values. A date-partitioned table might have one partition per month: all January 2024 data in one partition, all February 2024 data in another.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;: Range scans on the partition key are fast because the engine reads only the partitions whose ranges overlap the query filter. Time-based queries on date-partitioned tables scan only the relevant months. This is the most common partitioning strategy for analytical data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;: Susceptible to data skew. If most orders arrive in December, the December partition is much larger than June. Susceptible to write hotspots: in a time-partitioned table, all current writes go to the latest partition.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Used by&lt;/strong&gt;: PostgreSQL (native table partitioning), Hive (directory-based partitioning by date/region), Apache Iceberg (partition transforms including year, month, day, hour), BigQuery (partitioned tables by date), Dremio (reads and writes Iceberg partitioned tables).&lt;/p&gt;
&lt;h2&gt;List Partitioning&lt;/h2&gt;
&lt;p&gt;List partitioning assigns specific discrete values to specific partitions. A &lt;code&gt;region&lt;/code&gt; column might map &lt;code&gt;US&lt;/code&gt; to partition 1, &lt;code&gt;EU&lt;/code&gt; to partition 2, &lt;code&gt;APAC&lt;/code&gt; to partition 3.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;: Queries filtering on the partition column skip all other partitions. Data is grouped by business-meaningful categories.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;: Uneven distribution if some values have far more rows than others. New values require manually creating or updating partition definitions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Used by&lt;/strong&gt;: PostgreSQL (list partitioning), Oracle (list partitioning), MySQL (list partitioning).&lt;/p&gt;
&lt;h2&gt;Partition Pruning: The Primary Performance Win&lt;/h2&gt;
&lt;p&gt;The biggest performance benefit of partitioning is not parallelism. It is pruning: the optimizer&apos;s ability to skip partitions that cannot contain matching data.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/partition-pruning.png&quot; alt=&quot;Partition pruning showing 5 of 6 monthly partitions skipped when filtering by March&quot;&gt;&lt;/p&gt;
&lt;p&gt;A table partitioned by month with a query &lt;code&gt;WHERE month = &apos;Mar&apos;&lt;/code&gt; scans only the March partition and skips the other five. That is 83% less I/O with zero changes to the query. For a table partitioned into 365 daily partitions, a query on one day skips 99.7% of the data.&lt;/p&gt;
&lt;p&gt;Partition pruning is supported by PostgreSQL, Spark, Dremio, Snowflake, BigQuery, Hive, Trino, and essentially every analytical engine. It is often the single largest performance optimization available for large tables.&lt;/p&gt;
&lt;p&gt;Apache Iceberg improves on traditional partitioning with &lt;strong&gt;hidden partitioning&lt;/strong&gt;: the partition values are derived from data columns using transforms (year, month, day, hour, truncate, bucket) and stored in manifest metadata. Users write queries using the original columns (&lt;code&gt;WHERE order_date &amp;gt; &apos;2024-03-01&apos;&lt;/code&gt;) and the engine automatically prunes based on the partition structure without users needing to know the physical layout.&lt;/p&gt;
&lt;h2&gt;Bucketing and Clustering&lt;/h2&gt;
&lt;p&gt;Bucketing (Hive, Spark) and clustering (BigQuery, Snowflake, Dremio) go beyond partitioning by organizing data within partitions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Bucketing&lt;/strong&gt; hashes data by a key into a fixed number of buckets within each partition. If two tables are bucketed by the same key into the same number of buckets, they can be joined without a shuffle because matching keys are guaranteed to be in the same bucket. This is called a bucket join or sort-merge bucket join.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Clustering&lt;/strong&gt; sorts data within files by a designated column. This makes zone maps (min/max statistics) more effective because sorting clusters similar values together, narrowing the min/max range per file. A file where &lt;code&gt;customer_id&lt;/code&gt; ranges from 1 to 1,000,000 has a useless zone map for selective filters. A file where &lt;code&gt;customer_id&lt;/code&gt; ranges from 500 to 600 will be skipped by any filter outside that range.&lt;/p&gt;
&lt;p&gt;Dremio automates clustering through its table optimization jobs. When Dremio compacts an Iceberg table, it sorts the data by frequently filtered columns, tightening the min/max ranges and improving subsequent query pruning without manual intervention.&lt;/p&gt;
&lt;h2&gt;The Data Skew Problem&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/data-skew-problem.png&quot; alt=&quot;The data skew problem showing one partition with 10x more data creating a straggler node&quot;&gt;&lt;/p&gt;
&lt;p&gt;Partitioning assumes that data is distributed somewhat evenly across partitions. When it is not, one partition becomes a bottleneck.&lt;/p&gt;
&lt;p&gt;In a distributed engine, query time equals the time of the slowest node. If one node processes 500M rows while three others process 25M each, those three nodes sit idle waiting for the straggler. The cluster&apos;s effective throughput drops to one-quarter of its capacity.&lt;/p&gt;
&lt;p&gt;Skew arises from several sources:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Natural data distribution&lt;/strong&gt;: A few customers generate most of the orders. A few products account for most of the sales.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time-based partitioning&lt;/strong&gt;: Recent partitions have more data than old ones in growing systems.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Key cardinality&lt;/strong&gt;: Partitioning by a low-cardinality column (status, region) creates few large partitions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Mitigation strategies&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Salting&lt;/strong&gt;: Add a random component to the partition key to spread hot keys across multiple partitions. Queries must then scan all salt values.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Adaptive partition splitting&lt;/strong&gt;: Spark AQE detects skewed partitions during shuffle and splits them automatically.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Composite partitioning&lt;/strong&gt;: Partition by date at the top level and hash within each date partition.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dynamic resource allocation&lt;/strong&gt;: Some cloud engines allocate more compute to larger partitions.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Where Real Systems Land&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Partitioning Strategy&lt;/th&gt;
&lt;th&gt;Pruning&lt;/th&gt;
&lt;th&gt;Clustering&lt;/th&gt;
&lt;th&gt;Skew Handling&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;Range, list, hash (native)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Manual (CLUSTER command)&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spark&lt;/td&gt;
&lt;td&gt;Hash (shuffle), range (sort)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Bucketing&lt;/td&gt;
&lt;td&gt;AQE skew join&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hive&lt;/td&gt;
&lt;td&gt;Directory-based (date, region)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Bucketing&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dremio&lt;/td&gt;
&lt;td&gt;Iceberg hidden partitioning&lt;/td&gt;
&lt;td&gt;Yes (manifest-level)&lt;/td&gt;
&lt;td&gt;Automatic (table optimization)&lt;/td&gt;
&lt;td&gt;Adaptive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snowflake&lt;/td&gt;
&lt;td&gt;Micro-partitions (auto)&lt;/td&gt;
&lt;td&gt;Yes (pruning via metadata)&lt;/td&gt;
&lt;td&gt;Clustering keys&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BigQuery&lt;/td&gt;
&lt;td&gt;Date/integer range, ingestion time&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Clustering columns&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cassandra&lt;/td&gt;
&lt;td&gt;Consistent hash (partition key)&lt;/td&gt;
&lt;td&gt;Token-range pruning&lt;/td&gt;
&lt;td&gt;Clustering columns (within partition)&lt;/td&gt;
&lt;td&gt;Virtual nodes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The tradeoff in every partitioning decision is specificity vs. flexibility. The more you optimize the partition scheme for one query pattern (e.g., filter by date), the worse other patterns become (e.g., filter by customer). Choosing the right partition key requires understanding which queries dominate your workload.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>What Are Lakehouse Catalogs? The Role of Catalogs in Apache Iceberg</title><link>https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-07/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-07/</guid><description>
## Apache Iceberg Masterclass - Table of Contents

1. [What Are Table Formats and Why Were They Needed?](/posts/2026-04-29-iceberg-masterclass-01/)
2...</description><pubDate>Wed, 29 Apr 2026 12:06:00 GMT</pubDate><content:encoded>&lt;h2&gt;Apache Iceberg Masterclass - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-02/&quot;&gt;The Metadata Structure of Modern Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-03/&quot;&gt;Performance and Apache Iceberg&apos;s Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-04/&quot;&gt;Partition Evolution: Change Your Partitioning Without Rewriting Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-05/&quot;&gt;Hidden Partitioning: How Iceberg Eliminates Accidental Full Table Scans&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-06/&quot;&gt;Writing to an Apache Iceberg Table: How Commits and ACID Actually Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-07/&quot;&gt;What Are Lakehouse Catalogs? The Role of Catalogs in Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-08/&quot;&gt;When Catalogs Are Embedded in Storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-09/&quot;&gt;How Data Lake Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;Maintaining Apache Iceberg Tables: Compaction, Expiry, and Cleanup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-11/&quot;&gt;Apache Iceberg Metadata Tables: Querying the Internals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-12/&quot;&gt;Using Apache Iceberg with Python and MPP Query Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-13/&quot;&gt;Approaches to Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-14/&quot;&gt;Hands-On with Apache Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-15/&quot;&gt;Migrating to Apache Iceberg: Strategies for Every Source System&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 7 of a 15-part &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-06/&quot;&gt;Part 6&lt;/a&gt; covered the write process and explained how the catalog enables atomic commits. This article covers what catalogs are, why they matter, and how to choose between the many options available in 2026.&lt;/p&gt;
&lt;p&gt;A lakehouse catalog is the component that answers one question: &amp;quot;Where is the current metadata for this table?&amp;quot; Without a catalog, every engine would need to independently locate and track metadata files. With a catalog, there is a single source of truth that coordinates reads, writes, and access control across all engines.&lt;/p&gt;
&lt;h2&gt;What a Catalog Does&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/catalog-role.png&quot; alt=&quot;The three core responsibilities of a lakehouse catalog&quot;&gt;&lt;/p&gt;
&lt;p&gt;Every Iceberg catalog performs three functions:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Store the current metadata pointer.&lt;/strong&gt; When a query engine asks for table &lt;code&gt;analytics.orders&lt;/code&gt;, the catalog returns the location of the current &lt;code&gt;metadata.json&lt;/code&gt; file (e.g., &lt;code&gt;s3://warehouse/orders/metadata/v42.metadata.json&lt;/code&gt;). This is the most fundamental responsibility. As described in &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-06/&quot;&gt;Part 6&lt;/a&gt;, the atomic update of this pointer is what makes ACID transactions possible.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Manage namespaces.&lt;/strong&gt; Catalogs organize tables into hierarchical namespaces (databases, schemas). This provides logical organization (&lt;code&gt;production.analytics.orders&lt;/code&gt; vs &lt;code&gt;staging.analytics.orders&lt;/code&gt;) and is the foundation for access control.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Enforce access control.&lt;/strong&gt; Catalogs determine which users and engines can read, write, or manage specific tables and namespaces. This ranges from simple table-level permissions to fine-grained column-level and row-level security.&lt;/p&gt;
&lt;h2&gt;The REST Catalog Protocol&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/rest-catalog-protocol.png&quot; alt=&quot;How the Iceberg REST Catalog Protocol works from table load through atomic commit&quot;&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://iceberg.apache.org/spec/#rest-catalog&quot;&gt;Iceberg REST Catalog specification&lt;/a&gt; defines a standard HTTP API for catalog operations. This protocol has become the industry standard because it decouples the catalog implementation from the engine.&lt;/p&gt;
&lt;p&gt;The key operations:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Endpoint&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GET /v1/namespaces/{ns}/tables/{table}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Load table metadata location&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;POST /v1/namespaces/{ns}/tables&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Create a new table&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;POST /v1/namespaces/{ns}/tables/{table}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Commit a table update (CAS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GET /v1/namespaces&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List available namespaces&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DELETE /v1/namespaces/{ns}/tables/{table}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Drop a table&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The protocol includes &lt;strong&gt;credential vending&lt;/strong&gt;: the catalog returns short-lived storage credentials alongside the metadata location, so the engine can access the data files directly without needing permanent storage credentials. This is important for multi-tenant environments where &lt;a href=&quot;https://www.dremio.com/blog/what-is-the-iceberg-rest-catalog/&quot;&gt;Dremio&lt;/a&gt; and other engines need scoped access to specific tables.&lt;/p&gt;
&lt;h3&gt;Why the REST Protocol Matters&lt;/h3&gt;
&lt;p&gt;Before the REST catalog specification, every engine needed a custom integration for each catalog type. Spark had its own Hive Metastore connector, Trino had a different one, and adding a new catalog meant updating every engine. The REST protocol standardizes this: any engine that speaks REST can talk to any catalog that implements the specification.&lt;/p&gt;
&lt;p&gt;This is what makes the Iceberg ecosystem genuinely multi-engine. You can use Spark for ETL, &lt;a href=&quot;https://www.dremio.com/platform/&quot;&gt;Dremio&lt;/a&gt; for interactive analytics, Trino for exploration, and Flink for streaming, all pointed at the same REST catalog. Each engine sees the same tables, the same schemas, and the same snapshots.&lt;/p&gt;
&lt;h3&gt;Multi-Engine Coordination&lt;/h3&gt;
&lt;p&gt;When multiple engines share a catalog, the catalog becomes the coordination point for concurrent access. The &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-06/&quot;&gt;atomic compare-and-swap&lt;/a&gt; mechanism ensures that two engines writing to the same table cannot corrupt each other&apos;s commits. This is fundamentally different from file-system-based metastores where coordination relies on file renames that may not be atomic on object storage.&lt;/p&gt;
&lt;h3&gt;Governance Portability&lt;/h3&gt;
&lt;p&gt;One of the biggest concerns in the catalog landscape is governance portability. Access control policies (who can query what) are defined in the catalog, but there is no industry standard for sharing these policies across catalogs. If you set up row-level security in one catalog, that policy does not automatically transfer to another.&lt;/p&gt;
&lt;p&gt;This is why many architects recommend picking a catalog that will serve as the single governance boundary and having all engines connect through it, rather than having multiple catalogs with duplicate governance rules.&lt;/p&gt;
&lt;h2&gt;The Catalog Landscape&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/catalog-landscape.png&quot; alt=&quot;Open source vs managed catalog options in the 2026 lakehouse ecosystem&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Open Source Catalogs&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://www.dremio.com/blog/the-polaris-catalog-what-it-is-and-getting-started/&quot;&gt;Apache Polaris&lt;/a&gt; (Apache Incubating).&lt;/strong&gt; The leading vendor-neutral REST catalog implementation. Co-created by Snowflake and Dremio, Polaris is designed to be engine-agnostic and cloud-agnostic. It implements the full REST Catalog spec with fine-grained access control and credential vending. It is the foundation for &lt;a href=&quot;https://www.dremio.com/platform/open-catalog/&quot;&gt;Dremio&apos;s Open Catalog&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Nessie.&lt;/strong&gt; Differentiates itself with Git-like branching and merging for data. You can create branches, make changes to multiple tables, and merge them atomically. This is useful for testing pipeline changes or implementing multi-table transactions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Unity Catalog OSS.&lt;/strong&gt; Databricks&apos; open-source catalog offering. It provides multi-format support (Delta, Iceberg, Hudi) and includes AI/ML asset management (models, features). Closely tied to the Databricks ecosystem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Lakekeeper.&lt;/strong&gt; A lightweight, Rust-native REST catalog implementation focused on performance and minimal operational footprint. Good for teams that want a self-hosted catalog without the complexity of larger platforms.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Apache Gravitino (Incubating).&lt;/strong&gt; A federation-focused catalog that can bridge multiple underlying catalogs and storage systems. Designed for organizations that need a unified metadata view across multiple Iceberg catalogs, Hive metastores, and other data sources.&lt;/p&gt;
&lt;h3&gt;Managed Catalogs&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://www.dremio.com/platform/open-catalog/&quot;&gt;Dremio Open Catalog&lt;/a&gt;.&lt;/strong&gt; A managed Polaris-based catalog that includes &lt;a href=&quot;https://www.dremio.com/blog/table-optimization-in-dremio/&quot;&gt;automatic table optimization&lt;/a&gt; (compaction, snapshot expiry, orphan cleanup) and integrates with Dremio&apos;s query engine, semantic layer, and AI capabilities.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AWS Glue.&lt;/strong&gt; Amazon&apos;s managed metastore service. Widely used because it is integrated with the AWS ecosystem (Athena, EMR, Redshift Spectrum). Supports Iceberg tables natively and acts as both a Hive-compatible metastore and an Iceberg catalog.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Databricks Unity Catalog (managed).&lt;/strong&gt; The enterprise version of Unity Catalog with additional governance features, lineage tracking, and AI asset management. Tightly integrated with the Databricks runtime.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Snowflake Horizon.&lt;/strong&gt; Snowflake&apos;s catalog and governance layer that supports Iceberg tables in Snowflake-managed storage.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Google BigLake.&lt;/strong&gt; Google Cloud&apos;s managed metadata service for Iceberg tables on GCS.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Microsoft OneLake.&lt;/strong&gt; Microsoft&apos;s unified storage and catalog layer within the Fabric ecosystem.&lt;/p&gt;
&lt;h2&gt;How to Choose a Catalog&lt;/h2&gt;
&lt;p&gt;The decision depends on three factors:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Recommended Approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multi-engine, vendor-neutral&lt;/td&gt;
&lt;td&gt;REST catalog (Polaris or Lakekeeper)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS-native, minimal ops&lt;/td&gt;
&lt;td&gt;AWS Glue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Databricks ecosystem&lt;/td&gt;
&lt;td&gt;Unity Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Git-style data versioning&lt;/td&gt;
&lt;td&gt;Nessie&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Managed with auto-optimization&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://www.dremio.com/platform/open-catalog/&quot;&gt;Dremio Open Catalog&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-catalog federation&lt;/td&gt;
&lt;td&gt;Gravitino&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The safest long-term choice is a REST-catalog-compatible implementation because every major engine supports the protocol. If you start with a REST catalog, you can swap implementations later without changing your engine configurations.&lt;/p&gt;
&lt;h2&gt;Catalogs Are Not Optional&lt;/h2&gt;
&lt;p&gt;Some teams try to use Iceberg without a proper catalog, relying on Hadoop-style file system catalogs that use file renames for atomicity. This works on HDFS but is unreliable on object storage (S3 does not support atomic renames). For production lakehouses on cloud storage, a proper catalog with server-side compare-and-swap is essential.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-08/&quot;&gt;Part 8&lt;/a&gt; covers a newer approach where the catalog is embedded directly in the storage layer.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Free Resources&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpageiceberg&quot;&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpagepolaris&quot;&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://forms.gle/xdsun6JiRvFY9rB36&quot;&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Buffer Pools, Caches, and the Memory Hierarchy</title><link>https://iceberglakehouse.com/posts/2026-04-29-query-engine-07/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-query-engine-07/</guid><description>
## Query Engine Optimization - Table of Contents

1. [How Query Engines Think: The Tradeoffs Behind Every Data System](/posts/2026-04-29-query-engine...</description><pubDate>Wed, 29 Apr 2026 12:06:00 GMT</pubDate><content:encoded>&lt;h2&gt;Query Engine Optimization - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-01/&quot;&gt;How Query Engines Think: The Tradeoffs Behind Every Data System&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-02/&quot;&gt;Row vs. Column: How Storage Layout Shapes Everything&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-03/&quot;&gt;How Databases Organize Data on Disk: Pages, Blocks, and File Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-04/&quot;&gt;B-Trees, LSM Trees, and the Indexing Tradeoff Spectrum&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-05/&quot;&gt;Inside the Query Optimizer: How Engines Pick a Plan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-06/&quot;&gt;Volcano, Vectorized, Compiled: How Engines Execute Your Query&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-07/&quot;&gt;Buffer Pools, Caches, and the Memory Hierarchy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-08/&quot;&gt;Partitioning, Sharding, and Data Distribution Strategies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-09/&quot;&gt;Hash, Sort-Merge, Broadcast: How Distributed Joins Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-10/&quot;&gt;Concurrency, Isolation, and MVCC: How Engines Handle Contention&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 7 of a 10-part series on query engine design. &lt;a href=&quot;/posts/2026-04-29-query-engine-06/&quot;&gt;Part 6&lt;/a&gt; covered execution models. This article covers how engines manage their most precious resource: memory.&lt;/p&gt;
&lt;p&gt;RAM is 1,000x faster than SSD and 100,000x faster than HDD. The difference between a query that hits cached data and one that reads from disk is the difference between sub-second and minutes. Every database engine invests heavily in keeping the right data in memory and handling the cases where data does not fit.&lt;/p&gt;
&lt;h2&gt;The Memory Hierarchy&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/cache-hit-miss-latency.png&quot; alt=&quot;Cache hit versus cache miss latency showing the 1000x gap between RAM and SSD access&quot;&gt;&lt;/p&gt;
&lt;p&gt;The latency gap between memory tiers is not linear. It is exponential:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Storage Tier&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;Relative Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;L1 CPU cache&lt;/td&gt;
&lt;td&gt;~1 ns&lt;/td&gt;
&lt;td&gt;1x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L3 CPU cache&lt;/td&gt;
&lt;td&gt;~10 ns&lt;/td&gt;
&lt;td&gt;10x slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Main memory (RAM)&lt;/td&gt;
&lt;td&gt;~100 ns&lt;/td&gt;
&lt;td&gt;100x slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NVMe SSD&lt;/td&gt;
&lt;td&gt;~100,000 ns (100 us)&lt;/td&gt;
&lt;td&gt;100,000x slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HDD&lt;/td&gt;
&lt;td&gt;~10,000,000 ns (10 ms)&lt;/td&gt;
&lt;td&gt;10,000,000x slower&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This hierarchy is why caching strategies dominate database engineering. A well-tuned cache turns expensive disk reads into cheap memory lookups for the most frequently accessed data.&lt;/p&gt;
&lt;h2&gt;Buffer Pools: The OLTP Approach&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/buffer-pool-architecture.png&quot; alt=&quot;Buffer pool architecture showing shared memory pages, page table, cache hits, and background writes to disk&quot;&gt;&lt;/p&gt;
&lt;p&gt;Traditional relational databases (PostgreSQL, MySQL/InnoDB, Oracle, SQL Server) use a &lt;strong&gt;buffer pool&lt;/strong&gt;: a region of shared memory that holds copies of disk pages. When a query needs a page, the engine checks the buffer pool first. If the page is there (cache hit), it is returned immediately. If not (cache miss), the page is read from disk into the buffer pool, potentially evicting an older page.&lt;/p&gt;
&lt;h3&gt;Page Replacement Policies&lt;/h3&gt;
&lt;p&gt;When the buffer pool is full and a new page needs to be loaded, the engine must evict an existing page. The policy for choosing which page to evict has a significant impact on cache hit rates.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LRU (Least Recently Used)&lt;/strong&gt; evicts the page that was accessed least recently. Simple but vulnerable to sequential scan pollution: a single full table scan loads every page into the pool, evicting frequently-used index pages. After the scan, the pool is full of pages that will never be accessed again.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Clock&lt;/strong&gt; (used by PostgreSQL) is an approximation of LRU that avoids the overhead of maintaining a sorted access list. Each page has a reference bit. The clock hand sweeps through pages; if the bit is set, it clears it and moves on. If the bit is unset, the page is evicted.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LRU-K&lt;/strong&gt; tracks the K-th most recent access time instead of just the most recent. A page accessed twice in the last minute ranks higher than a page accessed once a second ago. This resists sequential scan pollution because single-access pages never accumulate enough history to rank highly.&lt;/p&gt;
&lt;p&gt;PostgreSQL&apos;s &lt;code&gt;shared_buffers&lt;/code&gt; parameter controls buffer pool size. MySQL&apos;s &lt;code&gt;innodb_buffer_pool_size&lt;/code&gt; does the same. Typical production settings allocate 25-50% of total system memory to the buffer pool.&lt;/p&gt;
&lt;h2&gt;Columnar and Result Caches: The OLAP Approach&lt;/h2&gt;
&lt;p&gt;Analytical engines take a different approach. Instead of caching arbitrary disk pages, they cache data at higher levels of abstraction.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Column caches&lt;/strong&gt; store decoded, decompressed column data in memory in a format ready for vectorized processing. Dremio&apos;s C3 (Columnar Cloud Cache) caches Parquet column data on local NVMe SSDs, avoiding repeated reads from cloud object storage (S3, ADLS, GCS). This is critical because cloud object storage latency is 10-100x higher than local SSD.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result caches&lt;/strong&gt; store the output of entire queries. If the same query runs again and the underlying data has not changed, the cached result is returned instantly without re-executing the query. Snowflake, BigQuery, and Dremio all use result caching. The challenge is cache invalidation: when the underlying data changes, all cached results that depend on it must be invalidated.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Materialized views / Reflections&lt;/strong&gt; precompute and store the results of common query patterns. Dremio&apos;s Reflections are a form of intelligent result caching: the engine automatically creates and maintains aggregation and raw Reflections based on query patterns, and the optimizer transparently routes queries to the appropriate Reflection when it matches. Unlike traditional result caches, Reflections persist across sessions and are automatically refreshed when source data changes.&lt;/p&gt;
&lt;h2&gt;The Memory Budget Tradeoff&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/memory-budget-allocation.png&quot; alt=&quot;How engines divide available memory between data cache, sort buffers, hash tables, and network buffers&quot;&gt;&lt;/p&gt;
&lt;p&gt;Every engine must divide its available memory among competing uses:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data cache&lt;/strong&gt; (buffer pool, column cache): Reduces disk I/O by keeping hot data in memory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sort buffers&lt;/strong&gt;: Used by ORDER BY, MERGE JOIN, and index creation. Larger buffers mean fewer multi-pass external sorts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hash tables&lt;/strong&gt;: Used by hash joins and hash aggregations. If the hash table does not fit in memory, the engine must spill partitions to disk and process them in multiple passes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network buffers&lt;/strong&gt;: In distributed engines, memory is needed for sending and receiving data during shuffles and broadcasts.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The tradeoff is direct: every megabyte allocated to caching is a megabyte unavailable for processing. A large buffer pool reduces cache misses but may force hash joins to spill to disk. A large work memory allocation prevents spills but reduces cache hit rates.&lt;/p&gt;
&lt;p&gt;PostgreSQL exposes this tradeoff through separate parameters: &lt;code&gt;shared_buffers&lt;/code&gt; for the buffer pool and &lt;code&gt;work_mem&lt;/code&gt; for per-operation sort/hash memory. Dremio and Snowflake manage this allocation automatically, adjusting the split based on workload characteristics.&lt;/p&gt;
&lt;h2&gt;Spill-to-Disk Strategies&lt;/h2&gt;
&lt;p&gt;When an operation exceeds its memory budget, the engine must &amp;quot;spill&amp;quot; intermediate data to disk and continue processing. This is slower than in-memory processing but prevents out-of-memory failures.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;External sort&lt;/strong&gt;: Divide the data into runs that each fit in memory. Sort each run. Write sorted runs to temporary files. Merge the runs using a k-way merge. For very large datasets, this may require multiple merge passes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Grace hash join&lt;/strong&gt;: When the hash table for a join does not fit in memory, partition both sides of the join by hash value and write partitions to disk. Then process each partition pair independently, where each partition&apos;s hash table fits in memory.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hybrid hash join&lt;/strong&gt;: Keep as many partitions of the hash table in memory as possible. Only spill the partitions that do not fit. This reduces I/O compared to pure Grace hash join when the data is only slightly larger than memory.&lt;/p&gt;
&lt;p&gt;All major engines support spill-to-disk: PostgreSQL, Spark, Dremio, DuckDB, Snowflake. The key difference is how gracefully performance degrades. Some engines experience a sharp cliff when spilling starts; others degrade gradually.&lt;/p&gt;
&lt;h2&gt;Where Real Systems Land&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Primary Cache&lt;/th&gt;
&lt;th&gt;Eviction Policy&lt;/th&gt;
&lt;th&gt;Spill Strategy&lt;/th&gt;
&lt;th&gt;Cloud Cache&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;Buffer pool (pages)&lt;/td&gt;
&lt;td&gt;Clock&lt;/td&gt;
&lt;td&gt;External sort, hash spill&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MySQL/InnoDB&lt;/td&gt;
&lt;td&gt;Buffer pool (pages)&lt;/td&gt;
&lt;td&gt;LRU with young/old sublists&lt;/td&gt;
&lt;td&gt;External sort&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DuckDB&lt;/td&gt;
&lt;td&gt;OS page cache&lt;/td&gt;
&lt;td&gt;OS-managed&lt;/td&gt;
&lt;td&gt;External sort, hash spill&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snowflake&lt;/td&gt;
&lt;td&gt;Result cache + local SSD&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;td&gt;Automatic spill&lt;/td&gt;
&lt;td&gt;SSD cache for remote storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dremio&lt;/td&gt;
&lt;td&gt;C3 columnar cache + result cache&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;td&gt;Hash spill, sort spill&lt;/td&gt;
&lt;td&gt;NVMe SSD for S3/ADLS/GCS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spark&lt;/td&gt;
&lt;td&gt;Unified memory manager&lt;/td&gt;
&lt;td&gt;LRU within pools&lt;/td&gt;
&lt;td&gt;Spill to local disk&lt;/td&gt;
&lt;td&gt;N/A (relies on cluster storage)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ClickHouse&lt;/td&gt;
&lt;td&gt;OS page cache + marks cache&lt;/td&gt;
&lt;td&gt;OS-managed&lt;/td&gt;
&lt;td&gt;Partial sort spill&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The pattern: OLTP engines cache at the page level with explicit buffer pools. OLAP engines cache at higher levels (columns, results, materialized views) and rely on the OS or local SSDs for lower-level caching.&lt;/p&gt;
&lt;p&gt;Memory management is rarely the most visible part of a database engine, but it is often the most impactful. The difference between a query that runs entirely in memory and one that spills to disk can be 10-100x in execution time.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Writing to an Apache Iceberg Table: How Commits and ACID Actually Work</title><link>https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-06/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-06/</guid><description>
## Apache Iceberg Masterclass - Table of Contents

1. [What Are Table Formats and Why Were They Needed?](/posts/2026-04-29-iceberg-masterclass-01/)
2...</description><pubDate>Wed, 29 Apr 2026 12:05:00 GMT</pubDate><content:encoded>&lt;h2&gt;Apache Iceberg Masterclass - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-02/&quot;&gt;The Metadata Structure of Modern Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-03/&quot;&gt;Performance and Apache Iceberg&apos;s Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-04/&quot;&gt;Partition Evolution: Change Your Partitioning Without Rewriting Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-05/&quot;&gt;Hidden Partitioning: How Iceberg Eliminates Accidental Full Table Scans&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-06/&quot;&gt;Writing to an Apache Iceberg Table: How Commits and ACID Actually Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-07/&quot;&gt;What Are Lakehouse Catalogs? The Role of Catalogs in Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-08/&quot;&gt;When Catalogs Are Embedded in Storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-09/&quot;&gt;How Data Lake Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;Maintaining Apache Iceberg Tables: Compaction, Expiry, and Cleanup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-11/&quot;&gt;Apache Iceberg Metadata Tables: Querying the Internals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-12/&quot;&gt;Using Apache Iceberg with Python and MPP Query Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-13/&quot;&gt;Approaches to Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-14/&quot;&gt;Hands-On with Apache Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-15/&quot;&gt;Migrating to Apache Iceberg: Strategies for Every Source System&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 6 of a 15-part &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-05/&quot;&gt;Part 5&lt;/a&gt; covered hidden partitioning. This article walks through the exact steps an engine takes when writing data to an Iceberg table, when the write becomes visible, and how concurrent writers are handled.&lt;/p&gt;
&lt;p&gt;Understanding the write process is critical because it explains why Iceberg can provide ACID guarantees on top of object storage, something that seems impossible when you consider that S3, ADLS, and GCS have no built-in transaction support. The answer is that ACID lives entirely in the metadata layer, not in storage.&lt;/p&gt;
&lt;h2&gt;The Six Steps of a Write&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/write-process-flow.png&quot; alt=&quot;The Iceberg write process from data file creation through the atomic catalog pointer swap&quot;&gt;&lt;/p&gt;
&lt;p&gt;Every write operation (INSERT, DELETE, UPDATE, MERGE) follows the same six-step sequence:&lt;/p&gt;
&lt;h3&gt;Step 1: Write Data Files&lt;/h3&gt;
&lt;p&gt;The engine writes new Parquet (or ORC/Avro) files to object storage. These files are placed in the table&apos;s data directory but are not yet referenced by any metadata. At this point, they are invisible to all readers. They are just orphan files sitting in storage.&lt;/p&gt;
&lt;h3&gt;Step 2: Create Manifest Entries&lt;/h3&gt;
&lt;p&gt;For each new data file, the engine creates a manifest entry containing the file path, file size, row count, partition values (computed using the table&apos;s &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-05/&quot;&gt;partition transforms&lt;/a&gt;), and column-level statistics (min, max, null count).&lt;/p&gt;
&lt;h3&gt;Step 3: Create or Update Manifest Files&lt;/h3&gt;
&lt;p&gt;The engine bundles manifest entries into Avro-format manifest files. If the write affects only a single partition, it may create one new manifest. If it touches many partitions, it may create multiple manifests. Existing manifests from previous snapshots that were not modified are carried forward by reference, not copied.&lt;/p&gt;
&lt;h3&gt;Step 4: Create a Manifest List&lt;/h3&gt;
&lt;p&gt;A new manifest list (Avro) is created that references all manifests for this snapshot: the new manifests from Step 3 plus the unchanged manifests inherited from the previous snapshot. This manifest list represents the complete state of the table after this write.&lt;/p&gt;
&lt;h3&gt;Step 5: Create New Metadata File&lt;/h3&gt;
&lt;p&gt;A new &lt;code&gt;metadata.json&lt;/code&gt; file is written, containing the table schema, partition spec, properties, and the snapshot list. The new snapshot (pointing to the manifest list from Step 4) is appended to the list. The previous &lt;code&gt;metadata.json&lt;/code&gt; remains in storage, unchanged.&lt;/p&gt;
&lt;h3&gt;Step 6: Atomic Commit (The Pointer Swap)&lt;/h3&gt;
&lt;p&gt;The engine asks the &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-07/&quot;&gt;catalog&lt;/a&gt; to update its pointer from the old &lt;code&gt;metadata.json&lt;/code&gt; to the new one. This is a compare-and-swap operation: the catalog checks that the current pointer matches what the engine expects, and only then updates it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;This is the exact moment the transaction commits.&lt;/strong&gt; Before the swap, readers see the old snapshot. After the swap, readers see the new snapshot. There is no in-between state.&lt;/p&gt;
&lt;h2&gt;Why This Provides ACID Guarantees&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/atomic-commit-pointer-swap.png&quot; alt=&quot;How ACID works through the atomic metadata pointer swap&quot;&gt;&lt;/p&gt;
&lt;p&gt;The pointer swap mechanism delivers all four ACID properties:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Atomicity.&lt;/strong&gt; The entire write is visible or invisible. If the engine crashes after writing data files but before the pointer swap, the data files are orphans. They exist in storage but no metadata references them. Readers never see partial writes. A cleanup process (covered in &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;Part 10&lt;/a&gt;) can remove these orphans later.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Consistency.&lt;/strong&gt; The new &lt;code&gt;metadata.json&lt;/code&gt; contains a valid schema, valid partition specs, and consistent statistics. The catalog only accepts the swap if the metadata file is well-formed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Isolation.&lt;/strong&gt; Readers load a specific snapshot and operate on it for the duration of their query. Even if a new snapshot is committed while they are reading, their query continues to see the snapshot they started with. This is snapshot isolation, and it happens naturally because each snapshot is immutable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Durability.&lt;/strong&gt; Once the catalog confirms the pointer swap, the new state is persisted. The metadata file and all data files are already in durable object storage. The catalog&apos;s own persistence layer (a database for &lt;a href=&quot;https://www.dremio.com/blog/what-is-the-iceberg-rest-catalog/&quot;&gt;REST catalogs&lt;/a&gt;, a metastore for Hive) provides the durability guarantee for the pointer itself.&lt;/p&gt;
&lt;h2&gt;Concurrent Writes: Optimistic Concurrency Control&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/concurrent-write-conflict.png&quot; alt=&quot;How two concurrent writers are resolved through optimistic concurrency with retry on conflict&quot;&gt;&lt;/p&gt;
&lt;p&gt;When two engines write to the same table simultaneously, Iceberg uses optimistic concurrency control (OCC):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Both writers read the current metadata&lt;/strong&gt; (say &lt;code&gt;v1.metadata.json&lt;/code&gt;) and begin their writes independently.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Writer A finishes first&lt;/strong&gt; and successfully swaps the catalog pointer from &lt;code&gt;v1&lt;/code&gt; to &lt;code&gt;v2&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Writer B attempts to commit&lt;/strong&gt; by swapping from &lt;code&gt;v1&lt;/code&gt; to &lt;code&gt;v3&lt;/code&gt;. The catalog detects that the current pointer is &lt;code&gt;v2&lt;/code&gt;, not &lt;code&gt;v1&lt;/code&gt;. The swap fails.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Writer B retries.&lt;/strong&gt; It reads &lt;code&gt;v2.metadata.json&lt;/code&gt; and checks whether its changes conflict with Writer A&apos;s changes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;No conflict (different partitions):&lt;/strong&gt; Writer B&apos;s new files affect partition &lt;code&gt;region=west&lt;/code&gt;, and Writer A&apos;s changes affected &lt;code&gt;region=east&lt;/code&gt;. The changes are compatible. Writer B rebases its manifest list to include Writer A&apos;s manifests and creates a new &lt;code&gt;v3.metadata.json&lt;/code&gt; that reflects both writes. The swap from &lt;code&gt;v2&lt;/code&gt; to &lt;code&gt;v3&lt;/code&gt; succeeds.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Conflict (same files modified):&lt;/strong&gt; Both writers modified the same data files (e.g., both deleted rows from the same file). The changes cannot be automatically merged. Writer B&apos;s operation fails with a conflict error.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This model works well for append-heavy workloads (multiple jobs writing to different partitions), which is the dominant pattern in data lakes. &lt;a href=&quot;https://www.dremio.com/blog/compaction-in-apache-iceberg-fine-tuning-your-iceberg-tables-data-files/&quot;&gt;Dremio&lt;/a&gt; handles concurrent writes and automatic retries through its engine, and its &lt;a href=&quot;https://www.dremio.com/platform/open-catalog/&quot;&gt;Open Catalog&lt;/a&gt; provides the atomic compare-and-swap through the REST catalog protocol.&lt;/p&gt;
&lt;h2&gt;Delete and Update Operations&lt;/h2&gt;
&lt;p&gt;Iceberg supports three strategies for modifying existing rows:&lt;/p&gt;
&lt;h3&gt;Copy-on-Write (COW)&lt;/h3&gt;
&lt;p&gt;The engine reads the affected data files, removes or modifies the target rows, and writes entirely new files containing the result. The old files are removed from the manifest (marked as deleted), and the new files are added. This is simple but expensive for large files when only a few rows change.&lt;/p&gt;
&lt;h3&gt;Merge-on-Read (MOR) with Position Delete Files&lt;/h3&gt;
&lt;p&gt;Instead of rewriting data files, the engine writes a small &amp;quot;position delete file&amp;quot; that lists the file path and row positions of deleted rows. At read time, the engine reads both the data file and the delete file, filtering out deleted rows during scan. This makes writes fast but adds read-time overhead.&lt;/p&gt;
&lt;h3&gt;Merge-on-Read with Deletion Vectors (Iceberg v2+)&lt;/h3&gt;
&lt;p&gt;Deletion vectors are a compact bitmap representation of deleted rows within a file. They are more storage-efficient than position delete files and faster to evaluate during reads. Engines like &lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/&quot;&gt;Dremio&lt;/a&gt; and Spark use deletion vectors for row-level updates in production.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Write Cost&lt;/th&gt;
&lt;th&gt;Read Cost&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Copy-on-Write&lt;/td&gt;
&lt;td&gt;High (rewrite files)&lt;/td&gt;
&lt;td&gt;Low (clean files)&lt;/td&gt;
&lt;td&gt;Infrequent bulk updates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Position Deletes&lt;/td&gt;
&lt;td&gt;Low (small delete file)&lt;/td&gt;
&lt;td&gt;Medium (merge at read)&lt;/td&gt;
&lt;td&gt;Frequent targeted deletes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deletion Vectors&lt;/td&gt;
&lt;td&gt;Low (compact bitmap)&lt;/td&gt;
&lt;td&gt;Low-Medium (bitmap check)&lt;/td&gt;
&lt;td&gt;High-frequency row updates&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;What Happens to Old Data?&lt;/h2&gt;
&lt;p&gt;After a commit, the previous snapshot&apos;s data files are not deleted. They remain in storage and are referenced by the old snapshot. This enables:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Time travel&lt;/strong&gt;: Query the table as of any retained snapshot&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rollback&lt;/strong&gt;: Revert the table to a previous snapshot if a bad write is detected&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Incremental reads&lt;/strong&gt;: Process only the files that changed between two snapshots&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Eventually, old snapshots are expired (removed from the metadata) and their orphan files are cleaned up. This maintenance is covered in &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;Part 10&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;The Catalog&apos;s Role in Commits&lt;/h2&gt;
&lt;p&gt;The catalog is the gatekeeper of consistency. Without a catalog providing atomic compare-and-swap, concurrent writers could overwrite each other&apos;s commits. The choice of catalog affects write reliability:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;REST catalogs&lt;/strong&gt; (&lt;a href=&quot;https://www.dremio.com/platform/open-catalog/&quot;&gt;Dremio Open Catalog&lt;/a&gt;, Polaris) provide server-side CAS operations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hive Metastore&lt;/strong&gt; uses database-level locking for CAS&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AWS Glue&lt;/strong&gt; provides CAS through its API&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hadoop Filesystem&lt;/strong&gt; catalogs use file-system rename atomicity (less reliable on object storage)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-07/&quot;&gt;Part 7&lt;/a&gt; covers the catalog landscape in detail.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Free Resources&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpageiceberg&quot;&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpagepolaris&quot;&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://forms.gle/xdsun6JiRvFY9rB36&quot;&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Volcano, Vectorized, Compiled: How Engines Execute Your Query</title><link>https://iceberglakehouse.com/posts/2026-04-29-query-engine-06/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-query-engine-06/</guid><description>
## Query Engine Optimization - Table of Contents

1. [How Query Engines Think: The Tradeoffs Behind Every Data System](/posts/2026-04-29-query-engine...</description><pubDate>Wed, 29 Apr 2026 12:05:00 GMT</pubDate><content:encoded>&lt;h2&gt;Query Engine Optimization - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-01/&quot;&gt;How Query Engines Think: The Tradeoffs Behind Every Data System&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-02/&quot;&gt;Row vs. Column: How Storage Layout Shapes Everything&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-03/&quot;&gt;How Databases Organize Data on Disk: Pages, Blocks, and File Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-04/&quot;&gt;B-Trees, LSM Trees, and the Indexing Tradeoff Spectrum&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-05/&quot;&gt;Inside the Query Optimizer: How Engines Pick a Plan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-06/&quot;&gt;Volcano, Vectorized, Compiled: How Engines Execute Your Query&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-07/&quot;&gt;Buffer Pools, Caches, and the Memory Hierarchy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-08/&quot;&gt;Partitioning, Sharding, and Data Distribution Strategies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-09/&quot;&gt;Hash, Sort-Merge, Broadcast: How Distributed Joins Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-10/&quot;&gt;Concurrency, Isolation, and MVCC: How Engines Handle Contention&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 6 of a 10-part series on query engine design. &lt;a href=&quot;/posts/2026-04-29-query-engine-05/&quot;&gt;Part 5&lt;/a&gt; covered how optimizers pick a plan. This article covers what happens next: how the engine actually processes data through the operators in that plan.&lt;/p&gt;
&lt;p&gt;The execution model determines how data flows between operators (scan, filter, join, aggregate) and how each operator processes that data internally. The choice has a direct impact on CPU utilization, and in modern analytical engines where I/O is no longer the primary bottleneck, CPU efficiency is the performance differentiator.&lt;/p&gt;
&lt;h2&gt;Volcano: One Row at a Time&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/volcano-iterator-model.png&quot; alt=&quot;Volcano iterator model showing operators passing one tuple at a time via Next calls&quot;&gt;&lt;/p&gt;
&lt;p&gt;The Volcano model (also called the iterator model) was introduced by Goetz Graefe in 1994 and became the standard execution model for relational databases. PostgreSQL, MySQL, SQLite, and most traditional RDBMS engines use it.&lt;/p&gt;
&lt;p&gt;Every operator implements three methods:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Open()&lt;/strong&gt;: Initialize the operator (allocate buffers, open files).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Next()&lt;/strong&gt;: Return the next row. The operator calls Next() on its child operator to get input, processes it, and returns one output row.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Close()&lt;/strong&gt;: Release resources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The parent operator &amp;quot;pulls&amp;quot; data from its children one row at a time. A query plan tree of three operators (Scan, Filter, Project) processing 1 million rows results in 1 million Next() calls on each operator, totaling 3 million virtual function calls.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it was good&lt;/strong&gt;: The model is elegant and modular. Adding a new operator means implementing three methods. Operators are composable: any operator can sit on top of any other. Memory usage is minimal because only one row exists in flight at any point.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it struggles on modern hardware&lt;/strong&gt;: Those millions of virtual function calls cause two problems. First, each call has overhead (function pointer indirection, stack frame setup). Second, the CPU&apos;s branch predictor cannot predict virtual dispatch, causing pipeline stalls. For a table with a billion rows and a plan with 5 operators, that is 5 billion function calls where the CPU is stalling instead of computing.&lt;/p&gt;
&lt;h2&gt;Vectorized: Batches of Rows, Column at a Time&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/vectorized-batch-model.png&quot; alt=&quot;Vectorized execution processing batches of 1024 rows with SIMD-friendly tight loops&quot;&gt;&lt;/p&gt;
&lt;p&gt;Vectorized execution keeps the pull-based structure of Volcano but changes the granularity. Instead of returning one row per Next() call, each operator returns a batch (vector) of rows, typically 1024 to 4096 rows.&lt;/p&gt;
&lt;p&gt;Inside each operator, processing happens one column at a time in tight loops. A filter operator checking &lt;code&gt;price &amp;gt; 100&lt;/code&gt; runs a simple loop over the &lt;code&gt;price&lt;/code&gt; column array:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for i in 0..batch_size:
    selection[i] = prices[i] &amp;gt; 100
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This loop has three properties that make it fast:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;No virtual function calls inside the loop&lt;/strong&gt;. The loop body is a direct comparison, not a function pointer dispatch.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CPU cache friendly&lt;/strong&gt;. The &lt;code&gt;prices&lt;/code&gt; array is contiguous in memory. The CPU prefetcher loads the next cache line automatically.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SIMD compatible&lt;/strong&gt;. The compiler can auto-vectorize this loop to process 4-16 values per CPU instruction using SIMD (Single Instruction, Multiple Data) instructions like AVX-256 or AVX-512.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The result: vectorized execution processes analytical queries 5-10x faster than Volcano on the same hardware, purely from CPU efficiency improvements.&lt;/p&gt;
&lt;p&gt;DuckDB, ClickHouse, Dremio, Snowflake, and Velox (Meta&apos;s execution library) all use vectorized execution. DuckDB&apos;s implementation is particularly well-documented as a reference for the approach.&lt;/p&gt;
&lt;h2&gt;Code Generation: Fusing Operators Into Compiled Code&lt;/h2&gt;
&lt;p&gt;Code generation (also called &amp;quot;compiled execution&amp;quot; or &amp;quot;query compilation&amp;quot;) takes a different approach. Instead of passing data between separate operator objects, the engine generates a custom program for each query that fuses multiple operators into a single tight loop.&lt;/p&gt;
&lt;p&gt;For a query &lt;code&gt;SELECT name FROM users WHERE age &amp;gt; 30&lt;/code&gt;, instead of three separate operators (Scan, Filter, Project), the engine generates something equivalent to:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for each row in users_table:
    if row.age &amp;gt; 30:
        emit(row.name)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are no operator boundaries, no Next() calls, no batch transfers. The data stays in CPU registers as long as possible. The generated code is compiled (JIT or ahead-of-time) into native machine instructions.&lt;/p&gt;
&lt;p&gt;Apache Spark uses whole-stage code generation (Tungsten) to fuse chains of operators into single Java methods that the JVM JIT-compiles. Hyper (from the TUM database group, now part of Tableau) and its successor Umbra compile queries into native LLVM code.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The tradeoff&lt;/strong&gt;: Generated code is harder to debug, profile, and maintain than modular operator trees. When something goes wrong, you are debugging auto-generated code rather than a clean operator abstraction. Adding new operator types requires integrating them into the code generator rather than implementing a simple interface.&lt;/p&gt;
&lt;h2&gt;Morsel-Driven Parallelism&lt;/h2&gt;
&lt;p&gt;A complementary technique used by DuckDB and Hyper is morsel-driven parallelism. Instead of statically partitioning data across threads at the beginning, the engine divides data into small chunks called &amp;quot;morsels&amp;quot; (typically 10K rows) and assigns them dynamically to worker threads from a shared work queue.&lt;/p&gt;
&lt;p&gt;When a thread finishes its morsel, it picks up the next one. If one thread is slower (due to cache misses, OS scheduling, or data skew), the other threads absorb the remaining work. This achieves near-perfect CPU utilization without the straggler problem that plagues static partitioning.&lt;/p&gt;
&lt;p&gt;Morsel-driven parallelism works particularly well with vectorized execution: each thread processes its morsel using the same vectorized operators, and the morsel boundaries align naturally with batch sizes.&lt;/p&gt;
&lt;h2&gt;The Abstraction vs. Performance Spectrum&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/execution-model-comparison.png&quot; alt=&quot;Three execution models compared: Volcano for simplicity, Vectorized for CPU efficiency, Code Generation for maximum performance&quot;&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Data Unit&lt;/th&gt;
&lt;th&gt;Overhead&lt;/th&gt;
&lt;th&gt;CPU Efficiency&lt;/th&gt;
&lt;th&gt;Modularity&lt;/th&gt;
&lt;th&gt;Systems&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Volcano&lt;/td&gt;
&lt;td&gt;1 row&lt;/td&gt;
&lt;td&gt;High (virtual calls)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;PostgreSQL, MySQL, SQLite&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vectorized&lt;/td&gt;
&lt;td&gt;1024+ rows&lt;/td&gt;
&lt;td&gt;Low (batch amortized)&lt;/td&gt;
&lt;td&gt;High (SIMD)&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;DuckDB, ClickHouse, Dremio, Snowflake&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code Gen&lt;/td&gt;
&lt;td&gt;Continuous stream&lt;/td&gt;
&lt;td&gt;Minimal (fused code)&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Spark Tungsten, Hyper&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The evolution reflects a shift in bottlenecks. When Volcano was designed in 1994, disk I/O dominated query time. The CPU overhead of per-row function calls was irrelevant compared to waiting for disk seeks. Modern SSDs and in-memory processing have made I/O fast enough that CPU efficiency now determines query performance for many analytical workloads.&lt;/p&gt;
&lt;h2&gt;Hybrid Approaches&lt;/h2&gt;
&lt;p&gt;Some engines combine models. Spark uses code generation for simple operator chains (filter, project, aggregate) but falls back to Volcano-style iteration for complex operators (certain join types, UDFs) that are difficult to fuse.&lt;/p&gt;
&lt;p&gt;Dremio uses vectorized execution with Apache Arrow as the in-memory columnar format. Arrow&apos;s fixed-width column arrays are designed specifically for SIMD-friendly vectorized processing, making the data format and execution model work together.&lt;/p&gt;
&lt;p&gt;PostgreSQL has added JIT compilation (via LLVM) for expression evaluation since version 11, keeping the Volcano operator structure but compiling individual filter and projection expressions into native code. This is a targeted optimization rather than a full model change.&lt;/p&gt;
&lt;h2&gt;When the Model Matters&lt;/h2&gt;
&lt;p&gt;For OLTP workloads (point lookups, small updates), the execution model rarely matters. A query that touches 1-10 rows does not benefit from batch processing or SIMD because the overhead per query (parsing, planning, transaction management) dominates.&lt;/p&gt;
&lt;p&gt;For OLAP workloads (scanning millions to billions of rows), the execution model is one of the most important performance factors. A 10x difference in CPU efficiency on a query that scans 10 billion rows translates to minutes of wall-clock time.&lt;/p&gt;
&lt;p&gt;This is why analytical engines have universally moved away from pure Volcano toward vectorized or compiled execution, while transactional engines have largely stayed with Volcano and focused their optimization efforts elsewhere (buffer management, concurrency control, index efficiency).&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Hidden Partitioning: How Iceberg Eliminates Accidental Full Table Scans</title><link>https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-05/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-05/</guid><description>
## Apache Iceberg Masterclass - Table of Contents

1. [What Are Table Formats and Why Were They Needed?](/posts/2026-04-29-iceberg-masterclass-01/)
2...</description><pubDate>Wed, 29 Apr 2026 12:04:00 GMT</pubDate><content:encoded>&lt;h2&gt;Apache Iceberg Masterclass - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-02/&quot;&gt;The Metadata Structure of Modern Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-03/&quot;&gt;Performance and Apache Iceberg&apos;s Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-04/&quot;&gt;Partition Evolution: Change Your Partitioning Without Rewriting Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-05/&quot;&gt;Hidden Partitioning: How Iceberg Eliminates Accidental Full Table Scans&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-06/&quot;&gt;Writing to an Apache Iceberg Table: How Commits and ACID Actually Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-07/&quot;&gt;What Are Lakehouse Catalogs? The Role of Catalogs in Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-08/&quot;&gt;When Catalogs Are Embedded in Storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-09/&quot;&gt;How Data Lake Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;Maintaining Apache Iceberg Tables: Compaction, Expiry, and Cleanup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-11/&quot;&gt;Apache Iceberg Metadata Tables: Querying the Internals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-12/&quot;&gt;Using Apache Iceberg with Python and MPP Query Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-13/&quot;&gt;Approaches to Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-14/&quot;&gt;Hands-On with Apache Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-15/&quot;&gt;Migrating to Apache Iceberg: Strategies for Every Source System&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 5 of a 15-part &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-04/&quot;&gt;Part 4&lt;/a&gt; covered partition evolution. This article covers hidden partitioning, the feature that ensures users never need to know how their data is physically organized.&lt;/p&gt;
&lt;p&gt;The most expensive mistake in data lake querying is the accidental full table scan: a query that reads every file because the user did not correctly reference the partition columns. In Hive, this happens constantly. In Iceberg, it is structurally impossible because users never reference partition columns at all.&lt;/p&gt;
&lt;h2&gt;The Accidental Full Scan Problem&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/exposed-vs-hidden-partitioning.png&quot; alt=&quot;Exposed partitioning in Hive versus hidden partitioning in Iceberg showing the same pruning with different user experience&quot;&gt;&lt;/p&gt;
&lt;p&gt;In Hive, a table partitioned by &lt;code&gt;year&lt;/code&gt;, &lt;code&gt;month&lt;/code&gt;, and &lt;code&gt;day&lt;/code&gt; requires queries to filter on those exact columns:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Hive: This prunes correctly
SELECT * FROM orders WHERE year = 2024 AND month = 3 AND day = 15

-- Hive: This scans EVERYTHING (no pruning)
SELECT * FROM orders WHERE order_date = &apos;2024-03-15&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The second query reads every partition because Hive does not know that &lt;code&gt;order_date&lt;/code&gt; maps to the &lt;code&gt;year&lt;/code&gt;, &lt;code&gt;month&lt;/code&gt;, and &lt;code&gt;day&lt;/code&gt; partition columns. There is no error, no warning. The query simply runs 100x slower than it should.&lt;/p&gt;
&lt;p&gt;This happens because Hive partitioning is &amp;quot;exposed.&amp;quot; The physical partition columns (&lt;code&gt;year&lt;/code&gt;, &lt;code&gt;month&lt;/code&gt;, &lt;code&gt;day&lt;/code&gt;) are separate from the source column (&lt;code&gt;order_date&lt;/code&gt;). Users must understand this mapping and construct their filters accordingly.&lt;/p&gt;
&lt;h2&gt;How Iceberg Hides Partitioning&lt;/h2&gt;
&lt;p&gt;Iceberg flips this model. Users filter on the source column (&lt;code&gt;order_date&lt;/code&gt;), and the engine automatically maps the filter to the partition values using &lt;a href=&quot;https://iceberg.apache.org/spec/#partitioning&quot;&gt;transform functions&lt;/a&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Iceberg: This prunes correctly. Always.
SELECT * FROM orders WHERE order_date = &apos;2024-03-15&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The table&apos;s partition spec declares: &lt;code&gt;PARTITIONED BY (day(order_date))&lt;/code&gt;. When the engine processes this query, it:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Reads the partition spec from the table metadata&lt;/li&gt;
&lt;li&gt;Applies the &lt;code&gt;day()&lt;/code&gt; transform to the filter value: &lt;code&gt;day(&apos;2024-03-15&apos;)&lt;/code&gt; = &lt;code&gt;2024-03-15&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Checks manifest entries for files with matching partition values&lt;/li&gt;
&lt;li&gt;Skips every file whose partition value is not &lt;code&gt;2024-03-15&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The user writes natural SQL against the source columns. The engine handles the physical-to-logical mapping. This is why it is called &amp;quot;hidden&amp;quot; partitioning: the partition structure is invisible to the user.&lt;/p&gt;
&lt;h2&gt;The Six Transform Functions&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/partition-transform-functions.png&quot; alt=&quot;Iceberg&apos;s six partition transform functions showing how each maps source values to partition values&quot;&gt;&lt;/p&gt;
&lt;p&gt;Iceberg defines six &lt;a href=&quot;https://iceberg.apache.org/spec/#partition-transforms&quot;&gt;partition transforms&lt;/a&gt; that map source column values to partition values:&lt;/p&gt;
&lt;h3&gt;Temporal Transforms&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Transform&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;year(ts)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2024-03-15 10:30:00&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2024&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Low-volume tables, yearly reporting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;month(ts)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2024-03-15 10:30:00&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2024-03&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Medium-volume tables, monthly queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;day(ts)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2024-03-15 10:30:00&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2024-03-15&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;High-volume tables, daily queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hour(ts)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2024-03-15 10:30:00&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2024-03-15-10&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Very high-volume streaming data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The temporal transforms are hierarchical. If a table is partitioned by &lt;code&gt;day(ts)&lt;/code&gt; and a user filters &lt;code&gt;WHERE ts &amp;gt;= &apos;2024-03-01&apos; AND ts &amp;lt; &apos;2024-04-01&apos;&lt;/code&gt;, the engine recognizes this as a range of days and prunes to only the 31 matching partitions. Engines like &lt;a href=&quot;https://www.dremio.com/blog/fewer-accidental-full-table-scans-brought-to-you-by-apache-icebergs-hidden-partitioning/&quot;&gt;Dremio&lt;/a&gt; handle this mapping automatically for equality, range, and IN-list predicates.&lt;/p&gt;
&lt;h3&gt;Value Transforms&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Transform&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;truncate(N, col)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&apos;New York&apos;&lt;/code&gt; (N=3)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&apos;New&apos;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Grouping strings by prefix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bucket(N, col)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;12345&lt;/code&gt; (N=16)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;7&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Even distribution of high-cardinality columns&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;truncate(N, col)&lt;/code&gt;&lt;/strong&gt; takes the first N characters of a string (or truncates a number to a width). This is useful when you want to group data by a string prefix without creating one partition per unique value.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;bucket(N, col)&lt;/code&gt;&lt;/strong&gt; applies a hash function and mod N to produce a bucket number from 0 to N-1. This distributes data evenly across a fixed number of buckets, regardless of the column&apos;s value distribution. It is the go-to transform for high-cardinality columns like &lt;code&gt;user_id&lt;/code&gt; or &lt;code&gt;order_id&lt;/code&gt; where identity partitioning would create millions of tiny partitions.&lt;/p&gt;
&lt;h3&gt;The Identity Transform&lt;/h3&gt;
&lt;p&gt;The identity transform (&lt;code&gt;identity(col)&lt;/code&gt;) uses the raw column value as the partition value. This is equivalent to Hive-style partitioning, but the column is still &amp;quot;hidden&amp;quot; because the engine handles the mapping. It is useful for low-cardinality columns like &lt;code&gt;region&lt;/code&gt; or &lt;code&gt;status&lt;/code&gt; where each unique value should be its own partition.&lt;/p&gt;
&lt;h2&gt;How Pruning Works Under the Hood&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/hidden-partition-pruning-flow.png&quot; alt=&quot;Step-by-step flow showing how the engine maps a user query through the partition spec to prune files&quot;&gt;&lt;/p&gt;
&lt;p&gt;The pruning process works in three phases:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Phase 1: Predicate translation.&lt;/strong&gt; The engine examines each &lt;code&gt;WHERE&lt;/code&gt; clause predicate and checks if the filtered column is part of the partition spec. If &lt;code&gt;order_date&lt;/code&gt; is the source column for &lt;code&gt;day(order_date)&lt;/code&gt;, the engine can translate &lt;code&gt;order_date = &apos;2024-03-15&apos;&lt;/code&gt; into a partition filter.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Phase 2: Manifest list evaluation.&lt;/strong&gt; The manifest list stores partition value ranges per manifest. The engine checks if each manifest&apos;s range includes the target partition value. Manifests whose range does not overlap are skipped entirely.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Phase 3: Manifest entry evaluation.&lt;/strong&gt; For each surviving manifest, the engine checks individual file entries. Only files whose partition value matches &lt;code&gt;2024-03-15&lt;/code&gt; are included in the scan plan.&lt;/p&gt;
&lt;p&gt;This is the same pruning cascade described in &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-03/&quot;&gt;Part 3&lt;/a&gt;, but now the partition values were derived automatically from the user&apos;s filter on a source column.&lt;/p&gt;
&lt;h2&gt;Choosing the Right Transform&lt;/h2&gt;
&lt;p&gt;The choice of partition transform depends on data volume and query patterns:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Recommended Transform&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10 GB/day of event data&lt;/td&gt;
&lt;td&gt;&lt;code&gt;day(event_time)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Each day is one partition (~10 GB), well-sized files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 TB/day of event data&lt;/td&gt;
&lt;td&gt;&lt;code&gt;hour(event_time)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Each hour is ~42 GB, prevents oversized partitions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500 MB/month of reports&lt;/td&gt;
&lt;td&gt;&lt;code&gt;month(report_date)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Monthly partitions keep file counts manageable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User-level data, 10M users&lt;/td&gt;
&lt;td&gt;&lt;code&gt;bucket(64, user_id)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Even distribution, avoids millions of tiny partitions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Region-based data, 5 regions&lt;/td&gt;
&lt;td&gt;&lt;code&gt;identity(region)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Only 5 partitions, each meaningfully distinct&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The goal is to create partitions that are large enough to contain optimally-sized files (128-512 MB each) but small enough that partition pruning eliminates most files for typical queries.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Over-partitioning&lt;/strong&gt; (too many small partitions) creates the small file problem: thousands of tiny files that bloat metadata and slow query planning. &lt;strong&gt;Under-partitioning&lt;/strong&gt; (too few large partitions) reduces pruning effectiveness because each partition contains too much data.&lt;/p&gt;
&lt;h2&gt;Combining Transforms&lt;/h2&gt;
&lt;p&gt;Iceberg supports multi-column partition specs:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE events (
  event_id BIGINT,
  event_time TIMESTAMP,
  user_id BIGINT,
  event_type STRING
) PARTITIONED BY (day(event_time), bucket(32, user_id))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates a two-dimensional partition space: each combination of day and user bucket is a separate partition. Queries filtering on &lt;code&gt;event_time&lt;/code&gt; get day-level pruning. Queries filtering on &lt;code&gt;user_id&lt;/code&gt; get bucket-level pruning. Queries filtering on both get pruning from both dimensions.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/&quot;&gt;Dremio&lt;/a&gt; supports all Iceberg transform functions and automatically applies pruning for any combination of partition columns in the query&apos;s WHERE clause.&lt;/p&gt;
&lt;h2&gt;Why This Matters for Teams&lt;/h2&gt;
&lt;p&gt;Hidden partitioning changes the operational model for data teams:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data engineers&lt;/strong&gt; define the partition strategy once in the table&apos;s partition spec. They can change it later through &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-04/&quot;&gt;partition evolution&lt;/a&gt; without breaking anything.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Analysts and data scientists&lt;/strong&gt; write natural SQL against the business columns they understand. They never need to know whether the table is partitioned by day, month, or bucket. Their queries are automatically optimized.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;BI tools and dashboards&lt;/strong&gt; connect to Iceberg tables and issue standard SQL. The tools do not need to understand Iceberg&apos;s partitioning; the engine handles the optimization. This is why hidden partitioning is essential for self-service analytics platforms like &lt;a href=&quot;https://www.dremio.com/platform/semantic-layer/&quot;&gt;Dremio&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The net result: no accidental full table scans, no partition-aware query patterns required from users, and the ability to change the physical layout without impacting any downstream consumer. &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-06/&quot;&gt;Part 6&lt;/a&gt; covers what happens when data is written to an Iceberg table.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Free Resources&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpageiceberg&quot;&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpagepolaris&quot;&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://forms.gle/xdsun6JiRvFY9rB36&quot;&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Inside the Query Optimizer: How Engines Pick a Plan</title><link>https://iceberglakehouse.com/posts/2026-04-29-query-engine-05/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-query-engine-05/</guid><description>
## Query Engine Optimization - Table of Contents

1. [How Query Engines Think: The Tradeoffs Behind Every Data System](/posts/2026-04-29-query-engine...</description><pubDate>Wed, 29 Apr 2026 12:04:00 GMT</pubDate><content:encoded>&lt;h2&gt;Query Engine Optimization - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-01/&quot;&gt;How Query Engines Think: The Tradeoffs Behind Every Data System&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-02/&quot;&gt;Row vs. Column: How Storage Layout Shapes Everything&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-03/&quot;&gt;How Databases Organize Data on Disk: Pages, Blocks, and File Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-04/&quot;&gt;B-Trees, LSM Trees, and the Indexing Tradeoff Spectrum&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-05/&quot;&gt;Inside the Query Optimizer: How Engines Pick a Plan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-06/&quot;&gt;Volcano, Vectorized, Compiled: How Engines Execute Your Query&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-07/&quot;&gt;Buffer Pools, Caches, and the Memory Hierarchy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-08/&quot;&gt;Partitioning, Sharding, and Data Distribution Strategies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-09/&quot;&gt;Hash, Sort-Merge, Broadcast: How Distributed Joins Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-10/&quot;&gt;Concurrency, Isolation, and MVCC: How Engines Handle Contention&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 5 of a 10-part series on query engine design. &lt;a href=&quot;/posts/2026-04-29-query-engine-04/&quot;&gt;Part 4&lt;/a&gt; covered indexing strategies. This article covers what happens after the engine parses your SQL: how the optimizer decides the fastest way to execute it.&lt;/p&gt;
&lt;p&gt;The same SQL query can be executed in hundreds of different ways. The tables can be joined in different orders. Filters can be applied early or late. Indexes can be used or ignored. The optimizer&apos;s job is to find a plan that finishes quickly without spending too much time searching for it.&lt;/p&gt;
&lt;h2&gt;From SQL to Execution Plan&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/sql-to-execution-plan.png&quot; alt=&quot;How optimizers transform SQL through logical and physical plan stages into an execution plan&quot;&gt;&lt;/p&gt;
&lt;p&gt;Every query goes through three stages:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Parse&lt;/strong&gt;: The SQL text is converted into an abstract syntax tree (AST). Syntax errors are caught here.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Logical plan&lt;/strong&gt;: The AST becomes a tree of logical operators (Scan, Filter, Join, Aggregate, Project). This plan describes &lt;em&gt;what&lt;/em&gt; to compute but not &lt;em&gt;how&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Physical plan&lt;/strong&gt;: The optimizer selects specific algorithms for each logical operator. A logical Join becomes a HashJoin or SortMergeJoin. A logical Scan becomes an IndexScan or SequentialScan. This plan describes exactly how the engine will execute the query.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The gap between the logical and physical plan is where the optimizer earns its keep. The same logical plan can produce dozens of physical plans with performance differences of 10x to 1000x.&lt;/p&gt;
&lt;h2&gt;Rule-Based Optimization: The Guaranteed Wins&lt;/h2&gt;
&lt;p&gt;Rule-based optimization (RBO) applies fixed transformation rules that always improve the plan. These are deterministic: the optimizer applies every applicable rule without evaluating alternatives.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Predicate pushdown&lt;/strong&gt; moves filter operations closer to the data source. If a query joins two tables and then filters, the optimizer pushes the filter below the join so fewer rows enter the join. This reduces the intermediate result size, which reduces memory usage, network transfer (in distributed engines), and CPU time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Projection pruning&lt;/strong&gt; removes columns that are never referenced downstream. If a table has 50 columns but the query only uses 3, the optimizer drops the other 47 from the scan operator. Combined with columnar storage, this means 94% of the data is never read.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Constant folding&lt;/strong&gt; evaluates constant expressions at planning time. &lt;code&gt;WHERE date &amp;gt; &apos;2024-01-01&apos; AND 1 = 1&lt;/code&gt; becomes &lt;code&gt;WHERE date &amp;gt; &apos;2024-01-01&apos;&lt;/code&gt; before execution starts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Predicate simplification&lt;/strong&gt; rewrites complex conditions. &lt;code&gt;WHERE x &amp;gt; 5 AND x &amp;gt; 10&lt;/code&gt; becomes &lt;code&gt;WHERE x &amp;gt; 10&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Every production query engine applies these rules: PostgreSQL, MySQL, Dremio, Snowflake, Spark, DuckDB, ClickHouse, Trino. They are cheap to compute and always beneficial.&lt;/p&gt;
&lt;h2&gt;Cost-Based Optimization: Searching for the Best Plan&lt;/h2&gt;
&lt;p&gt;Rule-based optimization handles the obvious improvements but cannot answer the hard questions: which join order is fastest? Should we use a hash join or a sort-merge join? Should we scan the index or the full table?&lt;/p&gt;
&lt;p&gt;Cost-based optimization (CBO) answers these by estimating the cost of multiple candidate plans and selecting the cheapest one.&lt;/p&gt;
&lt;h3&gt;How Cost Estimation Works&lt;/h3&gt;
&lt;p&gt;The optimizer maintains &lt;strong&gt;table statistics&lt;/strong&gt;: row counts, column cardinality (number of distinct values), value distribution histograms, null counts, and average column widths. PostgreSQL stores these in &lt;code&gt;pg_statistic&lt;/code&gt; and updates them via &lt;code&gt;ANALYZE&lt;/code&gt;. Dremio and Snowflake collect statistics automatically during query execution and table maintenance.&lt;/p&gt;
&lt;p&gt;For each candidate plan, the optimizer estimates:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cardinality&lt;/strong&gt;: How many rows will each operator produce? A filter on &lt;code&gt;status = &apos;active&apos;&lt;/code&gt; with cardinality 5 on a 1M-row table produces approximately 200K rows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost&lt;/strong&gt;: How much CPU, I/O, and memory will each operator consume? A sequential scan costs proportionally to table size. An index scan costs proportionally to result size plus index traversal.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The optimizer generates multiple plan candidates (different join orders, different join algorithms, different access methods) and picks the one with the lowest estimated total cost.&lt;/p&gt;
&lt;h3&gt;Why Join Order Matters So Much&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/join-reordering.png&quot; alt=&quot;Join reordering showing how the same three-table query can be 10x faster or slower depending on which tables are joined first&quot;&gt;&lt;/p&gt;
&lt;p&gt;For a query joining three tables (Orders: 1B rows, Products: 100K rows, Customers: 10M rows):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Bad order&lt;/strong&gt;: Join Orders with Products first. The intermediate result is up to 1B rows. Then join with Customers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Good order&lt;/strong&gt;: Join Products with Customers first. The intermediate result is at most 100K rows. Then join with Orders.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The second plan produces a 10,000x smaller intermediate result. For queries with 5-10 joins, the number of possible orderings grows factorially, and the performance gap between the best and worst order can exceed 1000x.&lt;/p&gt;
&lt;p&gt;This is why cost-based optimization exists. Rule-based optimization cannot determine join order because the &amp;quot;best&amp;quot; order depends on the actual data sizes, which require statistics.&lt;/p&gt;
&lt;h3&gt;The Cardinality Estimation Problem&lt;/h3&gt;
&lt;p&gt;CBO&apos;s Achilles&apos; heel is cardinality estimation. If the optimizer estimates that a filter produces 100 rows but it actually produces 10 million, every downstream cost estimate is wrong. The optimizer may choose a nested loop join (efficient for small inputs) when a hash join (efficient for large inputs) would have been 100x faster.&lt;/p&gt;
&lt;p&gt;Common sources of estimation error:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Correlated columns&lt;/strong&gt;: The optimizer assumes independence between predicates. &lt;code&gt;WHERE city = &apos;Seattle&apos; AND state = &apos;WA&apos;&lt;/code&gt; is estimated as if city and state are independent, drastically underestimating the result.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stale statistics&lt;/strong&gt;: If statistics were collected before a large data load, the estimates are based on the old distribution.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Complex expressions&lt;/strong&gt;: Functions, LIKE patterns, and nested subqueries are difficult to estimate accurately.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;PostgreSQL addresses this with extended statistics (CREATE STATISTICS for correlated columns). But no optimizer fully solves the estimation problem.&lt;/p&gt;
&lt;h2&gt;Adaptive Query Execution: Fixing Plans at Runtime&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/adaptive-query-execution.png&quot; alt=&quot;Adaptive query execution changing a shuffle join to a broadcast join mid-flight based on actual data sizes&quot;&gt;&lt;/p&gt;
&lt;p&gt;The most modern approach is to stop trusting planning-time estimates and adjust the plan during execution based on actual observed data sizes.&lt;/p&gt;
&lt;p&gt;Apache Spark introduced Adaptive Query Execution (AQE) in Spark 3.0. After a shuffle stage completes, AQE checks the actual size of the shuffled data. If one side of a join turns out to be small enough, AQE switches from a shuffle join to a broadcast join. If partitions are too small, AQE coalesces them to reduce overhead. If data is skewed, AQE splits the hot partition.&lt;/p&gt;
&lt;p&gt;Dremio, Snowflake, and other distributed engines use similar adaptive techniques: adjusting parallelism, switching join strategies, and coalescing small tasks based on runtime observations.&lt;/p&gt;
&lt;p&gt;The tradeoff: adaptive execution adds overhead at stage boundaries (must collect and analyze statistics) and cannot change decisions that have already been executed. It works best in distributed engines where the planning-time uncertainty is highest.&lt;/p&gt;
&lt;h2&gt;Where Real Systems Land&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Rule-Based&lt;/th&gt;
&lt;th&gt;Cost-Based&lt;/th&gt;
&lt;th&gt;Adaptive&lt;/th&gt;
&lt;th&gt;Statistics Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (advanced)&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ANALYZE&lt;/code&gt; command&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MySQL&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ANALYZE TABLE&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dremio&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Automatic collection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snowflake&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spark&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (Catalyst)&lt;/td&gt;
&lt;td&gt;Yes (AQE)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ANALYZE TABLE&lt;/code&gt;, runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DuckDB&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Automatic sampling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ClickHouse&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Per-part statistics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trino&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Connector statistics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The pattern: simpler engines (ClickHouse) prioritize fast planning with rule-based optimization. Complex distributed engines (Spark, Dremio, Snowflake) invest heavily in cost-based and adaptive optimization because the performance stakes are higher when queries span multiple nodes and terabytes of data.&lt;/p&gt;
&lt;h2&gt;The Meta-Tradeoff: Planning Time vs. Execution Time&lt;/h2&gt;
&lt;p&gt;There is a cost to optimization itself. Exploring thousands of join orderings for a 10-table query takes time. For a simple point lookup that runs in milliseconds, spending 100ms on optimization is wasteful. For a complex analytical query that runs for minutes, spending 5 seconds on optimization to find a 10x better plan is a bargain.&lt;/p&gt;
&lt;p&gt;Most engines set timeouts on the optimization search. If the optimizer has not found a better plan within a time budget, it stops and uses the best plan found so far. This means complex queries with many joins sometimes get suboptimal plans because the search space was too large to explore fully.&lt;/p&gt;
&lt;p&gt;The optimizer is always making a bet: invest time now to save time later. How much to invest depends on how long the query is expected to run.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Partition Evolution: Change Your Partitioning Without Rewriting Data</title><link>https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-04/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-04/</guid><description>
## Apache Iceberg Masterclass - Table of Contents

1. [What Are Table Formats and Why Were They Needed?](/posts/2026-04-29-iceberg-masterclass-01/)
2...</description><pubDate>Wed, 29 Apr 2026 12:03:00 GMT</pubDate><content:encoded>&lt;h2&gt;Apache Iceberg Masterclass - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-02/&quot;&gt;The Metadata Structure of Modern Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-03/&quot;&gt;Performance and Apache Iceberg&apos;s Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-04/&quot;&gt;Partition Evolution: Change Your Partitioning Without Rewriting Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-05/&quot;&gt;Hidden Partitioning: How Iceberg Eliminates Accidental Full Table Scans&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-06/&quot;&gt;Writing to an Apache Iceberg Table: How Commits and ACID Actually Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-07/&quot;&gt;What Are Lakehouse Catalogs? The Role of Catalogs in Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-08/&quot;&gt;When Catalogs Are Embedded in Storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-09/&quot;&gt;How Data Lake Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;Maintaining Apache Iceberg Tables: Compaction, Expiry, and Cleanup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-11/&quot;&gt;Apache Iceberg Metadata Tables: Querying the Internals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-12/&quot;&gt;Using Apache Iceberg with Python and MPP Query Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-13/&quot;&gt;Approaches to Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-14/&quot;&gt;Hands-On with Apache Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-15/&quot;&gt;Migrating to Apache Iceberg: Strategies for Every Source System&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 4 of a 15-part &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-03/&quot;&gt;Part 3&lt;/a&gt; covered metadata-driven performance. This article explains how Iceberg handles the problem that has plagued data lakes for over a decade: what happens when your partition strategy needs to change.&lt;/p&gt;
&lt;p&gt;Partitioning determines how data is physically organized in storage, and it is the single most impactful factor for query performance on large tables. Get it right and queries skip 95% of the data. Get it wrong and every query scans everything. The problem is that requirements change, data volumes grow, and the partition strategy that worked last year becomes a bottleneck this year.&lt;/p&gt;
&lt;h2&gt;The Hive Problem: Partitioning Is Permanent&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/hive-partition-directories.png&quot; alt=&quot;Hive-style directory-based partitioning with its three core problems&quot;&gt;&lt;/p&gt;
&lt;p&gt;In Hive and other traditional data lake systems, partitions are physical directories. A table partitioned by year and month has a directory structure like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3://warehouse/orders/year=2023/month=01/part-0000.parquet
s3://warehouse/orders/year=2023/month=02/part-0000.parquet
...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This design has three fundamental problems:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Changing partitions requires rewriting all data.&lt;/strong&gt; If a table is partitioned by month and you need daily partitions (because data volume grew and monthly partitions are now too large for efficient queries), you must read every file, re-partition it, and write it back. For a petabyte table, this means a petabyte of compute and I/O, hours of processing, and downtime for consumers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Users must know the physical layout.&lt;/strong&gt; Queries must explicitly reference partition columns using the exact partition column names: &lt;code&gt;WHERE year = 2024 AND month = 3&lt;/code&gt;. If a user writes &lt;code&gt;WHERE order_date = &apos;2024-03-15&apos;&lt;/code&gt;, Hive does not recognize that &lt;code&gt;order_date&lt;/code&gt; maps to &lt;code&gt;year = 2024, month = 3&lt;/code&gt;, and it scans the entire table. This creates a constant burden on users to understand and correctly use the physical layout.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Wrong filters produce silent full scans.&lt;/strong&gt; There is no error, no warning. The query runs, it just reads every partition. Teams discover the problem only when they notice query times are 50x slower than expected.&lt;/p&gt;
&lt;h2&gt;How Iceberg Solves This&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/iceberg-partition-evolution.png&quot; alt=&quot;Iceberg partition evolution showing how old and new partition specs coexist without rewriting data&quot;&gt;&lt;/p&gt;
&lt;p&gt;Iceberg separates the logical partition specification from the physical data layout through two mechanisms: hidden partitioning (covered in &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-05/&quot;&gt;Part 5&lt;/a&gt;) and partition evolution.&lt;/p&gt;
&lt;h3&gt;The Partition Spec&lt;/h3&gt;
&lt;p&gt;Every Iceberg table has a &lt;a href=&quot;https://iceberg.apache.org/spec/#partitioning&quot;&gt;partition spec&lt;/a&gt; that defines how source columns map to partition values. The spec does not create directories. Instead, it records partition values as metadata in manifest entries alongside each data file.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Create a table partitioned by month
CREATE TABLE orders (
  order_id BIGINT,
  order_date DATE,
  amount DECIMAL(10,2),
  status STRING
) PARTITIONED BY (month(order_date))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When data is written, the engine computes the partition value (&lt;code&gt;month(&apos;2024-03-15&apos;)&lt;/code&gt; = &lt;code&gt;2024-03&lt;/code&gt;) and stores it in the manifest entry for that file. The file itself can live at any path; there is no requirement for a &lt;code&gt;month=2024-03/&lt;/code&gt; directory.&lt;/p&gt;
&lt;h3&gt;Evolving the Spec&lt;/h3&gt;
&lt;p&gt;When data volume grows and monthly partitions become too coarse, you change the spec:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE orders SET PARTITION SPEC (day(order_date))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is a metadata-only operation. It takes milliseconds. No data is read or rewritten. What happens internally:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The current partition spec (Spec 0: &lt;code&gt;month(order_date)&lt;/code&gt;) is preserved in the table&apos;s metadata history.&lt;/li&gt;
&lt;li&gt;A new partition spec (Spec 1: &lt;code&gt;day(order_date)&lt;/code&gt;) is set as the active spec.&lt;/li&gt;
&lt;li&gt;All existing data files retain their Spec 0 partition values in their manifest entries.&lt;/li&gt;
&lt;li&gt;All new data written to the table uses Spec 1.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The table now contains files with two different partition specs. This is not a broken state. It is the designed behavior.&lt;/p&gt;
&lt;h2&gt;How Query Planning Handles Multiple Specs&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/multi-spec-query-planning.png&quot; alt=&quot;How engines resolve queries across multiple partition specs by evaluating each independently&quot;&gt;&lt;/p&gt;
&lt;p&gt;When a query filters on &lt;code&gt;order_date&lt;/code&gt;, the engine must correctly prune files regardless of which spec they were written under. Here is the process:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM orders
WHERE order_date BETWEEN &apos;2023-12-01&apos; AND &apos;2024-01-31&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;For Spec 0 files (monthly)&lt;/strong&gt;: The engine translates the date range into month values: &lt;code&gt;2023-12&lt;/code&gt; and &lt;code&gt;2024-01&lt;/code&gt;. It checks manifest entries with Spec 0 partition values and keeps files where the month partition is either &lt;code&gt;2023-12&lt;/code&gt; or &lt;code&gt;2024-01&lt;/code&gt;. All other months are skipped.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;For Spec 1 files (daily)&lt;/strong&gt;: The engine translates the date range into day values: &lt;code&gt;2024-01-01&lt;/code&gt; through &lt;code&gt;2024-01-31&lt;/code&gt;. It checks manifest entries with Spec 1 partition values and keeps files where the day partition falls within that range. All other days are skipped.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Both old and new files are correctly pruned using their respective specs. The query returns accurate results from files written under different partition strategies, without the user knowing or caring about the spec history.&lt;/p&gt;
&lt;h2&gt;Real-World Scenarios&lt;/h2&gt;
&lt;h3&gt;Growing From Monthly to Daily&lt;/h3&gt;
&lt;p&gt;The most common evolution. A startup begins with monthly partitions when data volume is 10 GB/month. Two years later, data volume is 500 GB/month and monthly partitions produce files too large for efficient processing. Evolving to daily partitions makes new data more granular while old data remains accessible.&lt;/p&gt;
&lt;h3&gt;Adding a Partition Column&lt;/h3&gt;
&lt;p&gt;A table partitioned only by date starts receiving queries that heavily filter by region. Adding a partition on region (using &lt;code&gt;bucket(16, region)&lt;/code&gt;) improves pruning for those queries:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE orders SET PARTITION SPEC (day(order_date), bucket(16, region))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Old files remain partitioned by date only. New files are partitioned by both date and region. Queries that filter on date work correctly for both old and new files. Queries that filter on region get pruning benefits only for new files.&lt;/p&gt;
&lt;h3&gt;Removing a Partition Column&lt;/h3&gt;
&lt;p&gt;If a partition column becomes irrelevant (e.g., a geographic region is no longer used for filtering), you can evolve the spec to remove it. Old files keep their partition values, but new files are no longer organized by that column. &lt;a href=&quot;https://www.dremio.com/blog/fewer-accidental-full-table-scans-brought-to-you-by-apache-icebergs-hidden-partitioning/&quot;&gt;Dremio&lt;/a&gt; and other engines handle this transparently during query planning.&lt;/p&gt;
&lt;h2&gt;What About the Old Data?&lt;/h2&gt;
&lt;p&gt;After a partition evolution, old data continues to work correctly but may have suboptimal organization. The old monthly files are coarser than the new daily files, meaning queries against historical data scan larger files than necessary.&lt;/p&gt;
&lt;p&gt;Two options:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Leave it alone.&lt;/strong&gt; If historical data is queried infrequently, the cost of less-optimal pruning is minimal. This is the zero-effort approach.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Compact old data.&lt;/strong&gt; Run a &lt;a href=&quot;https://www.dremio.com/blog/compaction-in-apache-iceberg-fine-tuning-your-iceberg-tables-data-files/&quot;&gt;compaction job&lt;/a&gt; that rewrites old files under the new spec. This produces daily-partitioned files for the historical data too, but requires compute resources. Dremio&apos;s &lt;a href=&quot;https://www.dremio.com/blog/table-optimization-in-dremio/&quot;&gt;automatic table optimization&lt;/a&gt; can handle this for tables managed by Open Catalog.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;How Other Formats Handle This&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Partition Change Approach&lt;/th&gt;
&lt;th&gt;Data Rewrite?&lt;/th&gt;
&lt;th&gt;Multiple Specs?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Iceberg&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metadata-only spec evolution&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes, coexist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Delta Lake&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Liquid Clustering (adaptive)&lt;/td&gt;
&lt;td&gt;Background rewrite&lt;/td&gt;
&lt;td&gt;N/A (clustering-based)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hudi&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Re-partition with full rewrite&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full table rewrite&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Delta Lake&apos;s Liquid Clustering is a different solution to the same problem. Instead of static partitions, it uses adaptive clustering that reorganizes data in the background. The tradeoff: Liquid Clustering requires ongoing background compute, while Iceberg&apos;s partition evolution is a one-time metadata operation with optional follow-up compaction.&lt;/p&gt;
&lt;p&gt;Partition evolution is one of the features that makes Iceberg a safe long-term choice. It means the partitioning decision you make today is not permanent. &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-05/&quot;&gt;Part 5&lt;/a&gt; covers hidden partitioning, the other half of Iceberg&apos;s partitioning story.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Free Resources&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpageiceberg&quot;&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpagepolaris&quot;&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://forms.gle/xdsun6JiRvFY9rB36&quot;&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>B-Trees, LSM Trees, and the Indexing Tradeoff Spectrum</title><link>https://iceberglakehouse.com/posts/2026-04-29-query-engine-04/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-query-engine-04/</guid><description>
## Query Engine Optimization - Table of Contents

1. [How Query Engines Think: The Tradeoffs Behind Every Data System](/posts/2026-04-29-query-engine...</description><pubDate>Wed, 29 Apr 2026 12:03:00 GMT</pubDate><content:encoded>&lt;h2&gt;Query Engine Optimization - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-01/&quot;&gt;How Query Engines Think: The Tradeoffs Behind Every Data System&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-02/&quot;&gt;Row vs. Column: How Storage Layout Shapes Everything&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-03/&quot;&gt;How Databases Organize Data on Disk: Pages, Blocks, and File Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-04/&quot;&gt;B-Trees, LSM Trees, and the Indexing Tradeoff Spectrum&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-05/&quot;&gt;Inside the Query Optimizer: How Engines Pick a Plan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-06/&quot;&gt;Volcano, Vectorized, Compiled: How Engines Execute Your Query&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-07/&quot;&gt;Buffer Pools, Caches, and the Memory Hierarchy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-08/&quot;&gt;Partitioning, Sharding, and Data Distribution Strategies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-09/&quot;&gt;Hash, Sort-Merge, Broadcast: How Distributed Joins Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-10/&quot;&gt;Concurrency, Isolation, and MVCC: How Engines Handle Contention&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 4 of a 10-part series on query engine design. &lt;a href=&quot;/posts/2026-04-29-query-engine-03/&quot;&gt;Part 3&lt;/a&gt; covered how data is structured within files. This article covers the auxiliary data structures that make lookups fast: indexes.&lt;/p&gt;
&lt;p&gt;Every index exists to answer the same question faster: &amp;quot;where is the data I need?&amp;quot; The fundamental tradeoff is universal: every index speeds up reads and slows down writes, because every insert, update, or delete must also update every index on the table.&lt;/p&gt;
&lt;h2&gt;B-Trees: The OLTP Standard&lt;/h2&gt;
&lt;p&gt;The B-tree is the most widely deployed index structure in production databases. PostgreSQL, MySQL, Oracle, SQL Server, SQLite, and CockroachDB all default to B-tree indexes.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/btree-structure.png&quot; alt=&quot;B-tree structure showing root, internal, and leaf nodes with point lookup and range scan paths&quot;&gt;&lt;/p&gt;
&lt;p&gt;A B-tree is a balanced tree where each node contains sorted keys and pointers. The tree stays balanced because splits and merges propagate upward when nodes get too full or too empty.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Point lookups&lt;/strong&gt; traverse from root to leaf: O(log n) comparisons. For a table with a billion rows and a branching factor of 100, that is approximately 5 node reads. If the upper levels are cached in memory (they usually are), a point lookup hits disk once.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Range scans&lt;/strong&gt; find the starting leaf and follow horizontal pointers across adjacent leaves. The scan is sequential I/O, which is the fastest access pattern on both SSDs and HDDs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Writes&lt;/strong&gt; are where B-trees pay their cost. Inserting a key may trigger a node split that propagates up the tree. Updates are in-place random writes. Under heavy write loads, fragmentation accumulates and periodic rebuilding or vacuuming is needed. PostgreSQL&apos;s VACUUM process exists specifically to reclaim space from B-tree bloat.&lt;/p&gt;
&lt;h2&gt;LSM Trees: Built for Write Throughput&lt;/h2&gt;
&lt;p&gt;When write volume overwhelms B-tree performance, LSM trees offer an alternative that converts random writes into sequential writes.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/lsm-tree-write-path.png&quot; alt=&quot;LSM tree write path from memtable through WAL flush to SSTables across multiple levels with compaction&quot;&gt;&lt;/p&gt;
&lt;p&gt;The architecture has three layers:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Memtable&lt;/strong&gt;: An in-memory sorted structure (typically a skip list or red-black tree). All writes go here first. A Write-Ahead Log (WAL) on disk ensures durability if the process crashes before the memtable is flushed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SSTables&lt;/strong&gt;: When the memtable fills up, it is flushed to disk as an immutable sorted file. These files are never modified after creation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compaction&lt;/strong&gt;: Background processes merge SSTables from the same level into larger files at the next level, removing duplicate keys and tombstones (deletion markers).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;RocksDB (used as the storage engine inside CockroachDB, TiDB, and many others), LevelDB, Cassandra, HBase, and ScyllaDB all use LSM trees.&lt;/p&gt;
&lt;p&gt;The write advantage is dramatic: because the memtable buffers writes in memory and flushes them sequentially, the disk sees only large sequential writes instead of random scattered writes. For write-heavy workloads (event logging, time-series data, IoT telemetry), LSM trees handle 5-10x more writes per second than B-trees on the same hardware.&lt;/p&gt;
&lt;p&gt;The read tradeoff: a point lookup may need to check the memtable plus multiple levels of SSTables. A key that was written long ago could live in the deepest level, requiring reads across several files. Bloom filters mitigate this: a compact probabilistic structure attached to each SSTable answers &amp;quot;is this key definitely not in this file?&amp;quot; with no false negatives, allowing the engine to skip files without reading them.&lt;/p&gt;
&lt;h2&gt;Bitmap Indexes: OLAP Filtering&lt;/h2&gt;
&lt;p&gt;Bitmap indexes take a different approach entirely. For each distinct value in a column, the index stores a bit vector where each bit represents a row. A 1 means the row has that value. A 0 means it does not.&lt;/p&gt;
&lt;p&gt;For a &lt;code&gt;status&lt;/code&gt; column with three values (&lt;code&gt;active&lt;/code&gt;, &lt;code&gt;pending&lt;/code&gt;, &lt;code&gt;closed&lt;/code&gt;), the index stores three bit vectors, each with one bit per row. Filtering &lt;code&gt;WHERE status = &apos;active&apos; AND region = &apos;US&apos;&lt;/code&gt; becomes a bitwise AND between two bit vectors, which modern CPUs execute in nanoseconds.&lt;/p&gt;
&lt;p&gt;Bitmap indexes are excellent for low-cardinality columns (few distinct values) in read-heavy OLAP workloads. Oracle&apos;s data warehouse features and some specialized OLAP engines use them.&lt;/p&gt;
&lt;p&gt;The write tradeoff is severe: updating a single row in a bitmap index requires locking and modifying the entire bit segment. Under concurrent writes, this creates contention that kills throughput. Bitmap indexes are effectively read-only structures that get rebuilt during batch loads.&lt;/p&gt;
&lt;h2&gt;Zone Maps and Min/Max Indexes&lt;/h2&gt;
&lt;p&gt;Columnar engines like Dremio, Snowflake, ClickHouse, DuckDB, and Spark do not typically use traditional indexes at all. Instead, they rely on zone maps: per-block metadata storing the minimum and maximum value for each column.&lt;/p&gt;
&lt;p&gt;When a query filters &lt;code&gt;WHERE order_date &amp;gt; &apos;2024-06-01&apos;&lt;/code&gt;, the engine checks each block&apos;s max &lt;code&gt;order_date&lt;/code&gt;. Any block where the max is before June 2024 is skipped entirely. No tree traversal, no separate index structure, just a few bytes of metadata per block.&lt;/p&gt;
&lt;p&gt;Zone maps are &amp;quot;almost free&amp;quot; to maintain because the min/max values are computed during the write process with negligible overhead. The tradeoff: they only help with range predicates, and they are useless if the data within each block is randomly ordered (the min and max span the entire value range, so nothing gets skipped). This is why columnar engines often sort or cluster data by frequently filtered columns.&lt;/p&gt;
&lt;p&gt;Dremio automates this through its clustering table maintenance, and Iceberg&apos;s manifest files store per-file column statistics that enable file-level pruning before any data files are opened.&lt;/p&gt;
&lt;h2&gt;Inverted Indexes: Full-Text Search&lt;/h2&gt;
&lt;p&gt;Elasticsearch, Apache Lucene, and Solr use inverted indexes: a mapping from each term to the list of documents containing it. Searching for &amp;quot;query engine optimization&amp;quot; finds the intersection of the posting lists for &amp;quot;query,&amp;quot; &amp;quot;engine,&amp;quot; and &amp;quot;optimization.&amp;quot;&lt;/p&gt;
&lt;p&gt;Inverted indexes are the reason text search engines return results in milliseconds across billions of documents. They are highly specialized and not used for general-purpose relational queries.&lt;/p&gt;
&lt;h2&gt;The Tradeoff Matrix&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/index-tradeoff-matrix.png&quot; alt=&quot;Index type tradeoff matrix comparing read speed, write cost, memory cost, and best use case for B-trees, LSM trees, bitmap indexes, bloom filters, and zone maps&quot;&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Index Type&lt;/th&gt;
&lt;th&gt;Read Speed&lt;/th&gt;
&lt;th&gt;Write Cost&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;B-tree&lt;/td&gt;
&lt;td&gt;O(log n) point + range&lt;/td&gt;
&lt;td&gt;Moderate (in-place, splits)&lt;/td&gt;
&lt;td&gt;OLTP mixed workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LSM tree&lt;/td&gt;
&lt;td&gt;Moderate (multi-level search)&lt;/td&gt;
&lt;td&gt;Low (sequential flushes)&lt;/td&gt;
&lt;td&gt;Write-heavy workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bitmap&lt;/td&gt;
&lt;td&gt;Excellent for boolean filters&lt;/td&gt;
&lt;td&gt;Very high (locking, rebuild)&lt;/td&gt;
&lt;td&gt;Low-cardinality OLAP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bloom filter&lt;/td&gt;
&lt;td&gt;Fast membership test&lt;/td&gt;
&lt;td&gt;Low (hash at write time)&lt;/td&gt;
&lt;td&gt;Reducing LSM read amplification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zone map&lt;/td&gt;
&lt;td&gt;Fast range pruning&lt;/td&gt;
&lt;td&gt;Very low (compute at write)&lt;/td&gt;
&lt;td&gt;Columnar scan skipping&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inverted index&lt;/td&gt;
&lt;td&gt;Fast term lookup&lt;/td&gt;
&lt;td&gt;Moderate (posting list updates)&lt;/td&gt;
&lt;td&gt;Full-text search&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Where Real Systems Land&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Primary Index&lt;/th&gt;
&lt;th&gt;Secondary Indexes&lt;/th&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;B-tree&lt;/td&gt;
&lt;td&gt;GIN, GiST, BRIN, hash&lt;/td&gt;
&lt;td&gt;OLTP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MySQL/InnoDB&lt;/td&gt;
&lt;td&gt;B-tree (clustered)&lt;/td&gt;
&lt;td&gt;Secondary B-trees&lt;/td&gt;
&lt;td&gt;OLTP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RocksDB&lt;/td&gt;
&lt;td&gt;LSM tree&lt;/td&gt;
&lt;td&gt;Bloom filters&lt;/td&gt;
&lt;td&gt;Write-heavy storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cassandra&lt;/td&gt;
&lt;td&gt;LSM tree + partition index&lt;/td&gt;
&lt;td&gt;Materialized views, SAI&lt;/td&gt;
&lt;td&gt;Write-heavy distributed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ClickHouse&lt;/td&gt;
&lt;td&gt;Sparse primary index + zone maps&lt;/td&gt;
&lt;td&gt;Data skipping indexes&lt;/td&gt;
&lt;td&gt;Real-time OLAP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DuckDB&lt;/td&gt;
&lt;td&gt;Zone maps&lt;/td&gt;
&lt;td&gt;ART indexes (adaptive)&lt;/td&gt;
&lt;td&gt;Embedded OLAP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snowflake&lt;/td&gt;
&lt;td&gt;Zone maps (micro-partition pruning)&lt;/td&gt;
&lt;td&gt;None (scan-based)&lt;/td&gt;
&lt;td&gt;Cloud OLAP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dremio&lt;/td&gt;
&lt;td&gt;Zone maps + Iceberg manifest stats&lt;/td&gt;
&lt;td&gt;Bloom filter pruning&lt;/td&gt;
&lt;td&gt;Lakehouse OLAP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Elasticsearch&lt;/td&gt;
&lt;td&gt;Inverted index&lt;/td&gt;
&lt;td&gt;Doc values (columnar)&lt;/td&gt;
&lt;td&gt;Full-text search&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The pattern is clear: OLTP systems invest in B-trees for balanced read/write. Write-heavy systems use LSM trees. Analytical systems minimize index overhead with zone maps and rely on columnar layout for scan efficiency.&lt;/p&gt;
&lt;p&gt;No single indexing strategy works for all workloads. The right choice depends on whether your bottleneck is read latency, write throughput, or scan efficiency.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Performance and Apache Iceberg&apos;s Metadata</title><link>https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-03/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-03/</guid><description>
## Apache Iceberg Masterclass - Table of Contents

1. [What Are Table Formats and Why Were They Needed?](/posts/2026-04-29-iceberg-masterclass-01/)
2...</description><pubDate>Wed, 29 Apr 2026 12:02:00 GMT</pubDate><content:encoded>&lt;h2&gt;Apache Iceberg Masterclass - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-02/&quot;&gt;The Metadata Structure of Modern Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-03/&quot;&gt;Performance and Apache Iceberg&apos;s Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-04/&quot;&gt;Partition Evolution: Change Your Partitioning Without Rewriting Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-05/&quot;&gt;Hidden Partitioning: How Iceberg Eliminates Accidental Full Table Scans&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-06/&quot;&gt;Writing to an Apache Iceberg Table: How Commits and ACID Actually Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-07/&quot;&gt;What Are Lakehouse Catalogs? The Role of Catalogs in Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-08/&quot;&gt;When Catalogs Are Embedded in Storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-09/&quot;&gt;How Data Lake Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;Maintaining Apache Iceberg Tables: Compaction, Expiry, and Cleanup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-11/&quot;&gt;Apache Iceberg Metadata Tables: Querying the Internals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-12/&quot;&gt;Using Apache Iceberg with Python and MPP Query Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-13/&quot;&gt;Approaches to Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-14/&quot;&gt;Hands-On with Apache Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-15/&quot;&gt;Migrating to Apache Iceberg: Strategies for Every Source System&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 3 of a 15-part &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-02/&quot;&gt;Part 2&lt;/a&gt; covered the metadata structures of all five table formats. This article focuses on exactly how query engines use Iceberg&apos;s metadata to avoid reading data they don&apos;t need.&lt;/p&gt;
&lt;p&gt;The single biggest performance advantage of Iceberg over raw data lakes is not a clever algorithm or a faster codec. It is metadata-driven data skipping. By the time a query engine begins scanning actual Parquet files, Iceberg&apos;s metadata has already eliminated 90-99% of the files from consideration. Understanding this process explains why Iceberg tables with billions of rows can return query results in seconds.&lt;/p&gt;
&lt;h2&gt;The Scan Planning Pipeline&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/scan-planning-cascade.png&quot; alt=&quot;Iceberg scan planning cascade showing how metadata progressively eliminates files at each stage&quot;&gt;&lt;/p&gt;
&lt;p&gt;When a query engine like &lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-metadata-for-performance/&quot;&gt;Dremio&lt;/a&gt;, Spark, or Trino receives a query against an Iceberg table, it executes a four-stage planning pipeline before reading any data:&lt;/p&gt;
&lt;h3&gt;Stage 1: Snapshot Resolution&lt;/h3&gt;
&lt;p&gt;The engine contacts the catalog to get the current metadata file location. It reads &lt;code&gt;metadata.json&lt;/code&gt; and identifies the current snapshot. This tells the engine which manifest list represents the table&apos;s current state.&lt;/p&gt;
&lt;p&gt;If the query includes a time travel clause (&lt;code&gt;AS OF TIMESTAMP &apos;2024-03-01&apos;&lt;/code&gt;), the engine scans the snapshot list in &lt;code&gt;metadata.json&lt;/code&gt; to find the snapshot that was current at that timestamp. This is a metadata-only operation; no data files are touched.&lt;/p&gt;
&lt;h3&gt;Stage 2: Manifest List Pruning&lt;/h3&gt;
&lt;p&gt;The manifest list contains one entry per manifest file. Each entry includes partition-level summary statistics: the minimum and maximum values of the partition columns across all data files tracked by that manifest.&lt;/p&gt;
&lt;p&gt;The engine evaluates query predicates against these summaries. If a query filters on &lt;code&gt;order_date = &apos;2024-03-15&apos;&lt;/code&gt; and a manifest&apos;s partition summary shows its date range is &lt;code&gt;2024-01 to 2024-02&lt;/code&gt;, that entire manifest is skipped. This single check can eliminate hundreds of manifest files and the thousands of data files they reference.&lt;/p&gt;
&lt;h3&gt;Stage 3: Manifest File Pruning (File Skipping)&lt;/h3&gt;
&lt;p&gt;For each surviving manifest, the engine reads the individual file entries. Each entry contains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;File path and size&lt;/li&gt;
&lt;li&gt;Row count&lt;/li&gt;
&lt;li&gt;Partition values for this specific file&lt;/li&gt;
&lt;li&gt;Column-level min/max values for each column&lt;/li&gt;
&lt;li&gt;Null counts and NaN counts per column&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The engine evaluates query predicates against these per-file statistics. A query filtering on &lt;code&gt;amount &amp;gt; 500&lt;/code&gt; can skip every file whose &lt;code&gt;amount&lt;/code&gt; column has a maximum value below 500. A query filtering on &lt;code&gt;status = &apos;shipped&apos;&lt;/code&gt; can skip files where the min and max of the &lt;code&gt;status&lt;/code&gt; column are both &lt;code&gt;&apos;pending&apos;&lt;/code&gt; (alphabetically before &apos;shipped&apos; in some encodings, though string pruning depends on sort order).&lt;/p&gt;
&lt;h3&gt;Stage 4: Parquet Internal Pruning&lt;/h3&gt;
&lt;p&gt;After Iceberg&apos;s metadata has identified the relevant files, the engine reads each Parquet file&apos;s footer. Parquet stores its own row-group-level min/max statistics. The engine can skip individual row groups within a file if their statistics exclude the query&apos;s filter values.&lt;/p&gt;
&lt;p&gt;If bloom filters are configured (available in Iceberg v2+), the engine can also check probabilistic membership tests for equality filters, skipping row groups where the bloom filter says the value definitely does not exist.&lt;/p&gt;
&lt;h2&gt;A Concrete Example&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/pruning-example.png&quot; alt=&quot;Three layers of data skipping showing partition pruning, file pruning, and the final result&quot;&gt;&lt;/p&gt;
&lt;p&gt;Consider a table &lt;code&gt;orders&lt;/code&gt; partitioned by month with 12 months of data, 20 files per month (240 total files):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM orders
WHERE order_date = &apos;2024-03-15&apos;
  AND amount &amp;gt; 500
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Manifest list pruning&lt;/strong&gt;: The engine checks partition summaries. 11 of 12 monthly manifests have date ranges that do not include March 2024. They are skipped. Only the March manifest is read.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;File pruning&lt;/strong&gt;: The March manifest contains 20 file entries. The engine checks each file&apos;s &lt;code&gt;amount&lt;/code&gt; column statistics. 15 files have &lt;code&gt;max(amount) &amp;lt; 500&lt;/code&gt;, so they cannot contain any rows matching &lt;code&gt;amount &amp;gt; 500&lt;/code&gt;. They are skipped. 5 files remain.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: 5 out of 240 files are scanned. The engine eliminated 98% of I/O through metadata alone.&lt;/p&gt;
&lt;h2&gt;What Makes Statistics Effective&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/per-file-statistics.png&quot; alt=&quot;Per-file statistics tracked in Iceberg manifest entries&quot;&gt;&lt;/p&gt;
&lt;p&gt;The effectiveness of file skipping depends entirely on how tight the min/max ranges are per file. Two factors determine this:&lt;/p&gt;
&lt;h3&gt;Sort Order&lt;/h3&gt;
&lt;p&gt;If the &lt;code&gt;amount&lt;/code&gt; column is sorted within each file (or approximately sorted through &lt;a href=&quot;https://www.dremio.com/blog/compaction-in-apache-iceberg-fine-tuning-your-iceberg-tables-data-files/&quot;&gt;clustering&lt;/a&gt;), each file contains a narrow range of values. File 1 might have &lt;code&gt;amount&lt;/code&gt; from 10 to 200, File 2 from 200 to 400, and so on. A filter on &lt;code&gt;amount &amp;gt; 500&lt;/code&gt; can skip the first several files completely.&lt;/p&gt;
&lt;p&gt;If the column is randomly distributed, every file has a range of roughly &lt;code&gt;min(amount)&lt;/code&gt; to &lt;code&gt;max(amount)&lt;/code&gt; across the entire dataset. No file can be skipped because every file&apos;s range overlaps every filter. Sort order turns file skipping from theoretical to practical.&lt;/p&gt;
&lt;p&gt;Iceberg supports declaring a &lt;a href=&quot;https://iceberg.apache.org/spec/#sorting&quot;&gt;sort order&lt;/a&gt; at the table level. When engines compact data (rewrite files), they can apply this sort order to produce files with tight column ranges. Dremio&apos;s &lt;a href=&quot;https://www.dremio.com/blog/table-optimization-in-dremio/&quot;&gt;automatic table optimization&lt;/a&gt; handles this without manual intervention for tables managed by Open Catalog.&lt;/p&gt;
&lt;h3&gt;File Size and Count&lt;/h3&gt;
&lt;p&gt;Smaller files mean tighter statistics per file but more manifest entries to manage. Larger files reduce metadata overhead but produce wider min/max ranges (less effective pruning). The typical recommendation is 128 MB to 512 MB per file for analytical workloads.&lt;/p&gt;
&lt;p&gt;Too many small files (the &amp;quot;small file problem&amp;quot;) bloat manifests and slow down planning. Regular &lt;a href=&quot;https://www.dremio.com/blog/compaction-in-apache-iceberg-fine-tuning-your-iceberg-tables-data-files/&quot;&gt;compaction&lt;/a&gt; merges small files into optimally-sized ones while preserving or improving sort order.&lt;/p&gt;
&lt;h2&gt;Beyond Min/Max: Other Statistics&lt;/h2&gt;
&lt;p&gt;Iceberg&apos;s spec supports several statistical measures per column per file:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Statistic&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Pruning Power&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Min/Max values&lt;/td&gt;
&lt;td&gt;Range-based filtering&lt;/td&gt;
&lt;td&gt;High (if sorted)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Null count&lt;/td&gt;
&lt;td&gt;&lt;code&gt;IS NOT NULL&lt;/code&gt; filters&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NaN count&lt;/td&gt;
&lt;td&gt;Float NaN filtering&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Value count&lt;/td&gt;
&lt;td&gt;Row count estimation&lt;/td&gt;
&lt;td&gt;Used by optimizer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distinct count&lt;/td&gt;
&lt;td&gt;Cardinality estimation&lt;/td&gt;
&lt;td&gt;Used by cost-based optimizer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Engines like &lt;a href=&quot;https://www.dremio.com/platform/reflections/&quot;&gt;Dremio&lt;/a&gt; and Spark use the value counts and distinct counts for cost-based optimization decisions (choosing join strategies, selecting scan parallelism) even when they do not directly prune files.&lt;/p&gt;
&lt;h2&gt;Metadata Caching&lt;/h2&gt;
&lt;p&gt;Reading metadata from object storage on every query adds latency. Production engines cache metadata aggressively:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Metadata file cache&lt;/strong&gt;: The &lt;code&gt;metadata.json&lt;/code&gt; and manifest list are typically cached in memory. They change only when the table is updated.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manifest cache&lt;/strong&gt;: Manifest files are immutable (they are never modified, only replaced). Once read, they can be cached indefinitely until they are no longer referenced by any snapshot.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parquet footer cache&lt;/strong&gt;: Since Parquet files are immutable, their footers (which contain row-group statistics and schema) can be cached permanently.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dremio&apos;s &lt;a href=&quot;https://www.dremio.com/platform/reflections/&quot;&gt;Columnar Cloud Cache (C3)&lt;/a&gt; caches both metadata and data on local NVMe drives at executor nodes, turning cloud storage latency into local-disk speed for frequently-accessed tables.&lt;/p&gt;
&lt;h2&gt;When Metadata Is Not Enough&lt;/h2&gt;
&lt;p&gt;Metadata-driven pruning has limits. If a filter column is not in the partition spec and the data is not sorted by that column, min/max ranges will overlap across all files and no pruning occurs. In these cases:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Add the column to the sort order&lt;/strong&gt; and compact the table. This is the most effective fix.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consider partition evolution&lt;/strong&gt; (covered in &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-04/&quot;&gt;Part 4&lt;/a&gt;) to add a partition transform on the column.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enable bloom filters&lt;/strong&gt; for high-cardinality equality filters like user IDs or transaction IDs.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The metadata is only as good as the physical organization of the data. Well-organized tables skip 95%+ of I/O. Poorly organized tables with random data distribution skip nothing, and the metadata overhead becomes pure cost with no benefit.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Free Resources&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpageiceberg&quot;&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpagepolaris&quot;&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://forms.gle/xdsun6JiRvFY9rB36&quot;&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How Databases Organize Data on Disk: Pages, Blocks, and File Formats</title><link>https://iceberglakehouse.com/posts/2026-04-29-query-engine-03/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-query-engine-03/</guid><description>
## Query Engine Optimization - Table of Contents

1. [How Query Engines Think: The Tradeoffs Behind Every Data System](/posts/2026-04-29-query-engine...</description><pubDate>Wed, 29 Apr 2026 12:02:00 GMT</pubDate><content:encoded>&lt;h2&gt;Query Engine Optimization - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-01/&quot;&gt;How Query Engines Think: The Tradeoffs Behind Every Data System&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-02/&quot;&gt;Row vs. Column: How Storage Layout Shapes Everything&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-03/&quot;&gt;How Databases Organize Data on Disk: Pages, Blocks, and File Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-04/&quot;&gt;B-Trees, LSM Trees, and the Indexing Tradeoff Spectrum&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-05/&quot;&gt;Inside the Query Optimizer: How Engines Pick a Plan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-06/&quot;&gt;Volcano, Vectorized, Compiled: How Engines Execute Your Query&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-07/&quot;&gt;Buffer Pools, Caches, and the Memory Hierarchy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-08/&quot;&gt;Partitioning, Sharding, and Data Distribution Strategies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-09/&quot;&gt;Hash, Sort-Merge, Broadcast: How Distributed Joins Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-10/&quot;&gt;Concurrency, Isolation, and MVCC: How Engines Handle Contention&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 3 of a 10-part series on query engine design. &lt;a href=&quot;/posts/2026-04-29-query-engine-02/&quot;&gt;Part 2&lt;/a&gt; covered row vs. column storage layouts. This article goes one level deeper: how data is physically structured within files, and what metadata accompanies it to make reads efficient.&lt;/p&gt;
&lt;h2&gt;Three Ways to Organize Data in Files&lt;/h2&gt;
&lt;p&gt;Every database faces the same question: when new data arrives, where does it go? The answer determines how fast writes are and how much work reads require.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/write-read-spectrum.png&quot; alt=&quot;The write speed vs read efficiency spectrum showing heap files, LSM trees, B-trees, and sorted files&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Heap Files: Fast Writes, Slow Reads&lt;/h3&gt;
&lt;p&gt;A heap file is an unordered collection of pages. New records go wherever there is free space. No sorting, no ordering, no extra metadata to maintain.&lt;/p&gt;
&lt;p&gt;PostgreSQL uses heap files as its primary storage. Writes are O(1) because the engine just appends to the next available slot. The tradeoff: any query without an index requires a full sequential scan of every page. For a 100 GB table, that means reading 100 GB.&lt;/p&gt;
&lt;h3&gt;Sorted Files: Fast Reads, Slow Writes&lt;/h3&gt;
&lt;p&gt;The opposite extreme. Records are physically ordered by a key. Range scans become sequential I/O (the fastest type of disk access). Binary search finds any record in O(log n) reads.&lt;/p&gt;
&lt;p&gt;The tradeoff: inserting a record into the middle requires shifting everything after it. This makes writes expensive. Few production systems use pure sorted files, but the concept appears in B-trees (which maintain sorted order in a tree structure with efficient insertions) and in compacted LSM tree levels.&lt;/p&gt;
&lt;h3&gt;LSM Trees: A Write-Optimized Compromise&lt;/h3&gt;
&lt;p&gt;Log-Structured Merge-Trees convert random writes into sequential writes. The write path works in three stages:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Buffer&lt;/strong&gt;: New records go to an in-memory sorted structure (the memtable). Writes are fast because memory is orders of magnitude faster than disk.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flush&lt;/strong&gt;: When the memtable reaches a size threshold, it is written to disk as an immutable sorted file (SSTable).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compact&lt;/strong&gt;: Background processes merge multiple SSTables into fewer, larger sorted files, removing duplicates and deleted records.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;RocksDB, LevelDB, Cassandra, HBase, and ScyllaDB all use LSM trees. The tradeoff: reads may need to check the memtable plus multiple levels of SSTables. Bloom filters (probabilistic structures that quickly answer &amp;quot;is this key possibly in this file?&amp;quot;) mitigate the read amplification.&lt;/p&gt;
&lt;h2&gt;Open File Formats: Parquet, ORC, and Avro&lt;/h2&gt;
&lt;p&gt;Beyond the engine&apos;s internal file structures, the data ecosystem has standardized on open file formats that separate the storage format from the engine. This is what makes data lakehouses possible: multiple engines (Dremio, Spark, DuckDB, Trino, Snowflake) can all read the same data files.&lt;/p&gt;
&lt;h3&gt;Apache Parquet&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://parquet.apache.org/docs/&quot;&gt;Parquet&lt;/a&gt; is the dominant columnar file format for analytics. Understanding its internal structure explains why analytical queries on Parquet files are fast.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/parquet-file-anatomy.png&quot; alt=&quot;Anatomy of a Parquet file showing row groups, column chunks, pages, and the footer with schema and statistics&quot;&gt;&lt;/p&gt;
&lt;p&gt;A Parquet file is organized in layers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Row groups&lt;/strong&gt; (default ~128 MB): Each row group contains a horizontal slice of the data (e.g., 1 million rows). This is the unit of parallelism for readers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Column chunks&lt;/strong&gt;: Within a row group, each column is stored as a separate contiguous block. This is what enables columnar I/O savings.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pages&lt;/strong&gt; (default ~1 MB): Each column chunk is divided into pages, the smallest unit of compression and encoding. Each page uses a single encoding (dictionary, RLE, delta, etc.) and compression codec (Snappy, ZSTD, LZ4, GZIP).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Footer&lt;/strong&gt;: The file footer contains the schema, the location of every row group and column chunk, and column statistics (min value, max value, null count) for each column chunk.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The footer is the key to performance. The engine reads it first, then uses the statistics to decide which row groups and column chunks to read. Everything else can be skipped.&lt;/p&gt;
&lt;h3&gt;Apache ORC&lt;/h3&gt;
&lt;p&gt;ORC (Optimized Row Columnar) serves a similar purpose to Parquet but originated in the Hive ecosystem. Its structure uses &amp;quot;stripes&amp;quot; instead of row groups (default ~250 MB) and includes built-in lightweight indexes: bloom filters per stripe and min/max statistics per row index entry (typically every 10K rows, finer-grained than Parquet&apos;s per-row-group stats).&lt;/p&gt;
&lt;h3&gt;Apache Avro&lt;/h3&gt;
&lt;p&gt;Avro is a row-oriented format with embedded schema. It supports schema evolution (readers handle files written with older or newer schemas). Avro is common for write-heavy pipelines and streaming (Kafka serialization) but is not optimized for analytical reads because its row orientation forces reading unnecessary columns. Dremio, Spark, and other engines can read Avro files but typically convert to Parquet for analytical workloads.&lt;/p&gt;
&lt;h2&gt;Metadata That Enables Skipping&lt;/h2&gt;
&lt;p&gt;The biggest performance wins in analytical engines do not come from reading data faster. They come from reading less data. File-level metadata is what makes this possible.&lt;/p&gt;
&lt;h3&gt;Column Statistics (Min/Max)&lt;/h3&gt;
&lt;p&gt;Every Parquet row group and ORC stripe stores the minimum and maximum value for each column. The query engine uses these statistics to skip entire blocks.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/predicate-pushdown-min-max.png&quot; alt=&quot;How min/max statistics allow the engine to skip irrelevant row groups when filtering by price&quot;&gt;&lt;/p&gt;
&lt;p&gt;Consider &lt;code&gt;SELECT * FROM orders WHERE price &amp;gt; 100&lt;/code&gt; on a file with three row groups:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Row Group&lt;/th&gt;
&lt;th&gt;Price Min&lt;/th&gt;
&lt;th&gt;Price Max&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;Skip (max 50 &amp;lt; 100)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;45&lt;/td&gt;
&lt;td&gt;120&lt;/td&gt;
&lt;td&gt;Scan (range overlaps)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;Scan (all values qualify)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Row Group 1 is never read. The engine saved 33% of the I/O with a single metadata comparison. On real datasets with many row groups and selective filters, these savings routinely exceed 90%.&lt;/p&gt;
&lt;p&gt;Dremio, Snowflake, DuckDB, Spark, ClickHouse, and Trino all use this technique. Table formats like Apache Iceberg take it further by storing per-file statistics in manifest files, enabling file-level pruning before any individual file is opened.&lt;/p&gt;
&lt;h3&gt;Bloom Filters&lt;/h3&gt;
&lt;p&gt;When the filter is an equality check (&lt;code&gt;WHERE customer_id = &apos;abc123&apos;&lt;/code&gt;) rather than a range, min/max stats are less useful (the value could be anywhere within the range). Bloom filters solve this: a compact probabilistic structure that answers &amp;quot;is this value possibly in this block?&amp;quot; with no false negatives.&lt;/p&gt;
&lt;p&gt;A bloom filter of a few KB can represent membership for millions of distinct values. Parquet supports optional per-column bloom filters, and ORC includes them per stripe.&lt;/p&gt;
&lt;h3&gt;Partition Metadata&lt;/h3&gt;
&lt;p&gt;Beyond file-internal metadata, many systems organize data into directories by partition key (e.g., &lt;code&gt;year=2024/month=03/day=15/&lt;/code&gt;). A query with &lt;code&gt;WHERE year = 2024&lt;/code&gt; skips all other year directories entirely.&lt;/p&gt;
&lt;p&gt;Apache Iceberg improves on directory-based partitioning with hidden partitioning: partition values are computed from data columns and stored in manifest metadata, enabling partition pruning without requiring users to know the physical layout.&lt;/p&gt;
&lt;h2&gt;The Tradeoff: Write-Time Work vs. Read-Time Work&lt;/h2&gt;
&lt;p&gt;Every organizational choice above is a variation of the same tradeoff: do more work when writing data (sort it, compute statistics, build indexes, organize into optimal row groups) to do less work when reading it.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Write-Time Cost&lt;/th&gt;
&lt;th&gt;Read-Time Benefit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Heap file&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None (full scan)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sort data&lt;/td&gt;
&lt;td&gt;High (maintain order)&lt;/td&gt;
&lt;td&gt;Binary search, sequential scans&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compute min/max stats&lt;/td&gt;
&lt;td&gt;Low (aggregate per block)&lt;/td&gt;
&lt;td&gt;Skip irrelevant blocks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Build bloom filters&lt;/td&gt;
&lt;td&gt;Moderate (hash computation)&lt;/td&gt;
&lt;td&gt;Skip blocks for equality filters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optimize row group sizes&lt;/td&gt;
&lt;td&gt;Moderate (buffer before flush)&lt;/td&gt;
&lt;td&gt;Better parallelism, less overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Systems that ingest data continuously (streaming, high-frequency writes) tend to minimize write-time work and rely on background processes (compaction, optimization jobs) to reorganize data for reads. Dremio, for example, automates table maintenance including compaction and clustering to keep Iceberg tables optimized without manual intervention.&lt;/p&gt;
&lt;p&gt;Systems that load data in batches (nightly ETL, periodic imports) can afford to invest heavily in sorting and statistics at write time because the write frequency is low.&lt;/p&gt;
&lt;p&gt;The right choice depends on your ingest pattern. There is no single answer.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The Metadata Structure of Modern Table Formats</title><link>https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-02/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-02/</guid><description>
## Apache Iceberg Masterclass - Table of Contents

1. [What Are Table Formats and Why Were They Needed?](/posts/2026-04-29-iceberg-masterclass-01/)
2...</description><pubDate>Wed, 29 Apr 2026 12:01:00 GMT</pubDate><content:encoded>&lt;h2&gt;Apache Iceberg Masterclass - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-02/&quot;&gt;The Metadata Structure of Modern Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-03/&quot;&gt;Performance and Apache Iceberg&apos;s Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-04/&quot;&gt;Partition Evolution: Change Your Partitioning Without Rewriting Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-05/&quot;&gt;Hidden Partitioning: How Iceberg Eliminates Accidental Full Table Scans&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-06/&quot;&gt;Writing to an Apache Iceberg Table: How Commits and ACID Actually Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-07/&quot;&gt;What Are Lakehouse Catalogs? The Role of Catalogs in Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-08/&quot;&gt;When Catalogs Are Embedded in Storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-09/&quot;&gt;How Data Lake Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;Maintaining Apache Iceberg Tables: Compaction, Expiry, and Cleanup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-11/&quot;&gt;Apache Iceberg Metadata Tables: Querying the Internals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-12/&quot;&gt;Using Apache Iceberg with Python and MPP Query Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-13/&quot;&gt;Approaches to Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-14/&quot;&gt;Hands-On with Apache Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-15/&quot;&gt;Migrating to Apache Iceberg: Strategies for Every Source System&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 2 of a 15-part &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;Part 1&lt;/a&gt; covered why table formats exist. This article breaks down exactly how each format organizes its metadata.&lt;/p&gt;
&lt;p&gt;The metadata structure of a table format determines everything: how fast queries start planning, how efficiently concurrent writes are handled, how schema changes propagate, and how much overhead accumulates over time. Two formats can both claim &amp;quot;ACID support&amp;quot; and &amp;quot;time travel&amp;quot; while having fundamentally different mechanisms under the hood.&lt;/p&gt;
&lt;h2&gt;Apache Iceberg: The Metadata Tree&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/iceberg-metadata-tree.png&quot; alt=&quot;Iceberg&apos;s three-layer metadata architecture from catalog to metadata.json to manifest list to manifest files to data files&quot;&gt;&lt;/p&gt;
&lt;p&gt;Iceberg organizes metadata into a tree with four levels. Each level adds specificity and enables pruning at query planning time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Level 1: Catalog pointer.&lt;/strong&gt; The catalog (a REST catalog, &lt;a href=&quot;https://www.dremio.com/platform/open-catalog/&quot;&gt;Dremio Open Catalog&lt;/a&gt;, AWS Glue, or Hive Metastore) stores a pointer to the current &lt;code&gt;metadata.json&lt;/code&gt; file. This pointer is the single source of truth for the table&apos;s current state.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Level 2: Metadata file (&lt;code&gt;metadata.json&lt;/code&gt;).&lt;/strong&gt; A JSON file containing the table&apos;s schema (with column IDs), partition spec, sort order, table properties, and a list of snapshots. Each snapshot represents a complete, immutable version of the table. When the table is updated, a new &lt;code&gt;metadata.json&lt;/code&gt; is created with the new snapshot appended to the list.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Level 3: Manifest list (Avro).&lt;/strong&gt; Each snapshot points to exactly one manifest list. The manifest list is a table of contents: it lists all the manifest files that make up this snapshot and includes partition-level summary statistics for each manifest. These summaries let the query planner skip entire manifests that cannot contain data matching the query filter.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Level 4: Manifest files (Avro).&lt;/strong&gt; Each manifest file tracks a set of data files and delete files. For each file, the manifest stores the file path, file size, row count, partition values, and column-level statistics (min value, max value, null count, NaN count, distinct count). These per-file statistics enable file-level pruning during query planning.&lt;/p&gt;
&lt;p&gt;The key insight is that each level progressively narrows the search space. A query engine using &lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-metadata-for-performance/&quot;&gt;Dremio&lt;/a&gt; or Spark reads the catalog pointer (1 request), loads the metadata file (1 read), checks the manifest list to skip irrelevant manifests (1 read, many skips), then reads only the relevant manifests to find the actual data files to scan. For a petabyte table, this can reduce planning from minutes of directory listing to milliseconds of metadata traversal.&lt;/p&gt;
&lt;h2&gt;Delta Lake: The Sequential Transaction Log&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/delta-lake-transaction-log.png&quot; alt=&quot;Delta Lake&apos;s transaction log structure with JSON commits, Parquet checkpoints, and the reader process&quot;&gt;&lt;/p&gt;
&lt;p&gt;Delta Lake uses a simpler, linear structure. All metadata lives in the &lt;code&gt;_delta_log/&lt;/code&gt; directory alongside the data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;JSON commit files&lt;/strong&gt; (&lt;code&gt;000001.json&lt;/code&gt;, &lt;code&gt;000002.json&lt;/code&gt;, ...) record each transaction as a set of actions: &lt;code&gt;add&lt;/code&gt; (a new data file), &lt;code&gt;remove&lt;/code&gt; (a file marked for deletion), &lt;code&gt;metaData&lt;/code&gt; (schema or property change), and &lt;code&gt;protocol&lt;/code&gt; (version requirements). Each commit file is sequentially numbered.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Parquet checkpoint files&lt;/strong&gt; are created every 10 commits (by default). A checkpoint is a Parquet file that summarizes the cumulative state of the table at that version, essentially a snapshot of all currently-active &lt;code&gt;add&lt;/code&gt; actions. This prevents readers from having to replay hundreds of small JSON files.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;_last_checkpoint&lt;/code&gt;&lt;/strong&gt; is a small file pointing to the most recent checkpoint. The read process is: find the latest checkpoint, load it, then replay any JSON commits after it.&lt;/p&gt;
&lt;p&gt;The tradeoff: Delta&apos;s log is simple and easy to reason about, but it does not have the multi-level pruning that Iceberg&apos;s manifest tree provides. File-level statistics exist in the add actions but are not organized hierarchically. For very large tables (millions of files), the planning phase can be slower because there is no intermediate pruning layer equivalent to Iceberg&apos;s manifest list.&lt;/p&gt;
&lt;h2&gt;Apache Hudi: The Timeline&lt;/h2&gt;
&lt;p&gt;Hudi stores metadata in the &lt;code&gt;.hoodie/&lt;/code&gt; directory as a sequence of &amp;quot;instants&amp;quot; on a timeline. Each instant represents an operation (commit, compaction, rollback, clean) and transitions through three states: &lt;code&gt;REQUESTED&lt;/code&gt;, &lt;code&gt;INFLIGHT&lt;/code&gt;, and &lt;code&gt;COMPLETED&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The timeline is split into two parts:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Active timeline&lt;/strong&gt; contains recent instants that are needed for current read and write operations. The file naming pattern is &lt;code&gt;&amp;lt;timestamp&amp;gt;.&amp;lt;action_type&amp;gt;.&amp;lt;state&amp;gt;&lt;/code&gt;. For example, &lt;code&gt;20250429010500.commit.completed&lt;/code&gt; indicates a completed write operation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Archived timeline&lt;/strong&gt; contains older instants that have been moved to &lt;code&gt;.hoodie/archived/&lt;/code&gt; to keep the active timeline lean. Hudi 1.0 introduced an LSM-based timeline that compacts archived instants into Parquet files for efficient long-term storage.&lt;/p&gt;
&lt;p&gt;Hudi&apos;s timeline tracks more granular operation types than other formats: &lt;code&gt;commit&lt;/code&gt; (COW write), &lt;code&gt;delta_commit&lt;/code&gt; (MOR write), &lt;code&gt;compaction&lt;/code&gt;, &lt;code&gt;clean&lt;/code&gt; (garbage collection), &lt;code&gt;rollback&lt;/code&gt;, &lt;code&gt;savepoint&lt;/code&gt;, and &lt;code&gt;replace&lt;/code&gt; (clustering). This granularity reflects Hudi&apos;s focus on complex write patterns like CDC pipelines.&lt;/p&gt;
&lt;h2&gt;Apache Paimon: Snapshots and LSM Trees&lt;/h2&gt;
&lt;p&gt;Paimon&apos;s metadata is organized around snapshots and buckets. Each partition is divided into a fixed number of buckets, and each bucket contains an independent LSM (Log-Structured Merge) tree.&lt;/p&gt;
&lt;p&gt;The snapshot metadata tracks which data files and changelog files belong to each bucket at each point in time. Inside each bucket, the LSM tree structure contains multiple &amp;quot;sorted runs&amp;quot; (levels) of Parquet files. When data is written, it lands in level 0 as a small sorted file. Background compaction merges small files into larger ones at higher levels.&lt;/p&gt;
&lt;p&gt;This is fundamentally different from the other formats because Paimon&apos;s metadata structure is designed for continuous streaming writes rather than batch commits. The LSM tree handles high-frequency inserts and updates efficiently by buffering writes in memory and flushing them as sorted runs.&lt;/p&gt;
&lt;h2&gt;DuckLake: SQL Database as Metadata&lt;/h2&gt;
&lt;p&gt;DuckLake takes the most radical departure. Instead of storing metadata as files in object storage, all metadata lives in a traditional SQL database (PostgreSQL, MySQL, SQLite, or DuckDB itself).&lt;/p&gt;
&lt;p&gt;The metadata database contains tables for: schemas, snapshots, data files, column statistics, and table properties. When a query engine needs to plan a query, it issues a single SQL query against the metadata database instead of reading multiple metadata files from object storage.&lt;/p&gt;
&lt;p&gt;The tradeoff is a dependency on a running database process for metadata management. The benefit is dramatically simpler metadata access patterns and the ability to use SQL for metadata operations like listing snapshots, finding files, and checking statistics.&lt;/p&gt;
&lt;h2&gt;Side-by-Side Comparison&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/metadata-architecture-comparison.png&quot; alt=&quot;Five approaches to table metadata from file-based to database-backed&quot;&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Iceberg&lt;/th&gt;
&lt;th&gt;Delta Lake&lt;/th&gt;
&lt;th&gt;Hudi&lt;/th&gt;
&lt;th&gt;Paimon&lt;/th&gt;
&lt;th&gt;DuckLake&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metadata format&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JSON + Avro files&lt;/td&gt;
&lt;td&gt;JSON + Parquet files&lt;/td&gt;
&lt;td&gt;Avro instant files&lt;/td&gt;
&lt;td&gt;Snapshot + LSM files&lt;/td&gt;
&lt;td&gt;SQL database tables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metadata location&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Object storage&lt;/td&gt;
&lt;td&gt;&lt;code&gt;_delta_log/&lt;/code&gt; directory&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.hoodie/&lt;/code&gt; directory&lt;/td&gt;
&lt;td&gt;Table directory&lt;/td&gt;
&lt;td&gt;External database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-level pruning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (manifest list + manifests)&lt;/td&gt;
&lt;td&gt;No (flat file list)&lt;/td&gt;
&lt;td&gt;Partial (index-based)&lt;/td&gt;
&lt;td&gt;No (bucket-level)&lt;/td&gt;
&lt;td&gt;Via SQL queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Planning overhead&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low (tree traversal)&lt;/td&gt;
&lt;td&gt;Moderate (checkpoint + replay)&lt;/td&gt;
&lt;td&gt;Moderate (timeline scan)&lt;/td&gt;
&lt;td&gt;Low (snapshot lookup)&lt;/td&gt;
&lt;td&gt;Lowest (single SQL query)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metadata growth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Controlled (manifest reuse)&lt;/td&gt;
&lt;td&gt;Requires checkpointing&lt;/td&gt;
&lt;td&gt;Requires archiving&lt;/td&gt;
&lt;td&gt;Requires compaction&lt;/td&gt;
&lt;td&gt;Database manages it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Engine independence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (spec-defined)&lt;/td&gt;
&lt;td&gt;Moderate (Spark-oriented)&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Low (Flink-oriented)&lt;/td&gt;
&lt;td&gt;Low (DuckDB-oriented)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For teams building on multiple engines, Iceberg&apos;s metadata structure provides the best combination of planning efficiency and engine independence. &lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-delta-lake-apache-hudi-a-comparison/&quot;&gt;Dremio&lt;/a&gt; uses Iceberg&apos;s metadata tree to achieve fast query planning even on tables with millions of files, and its &lt;a href=&quot;https://www.dremio.com/platform/reflections/&quot;&gt;Columnar Cloud Cache&lt;/a&gt; caches frequently-accessed metadata locally to further reduce planning latency.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-03/&quot;&gt;Part 3&lt;/a&gt; covers how query engines use Iceberg&apos;s metadata specifically for performance optimization.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Free Resources&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpageiceberg&quot;&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpagepolaris&quot;&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://forms.gle/xdsun6JiRvFY9rB36&quot;&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Row vs. Column: How Storage Layout Shapes Everything</title><link>https://iceberglakehouse.com/posts/2026-04-29-query-engine-02/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-query-engine-02/</guid><description>
## Query Engine Optimization - Table of Contents

1. [How Query Engines Think: The Tradeoffs Behind Every Data System](/posts/2026-04-29-query-engine...</description><pubDate>Wed, 29 Apr 2026 12:01:00 GMT</pubDate><content:encoded>&lt;h2&gt;Query Engine Optimization - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-01/&quot;&gt;How Query Engines Think: The Tradeoffs Behind Every Data System&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-02/&quot;&gt;Row vs. Column: How Storage Layout Shapes Everything&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-03/&quot;&gt;How Databases Organize Data on Disk: Pages, Blocks, and File Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-04/&quot;&gt;B-Trees, LSM Trees, and the Indexing Tradeoff Spectrum&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-05/&quot;&gt;Inside the Query Optimizer: How Engines Pick a Plan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-06/&quot;&gt;Volcano, Vectorized, Compiled: How Engines Execute Your Query&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-07/&quot;&gt;Buffer Pools, Caches, and the Memory Hierarchy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-08/&quot;&gt;Partitioning, Sharding, and Data Distribution Strategies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-09/&quot;&gt;Hash, Sort-Merge, Broadcast: How Distributed Joins Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-10/&quot;&gt;Concurrency, Isolation, and MVCC: How Engines Handle Contention&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 2 of a 10-part series on query engine design. &lt;a href=&quot;/posts/2026-04-29-query-engine-01/&quot;&gt;Part 1 (Overview)&lt;/a&gt; introduced the nine decisions every engine must make. This article covers the first and most fundamental: how bytes are arranged on disk.&lt;/p&gt;
&lt;h2&gt;How Row Storage Works&lt;/h2&gt;
&lt;p&gt;A row store keeps all fields of a record physically together on a disk page. A page is typically 4KB to 16KB. Each page holds multiple complete &amp;quot;tuples&amp;quot; (records). When you read one page, you get every field for every record on that page.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/row-vs-column-layout.png&quot; alt=&quot;How the same 5 records look in row-oriented layout versus column-oriented layout on disk&quot;&gt;&lt;/p&gt;
&lt;p&gt;This layout is optimized for transactional workloads. Looking up a customer by ID? One page read gives you every field: name, email, address, balance, status. Inserting a new order? One write puts the entire record in one place. Updating a single field? The engine finds the tuple and modifies it in place.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.postgresql.org/docs/current/storage.html&quot;&gt;PostgreSQL&lt;/a&gt; stores rows as heap tuples with a header containing transaction visibility info and a null bitmap. MySQL/InnoDB organizes rows in a clustered B-tree indexed by primary key. Oracle and SQL Server both default to row-based storage.&lt;/p&gt;
&lt;p&gt;The weakness shows up with analytical queries. If your table has 50 columns and your query needs 3 of them, a row store still reads all 50 for every row. The other 47 columns ride along for free, wasting I/O bandwidth and polluting your CPU cache.&lt;/p&gt;
&lt;h2&gt;How Column Storage Works&lt;/h2&gt;
&lt;p&gt;A column store flips the layout. Instead of keeping all fields of a record together, it keeps all values for a single field together. Every &lt;code&gt;price&lt;/code&gt; value is stored contiguously. Every &lt;code&gt;status&lt;/code&gt; value is stored contiguously. And so on.&lt;/p&gt;
&lt;p&gt;The data is typically organized in &amp;quot;row groups&amp;quot; (Parquet calls them this, ORC calls them &amp;quot;stripes&amp;quot;), each containing 100K to 1M rows. Within each row group, each column is stored as a separate &amp;quot;column chunk&amp;quot; with its own compression and encoding. Values at the same position across column chunks belong to the same logical record.&lt;/p&gt;
&lt;p&gt;This layout is optimized for analytical workloads. When a query computes &lt;code&gt;AVG(price) WHERE status = &apos;shipped&apos;&lt;/code&gt;, the engine reads only the &lt;code&gt;price&lt;/code&gt; and &lt;code&gt;status&lt;/code&gt; columns. The other 48 columns are never touched.&lt;/p&gt;
&lt;p&gt;Systems like &lt;a href=&quot;https://duckdb.org/docs/internals/storage&quot;&gt;DuckDB&lt;/a&gt;, ClickHouse, Snowflake, Dremio, Redshift, and BigQuery all use columnar storage as their primary layout. Apache Parquet and ORC are open columnar file formats used across the data ecosystem.&lt;/p&gt;
&lt;h2&gt;The I/O Math&lt;/h2&gt;
&lt;p&gt;The savings from columnar storage scale with table width. Consider a concrete example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Table&lt;/strong&gt;: 50 columns, 1 billion rows, 100 bytes per row average&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total data&lt;/strong&gt;: 100 GB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Query&lt;/strong&gt;: &lt;code&gt;SELECT AVG(price) FROM orders WHERE status = &apos;shipped&apos;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Columns needed&lt;/strong&gt;: 2 (price + status), approximately 4 GB&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/io-comparison-row-vs-column.png&quot; alt=&quot;I/O comparison showing row store reading 100 GB versus column store reading only 4 GB for the same analytical query&quot;&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Storage Layout&lt;/th&gt;
&lt;th&gt;Data Read&lt;/th&gt;
&lt;th&gt;Percentage of Total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Row store&lt;/td&gt;
&lt;td&gt;100 GB&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Column store&lt;/td&gt;
&lt;td&gt;4 GB&lt;/td&gt;
&lt;td&gt;4%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That is a 25x reduction in I/O. For a table with 200 columns (common in analytics), the ratio gets even more dramatic.&lt;/p&gt;
&lt;p&gt;The tradeoff goes the other direction for point lookups. Fetching one complete record from a column store requires reading from every column file: 50 separate reads for a 50-column table. A row store does it in one.&lt;/p&gt;
&lt;h2&gt;Why Columnar Compression Is So Much Better&lt;/h2&gt;
&lt;p&gt;Uniform data within a column enables specialized encoding that mixed-type rows cannot use:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Encoding&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;How It Works&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Run-Length (RLE)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sorted columns with repeated values&lt;/td&gt;
&lt;td&gt;Store (value, count) pairs. A column of 1M &amp;quot;USA&amp;quot; values becomes one entry.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dictionary&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low-cardinality strings&lt;/td&gt;
&lt;td&gt;Map each unique string to an integer ID. Store the small integers instead.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Delta&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sorted integers/timestamps&lt;/td&gt;
&lt;td&gt;Store differences between consecutive values. Monotonic sequences shrink to near-zero.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bit-packing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Small integers&lt;/td&gt;
&lt;td&gt;Use the minimum number of bits per value instead of a full 32 or 64 bits.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;These techniques routinely achieve 5-10x compression on analytical datasets. Row stores cannot match this because adjacent bytes in a tuple belong to different data types, defeating any type-specific encoding.&lt;/p&gt;
&lt;h2&gt;Late Materialization&lt;/h2&gt;
&lt;p&gt;Column stores gain additional performance by deferring tuple reconstruction until the very end. This technique is called late materialization:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Scan the &lt;code&gt;status&lt;/code&gt; column. Produce a selection vector (a bitmap of matching row positions).&lt;/li&gt;
&lt;li&gt;Use that selection vector to read only the matching positions from the &lt;code&gt;price&lt;/code&gt; column.&lt;/li&gt;
&lt;li&gt;Compute &lt;code&gt;AVG(price)&lt;/code&gt; on the filtered values.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;At no point did the engine reconstruct a full row. It worked entirely with columnar arrays and position-based selection. This avoids copying irrelevant data and keeps computation in tight, cache-friendly loops that exploit CPU SIMD instructions.&lt;/p&gt;
&lt;p&gt;Dremio uses &lt;a href=&quot;https://arrow.apache.org/&quot;&gt;Apache Arrow&lt;/a&gt; as its native in-memory columnar format, which is specifically designed for this kind of vectorized, late-materialized processing.&lt;/p&gt;
&lt;h2&gt;Hybrid Approaches&lt;/h2&gt;
&lt;p&gt;Not every system picks one side and stays there.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;SQL Server&lt;/strong&gt; lets you add nonclustered columnstore indexes to row-based tables. The query optimizer decides which format to use for each query. &lt;strong&gt;Oracle&lt;/strong&gt; offers an In-Memory Column Store (IMCS) that keeps hot data in both row and column format simultaneously in memory.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Wide-column stores&lt;/strong&gt; like Cassandra and HBase take a different path. They group related columns into &amp;quot;column families.&amp;quot; Within a family, data is stored together (row-like). Across families, storage is separate (column-like). This optimizes for workloads where certain columns are always accessed together.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Parquet&lt;/strong&gt; and &lt;strong&gt;ORC&lt;/strong&gt; use a hybrid layout at the file level: data is divided into row groups (row-like partitioning), and within each row group, each column is stored separately (column-like). This balances the benefits of columnar scanning with practical record reconstruction when needed.&lt;/p&gt;
&lt;h2&gt;Where Real Systems Land&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/storage-format-spectrum.png&quot; alt=&quot;Storage format choices across real systems from row-oriented PostgreSQL to column-oriented DuckDB, ClickHouse, Snowflake, and Dremio&quot;&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Storage Format&lt;/th&gt;
&lt;th&gt;Primary Workload&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;Row&lt;/td&gt;
&lt;td&gt;OLTP&lt;/td&gt;
&lt;td&gt;Heap tuples, TOAST for large values&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MySQL/InnoDB&lt;/td&gt;
&lt;td&gt;Row&lt;/td&gt;
&lt;td&gt;OLTP&lt;/td&gt;
&lt;td&gt;Clustered B-tree by primary key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL Server&lt;/td&gt;
&lt;td&gt;Row + optional column&lt;/td&gt;
&lt;td&gt;Mixed&lt;/td&gt;
&lt;td&gt;Columnstore indexes for analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Oracle&lt;/td&gt;
&lt;td&gt;Row + optional column&lt;/td&gt;
&lt;td&gt;Mixed&lt;/td&gt;
&lt;td&gt;In-Memory Column Store (IMCS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DuckDB&lt;/td&gt;
&lt;td&gt;Column&lt;/td&gt;
&lt;td&gt;OLAP (embedded)&lt;/td&gt;
&lt;td&gt;Morsel-driven parallelism&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ClickHouse&lt;/td&gt;
&lt;td&gt;Column&lt;/td&gt;
&lt;td&gt;OLAP (real-time)&lt;/td&gt;
&lt;td&gt;MergeTree engine, sparse indexes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snowflake&lt;/td&gt;
&lt;td&gt;Column&lt;/td&gt;
&lt;td&gt;Cloud OLAP&lt;/td&gt;
&lt;td&gt;Micro-partitions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dremio&lt;/td&gt;
&lt;td&gt;Column&lt;/td&gt;
&lt;td&gt;OLAP (lakehouse)&lt;/td&gt;
&lt;td&gt;Arrow in-memory, reads Parquet/Iceberg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redshift&lt;/td&gt;
&lt;td&gt;Column&lt;/td&gt;
&lt;td&gt;Cloud OLAP&lt;/td&gt;
&lt;td&gt;MPP, zone maps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cassandra&lt;/td&gt;
&lt;td&gt;Wide-column&lt;/td&gt;
&lt;td&gt;Write-heavy distributed&lt;/td&gt;
&lt;td&gt;LSM-based, column families&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;When to Choose Which&lt;/h2&gt;
&lt;p&gt;The choice is driven by your dominant access pattern:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Mostly point lookups and transactional writes&lt;/strong&gt; (user profiles, order processing, session management): use a row store. PostgreSQL and MySQL are battle-tested here.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mostly analytical scans and aggregations&lt;/strong&gt; (dashboards, reports, data science): use a column store. DuckDB for embedded, ClickHouse or Dremio for distributed, Snowflake or BigQuery for fully managed cloud.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Both workloads on the same data&lt;/strong&gt;: use separate systems for each (the most common production pattern) or a hybrid like SQL Server with columnstore indexes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Trying to force a row store into heavy analytics or a column store into high-frequency transactions will produce consistently poor results. The storage layout is the first domino, and it falls in one direction.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>What Are Table Formats and Why Were They Needed?</title><link>https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-01/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-01/</guid><description>
## Apache Iceberg Masterclass - Table of Contents

1. [What Are Table Formats and Why Were They Needed?](/posts/2026-04-29-iceberg-masterclass-01/)
2...</description><pubDate>Wed, 29 Apr 2026 12:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Apache Iceberg Masterclass - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-02/&quot;&gt;The Metadata Structure of Modern Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-03/&quot;&gt;Performance and Apache Iceberg&apos;s Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-04/&quot;&gt;Partition Evolution: Change Your Partitioning Without Rewriting Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-05/&quot;&gt;Hidden Partitioning: How Iceberg Eliminates Accidental Full Table Scans&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-06/&quot;&gt;Writing to an Apache Iceberg Table: How Commits and ACID Actually Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-07/&quot;&gt;What Are Lakehouse Catalogs? The Role of Catalogs in Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-08/&quot;&gt;When Catalogs Are Embedded in Storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-09/&quot;&gt;How Data Lake Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-10/&quot;&gt;Maintaining Apache Iceberg Tables: Compaction, Expiry, and Cleanup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-11/&quot;&gt;Apache Iceberg Metadata Tables: Querying the Internals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-12/&quot;&gt;Using Apache Iceberg with Python and MPP Query Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-13/&quot;&gt;Approaches to Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-14/&quot;&gt;Hands-On with Apache Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-15/&quot;&gt;Migrating to Apache Iceberg: Strategies for Every Source System&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is Part 1 of a 15-part &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-01/&quot;&gt;Apache Iceberg Masterclass&lt;/a&gt;. This article covers the fundamental question: what problem do table formats solve, and why does the choice between them matter?&lt;/p&gt;
&lt;p&gt;A data lake without a table format is a collection of files. It has no concept of a transaction, no mechanism to prevent two writers from producing corrupted state, and no efficient way to determine which files belong to the current version of a table. Table formats exist because the gap between &amp;quot;a pile of Parquet files&amp;quot; and &amp;quot;a reliable analytical table&amp;quot; is enormous, and bridging it requires a formal metadata specification.&lt;/p&gt;
&lt;h2&gt;The World Before Table Formats&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/before-after-table-formats.png&quot; alt=&quot;How table formats solved the chaos of raw data lakes with a structured metadata layer&quot;&gt;&lt;/p&gt;
&lt;p&gt;Before table formats, data lakes relied on a simple convention: data was organized into directories in object storage (S3, ADLS, GCS), and the &lt;a href=&quot;https://cwiki.apache.org/confluence/display/hive/design#Design-HiveMetastore&quot;&gt;Hive Metastore&lt;/a&gt; tracked which directories corresponded to which partitions.&lt;/p&gt;
&lt;p&gt;This approach had five critical problems:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No atomic commits.&lt;/strong&gt; If a Spark job wrote 500 new Parquet files and failed after writing 300, readers could see the 300 partial files. There was no mechanism to make all 500 files visible at once or none of them. Cleanup required manual intervention or custom garbage collection scripts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Expensive query planning.&lt;/strong&gt; To determine which files to scan, the engine issued &lt;code&gt;LIST&lt;/code&gt; requests against object storage. S3 returns up to 5,000 objects per request. A table with 100,000 files required 20+ sequential HTTP calls before query execution could even start. At Netflix, query planning for large tables could take minutes just from directory listing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Schema changes required rewrites.&lt;/strong&gt; Adding a column to a Hive table meant either rewriting every file (expensive) or accepting that old files had a different schema than new files (confusing). Renaming a column was not supported without a full table rewrite because Hive mapped columns by position, not by identity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No time travel.&lt;/strong&gt; Once data was overwritten, the previous version was gone. There was no snapshot history, no ability to roll back a bad write, and no way to reproduce a query result from last Tuesday.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Exposed partitioning.&lt;/strong&gt; Users had to know the physical partition layout. A table partitioned by &lt;code&gt;year&lt;/code&gt; and &lt;code&gt;month&lt;/code&gt; required queries to explicitly filter on those columns using the exact partition column names (&lt;code&gt;WHERE year = 2024 AND month = 3&lt;/code&gt;). If partitioning changed, every downstream query broke.&lt;/p&gt;
&lt;h2&gt;What a Table Format Actually Is&lt;/h2&gt;
&lt;p&gt;A table format is a specification that defines how to organize metadata about data files so that query engines can treat them as reliable, transactional tables. It sits between the query engine and the physical files.&lt;/p&gt;
&lt;p&gt;The core responsibilities of every table format:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;File tracking&lt;/strong&gt;: Maintain an explicit list of which data files belong to the current version of the table, eliminating directory listing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Atomic commits&lt;/strong&gt;: Make all changes to a table visible to readers at once through a single metadata pointer swap&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema management&lt;/strong&gt;: Track the table schema and its evolution history, allowing safe column adds, drops, renames, and reorders&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition management&lt;/strong&gt;: Define how data is partitioned and enable query pruning without exposing the physical layout to users&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snapshot history&lt;/strong&gt;: Maintain a history of table states for time travel, rollback, and auditing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Statistics&lt;/strong&gt;: Store column-level min/max values and other statistics to enable file skipping during query planning&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The data files themselves are still standard &lt;a href=&quot;https://parquet.apache.org/&quot;&gt;Parquet&lt;/a&gt; or ORC. The table format adds a metadata layer on top that gives those files the properties of a database table.&lt;/p&gt;
&lt;h2&gt;The Five Table Formats&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/format-evolution-timeline.png&quot; alt=&quot;Timeline showing the evolution from Hive Metastore through Hudi, Iceberg, Delta Lake, Paimon, and DuckLake&quot;&gt;&lt;/p&gt;
&lt;p&gt;Five table formats exist today, each born from a different problem and optimized for a different workload.&lt;/p&gt;
&lt;h3&gt;Apache Iceberg&lt;/h3&gt;
&lt;p&gt;Iceberg started at Netflix in 2017, created by Ryan Blue to solve Netflix&apos;s petabyte-scale query planning problems. It uses a three-layer metadata tree: a &lt;code&gt;metadata.json&lt;/code&gt; file points to a manifest list, which points to manifest files, which track individual data files with column-level statistics.&lt;/p&gt;
&lt;p&gt;Iceberg&apos;s defining feature is its &lt;a href=&quot;https://iceberg.apache.org/spec/&quot;&gt;formal specification&lt;/a&gt;. Any engine that follows the spec can read and write Iceberg tables correctly. This makes Iceberg the most engine-neutral format. Spark, Trino, Flink, &lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/&quot;&gt;Dremio&lt;/a&gt;, Snowflake, BigQuery, Athena, StarRocks, and DuckDB all support it.&lt;/p&gt;
&lt;p&gt;Iceberg also introduced &lt;a href=&quot;https://www.dremio.com/blog/fewer-accidental-full-table-scans-brought-to-you-by-apache-icebergs-hidden-partitioning/&quot;&gt;hidden partitioning&lt;/a&gt; and partition evolution, which are covered in depth in Parts 4 and 5 of this series.&lt;/p&gt;
&lt;h3&gt;Delta Lake&lt;/h3&gt;
&lt;p&gt;Delta Lake was created at Databricks in 2019. It stores metadata as a sequential transaction log (&lt;code&gt;_delta_log/&lt;/code&gt;) of JSON and Parquet checkpoint files. Each commit appends a new log entry describing which files were added or removed.&lt;/p&gt;
&lt;p&gt;Delta Lake&apos;s design prioritizes simplicity within the Spark ecosystem. Its strongest features are Liquid Clustering (adaptive data organization that replaces static partitioning) and UniForm (automatic generation of Iceberg-compatible metadata so other engines can read Delta tables as Iceberg).&lt;/p&gt;
&lt;h3&gt;Apache Hudi&lt;/h3&gt;
&lt;p&gt;Hudi originated at Uber in 2016, designed specifically for Change Data Capture (CDC) pipelines that needed to upsert millions of records per hour. Hudi uses a timeline-based metadata architecture where each commit, compaction, and rollback is an &amp;quot;action instant.&amp;quot;&lt;/p&gt;
&lt;p&gt;Hudi offers both Copy-on-Write (rewrite entire files on update) and Merge-on-Read (write deltas and merge at read time) table types, plus record-level indexing for fast point lookups. It excels when your primary workload involves frequent row-level updates and deletes.&lt;/p&gt;
&lt;h3&gt;Apache Paimon&lt;/h3&gt;
&lt;p&gt;Paimon evolved from Flink Table Store at Alibaba and entered Apache incubation in 2023. It uses &lt;a href=&quot;https://en.wikipedia.org/wiki/Log-structured_merge-tree&quot;&gt;LSM-tree&lt;/a&gt; based storage internally, making it the most streaming-native table format.&lt;/p&gt;
&lt;p&gt;Tables in Paimon are divided into partitions and then further into buckets, each containing an independent LSM tree. This structure enables high-throughput streaming writes with millisecond-level latency. Paimon supports multiple merge engines (deduplication, partial update, aggregation) that determine how records with the same primary key are resolved.&lt;/p&gt;
&lt;h3&gt;DuckLake&lt;/h3&gt;
&lt;p&gt;DuckLake is the newest entry, released by DuckDB Labs and MotherDuck in 2025. It takes a fundamentally different approach: instead of storing metadata as files in object storage, DuckLake stores all metadata in a standard SQL database (PostgreSQL, MySQL, SQLite, or DuckDB itself).&lt;/p&gt;
&lt;p&gt;This means a single SQL query resolves all metadata (schema, file list, statistics) instead of the multiple HTTP requests required by file-based metadata formats. The tradeoff is a dependency on a running database for the metadata layer and currently limited engine support (primarily DuckDB).&lt;/p&gt;
&lt;h2&gt;Where Each Format Excels&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-iceberg-masterclass/format-positioning-chart.png&quot; alt=&quot;Positioning chart showing where Iceberg, Delta Lake, Hudi, Paimon, and DuckLake sit on batch vs streaming and single vs multi-engine axes&quot;&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Iceberg&lt;/th&gt;
&lt;th&gt;Delta Lake&lt;/th&gt;
&lt;th&gt;Hudi&lt;/th&gt;
&lt;th&gt;Paimon&lt;/th&gt;
&lt;th&gt;DuckLake&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metadata&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;File-based tree&lt;/td&gt;
&lt;td&gt;File-based log&lt;/td&gt;
&lt;td&gt;File-based timeline&lt;/td&gt;
&lt;td&gt;File-based LSM&lt;/td&gt;
&lt;td&gt;SQL database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Engine support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Broadest&lt;/td&gt;
&lt;td&gt;Good (via UniForm)&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Growing&lt;/td&gt;
&lt;td&gt;DuckDB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Schema evolution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;By column ID&lt;/td&gt;
&lt;td&gt;By name&lt;/td&gt;
&lt;td&gt;By version&lt;/td&gt;
&lt;td&gt;By version&lt;/td&gt;
&lt;td&gt;SQL ALTER&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Partition evolution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (unique)&lt;/td&gt;
&lt;td&gt;Liquid Clustering&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Bucket evolution&lt;/td&gt;
&lt;td&gt;SQL-managed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Streaming writes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-engine analytics&lt;/td&gt;
&lt;td&gt;Spark/Databricks&lt;/td&gt;
&lt;td&gt;CDC/upserts&lt;/td&gt;
&lt;td&gt;Flink streaming&lt;/td&gt;
&lt;td&gt;Local SQL analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The key insight: each format reflects the priorities of the team that built it. Netflix needed multi-engine reads at petabyte scale (Iceberg). Uber needed high-frequency upserts (Hudi). Alibaba needed real-time streaming from Flink (Paimon). Databricks needed Spark-optimized simplicity (Delta). DuckDB Labs wanted SQL-native metadata management (DuckLake).&lt;/p&gt;
&lt;h2&gt;Why Iceberg Has Become the Default&lt;/h2&gt;
&lt;p&gt;Iceberg has achieved the broadest adoption for three reasons:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Specification-first design.&lt;/strong&gt; Iceberg&apos;s &lt;a href=&quot;https://iceberg.apache.org/spec/&quot;&gt;spec&lt;/a&gt; is independent of any engine or vendor. Any team can build a conforming implementation. This created a network effect: more engine support attracted more users, which attracted more engine support.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;No engine dependency.&lt;/strong&gt; Unlike Delta Lake&apos;s historical Spark dependency or Paimon&apos;s Flink focus, Iceberg was designed from day one to work across engines. A table written by Spark can be read by &lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-delta-lake-apache-hudi-a-comparison/&quot;&gt;Dremio&lt;/a&gt;, Trino, Flink, or Snowflake without conversion.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Industry convergence.&lt;/strong&gt; Snowflake, AWS (Athena, EMR), Google (BigQuery), and Databricks (via UniForm) have all adopted Iceberg as an interoperability standard. When the major cloud vendors align on a format, it becomes the safe choice for long-term investments.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;That said, Iceberg is not universally superior. Hudi&apos;s record-level indexing makes it faster for point lookups on upsert-heavy tables. Paimon&apos;s LSM-tree architecture handles continuous streaming ingestion with lower latency than Iceberg&apos;s batch-oriented commit model. DuckLake&apos;s SQL-based metadata is simpler for single-engine, local-first analytics.&lt;/p&gt;
&lt;p&gt;The rest of this series focuses on Iceberg because its design decisions and capabilities represent the state of the art for multi-engine analytical lakehouses. &lt;a href=&quot;/posts/2026-04-29-iceberg-masterclass-02/&quot;&gt;Part 2&lt;/a&gt; examines the metadata structures of all five formats in detail.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;p&gt;To learn more about Apache Iceberg and the lakehouse architecture, check out these resources:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Free Resources&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpageiceberg&quot;&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/linkpagepolaris&quot;&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced&quot;&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://forms.gle/xdsun6JiRvFY9rB36&quot;&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How Query Engines Think: The Tradeoffs Behind Every Data System</title><link>https://iceberglakehouse.com/posts/2026-04-29-query-engine-01/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-04-29-query-engine-01/</guid><description>
## Query Engine Optimization - Table of Contents

1. [How Query Engines Think: The Tradeoffs Behind Every Data System](/posts/2026-04-29-query-engine...</description><pubDate>Wed, 29 Apr 2026 12:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Query Engine Optimization - Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-01/&quot;&gt;How Query Engines Think: The Tradeoffs Behind Every Data System&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-02/&quot;&gt;Row vs. Column: How Storage Layout Shapes Everything&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-03/&quot;&gt;How Databases Organize Data on Disk: Pages, Blocks, and File Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-04/&quot;&gt;B-Trees, LSM Trees, and the Indexing Tradeoff Spectrum&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-05/&quot;&gt;Inside the Query Optimizer: How Engines Pick a Plan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-06/&quot;&gt;Volcano, Vectorized, Compiled: How Engines Execute Your Query&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-07/&quot;&gt;Buffer Pools, Caches, and the Memory Hierarchy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-08/&quot;&gt;Partitioning, Sharding, and Data Distribution Strategies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-09/&quot;&gt;Hash, Sort-Merge, Broadcast: How Distributed Joins Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-04-29-query-engine-10/&quot;&gt;Concurrency, Isolation, and MVCC: How Engines Handle Contention&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Every database you have ever used is a collection of deliberate engineering tradeoffs. PostgreSQL is fast at looking up a single customer record but slow at scanning a billion rows for an aggregate. ClickHouse is the opposite. DuckDB runs analytical queries on your laptop at speeds that embarrass some cloud data warehouses, but it is not designed to handle 10,000 concurrent transactional writes per second. Dremio accelerates analytical queries on lakehouse data using Apache Arrow and Iceberg, but it is not a replacement for a transactional OLTP database.&lt;/p&gt;
&lt;p&gt;None of these systems are broken. They are each optimized for a specific set of problems, and that optimization comes at the cost of other problems. Understanding &lt;em&gt;why&lt;/em&gt; they behave differently requires looking at the nine design decisions that every query engine must make.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/query-engine-decision-map.png&quot; alt=&quot;The 9 decisions that shape every query engine from storage layout to concurrency control&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The Nine Decisions Every Engine Must Make&lt;/h2&gt;
&lt;p&gt;When engineers build a query engine, they face a series of interconnected choices. Each choice optimizes for one type of workload and creates a weakness for another. Here is the full map:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Storage layout&lt;/strong&gt;: Should records be stored by row or by column?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Disk organization&lt;/strong&gt;: How should data be structured within files? What metadata should accompany it?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Indexing&lt;/strong&gt;: What auxiliary data structures should speed up lookups?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Query optimization&lt;/strong&gt;: How should the engine choose between multiple possible execution strategies?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Execution model&lt;/strong&gt;: How should the CPU actually process data through the operator tree?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory management&lt;/strong&gt;: How should the engine use RAM, and what happens when data does not fit?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partitioning&lt;/strong&gt;: How should data be divided across storage units or nodes?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Join algorithms&lt;/strong&gt;: When data for a join lives in different places, how do you bring it together?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Concurrency control&lt;/strong&gt;: How should the engine handle multiple transactions touching the same data?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The rest of this article introduces each area. The nine articles that follow will cover each one in depth.&lt;/p&gt;
&lt;h2&gt;Storage Layout: Row vs. Column&lt;/h2&gt;
&lt;p&gt;The most fundamental decision is how to arrange bytes on disk.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Row stores&lt;/strong&gt; (PostgreSQL, MySQL) keep all fields of a record physically adjacent. When you look up a customer by ID, the engine reads one contiguous block and has every field immediately. Inserts and updates are fast because the entire record lives in one place.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Column stores&lt;/strong&gt; (DuckDB, ClickHouse, Dremio, Snowflake) keep all values for a single field stored together. When an analytical query needs the average of one column across a billion rows, the engine reads only that column and ignores the other 49. Compression improves because uniform data types pack tightly. But inserting a single row means writing to every column file separately.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Row Store&lt;/th&gt;
&lt;th&gt;Column Store&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Point lookups&lt;/td&gt;
&lt;td&gt;Fast (one read gets full record)&lt;/td&gt;
&lt;td&gt;Slow (must read from every column)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Analytical scans&lt;/td&gt;
&lt;td&gt;Slow (reads unused columns)&lt;/td&gt;
&lt;td&gt;Fast (reads only needed columns)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compression&lt;/td&gt;
&lt;td&gt;Moderate (mixed types)&lt;/td&gt;
&lt;td&gt;High (uniform types, 5-10x better)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transactional writes&lt;/td&gt;
&lt;td&gt;Fast (one write per record)&lt;/td&gt;
&lt;td&gt;Expensive (one write per column)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The choice cascades through everything else: what indexes make sense, how execution works, how memory is used. It is the first domino.&lt;/p&gt;
&lt;h2&gt;How Data Gets Indexed&lt;/h2&gt;
&lt;p&gt;Every index speeds up reads and slows down writes. The question is which tradeoff to accept.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;B-trees&lt;/strong&gt; are the standard for transactional databases. They maintain a balanced tree structure with O(log n) lookups and efficient range scans. PostgreSQL, MySQL, and Oracle all default to B-trees. They handle mixed read/write workloads well, but heavy write volumes cause fragmentation and rebalancing overhead.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LSM trees&lt;/strong&gt; (Log-Structured Merge-Trees) are built for write-heavy workloads. They buffer writes in memory and flush them to disk as sorted files, converting random writes into sequential ones. &lt;a href=&quot;https://rocksdb.org/&quot;&gt;RocksDB&lt;/a&gt;, Cassandra, and HBase all use LSM trees. The tradeoff: reads may need to check multiple levels of sorted files before finding the answer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Zone maps&lt;/strong&gt; and &lt;strong&gt;min/max indexes&lt;/strong&gt; are the columnar engine&apos;s answer. Store the minimum and maximum value for each data block. When a query filters on that column, skip every block whose range does not overlap the filter. No write overhead, but only useful for scan-heavy workloads.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/read-write-tradeoff.png&quot; alt=&quot;The read-write tradeoff showing how sorted files with dense indexes optimize reads while LSM trees and heap files optimize writes&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Planning and Executing the Query&lt;/h2&gt;
&lt;p&gt;Once data is stored and indexed, the engine must decide &lt;em&gt;how&lt;/em&gt; to answer your query.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Query optimizers&lt;/strong&gt; choose between candidate execution plans. A rule-based optimizer applies fixed transformations (push filters down, drop unused columns). A cost-based optimizer estimates the cost of multiple plans using table statistics and picks the cheapest one. Spark&apos;s Adaptive Query Execution goes further: it monitors actual data sizes during execution and changes the plan mid-flight.&lt;/p&gt;
&lt;p&gt;The tradeoff: more planning time can find dramatically better plans, but the planning itself has a cost. For a simple point lookup, an elaborate cost-based search is wasted effort. For a complex 12-table join, skipping cost-based optimization can produce a plan that is 100x slower than necessary.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Execution models&lt;/strong&gt; determine how the CPU processes data through the plan:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Volcano (iterator)&lt;/strong&gt;: Each operator passes one row at a time via &lt;code&gt;Next()&lt;/code&gt; calls. Simple, modular, but millions of virtual function calls waste CPU cycles on large datasets. PostgreSQL uses this model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vectorized&lt;/strong&gt;: Each &lt;code&gt;Next()&lt;/code&gt; call returns a batch of rows (e.g., 1024). Tight inner loops process one column at a time, exploiting CPU SIMD instructions. DuckDB, ClickHouse, and Dremio use this approach.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code generation&lt;/strong&gt;: Fuse multiple operators into a single compiled function. Eliminate operator abstraction entirely. Apache Spark&apos;s Tungsten engine uses whole-stage code generation.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Memory, Distribution, and Concurrency&lt;/h2&gt;
&lt;p&gt;The remaining decisions shape how the engine handles scale and contention.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Memory management&lt;/strong&gt; is a balancing act. More RAM for caching means faster repeated reads. More RAM for sort buffers and hash tables means faster query processing. The engine cannot maximize both. Traditional databases use buffer pools that pin frequently accessed pages in memory. Analytical engines use column-level caches and result caches.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Partitioning&lt;/strong&gt; determines how data is divided. Hash partitioning distributes data evenly but makes range scans expensive. Range partitioning makes range scans fast but creates hotspots when keys are skewed. The optimizer&apos;s ability to skip irrelevant partitions (partition pruning) is often the single biggest performance win in large-scale systems.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Join algorithms&lt;/strong&gt; in distributed systems face a fundamental problem: data for a join may live on different nodes. A shuffle join re-distributes both tables by the join key across the network. A broadcast join copies the small table to every node. A co-located join requires that both tables were pre-partitioned by the same key, avoiding data movement entirely.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Concurrency control&lt;/strong&gt; decides what happens when multiple transactions touch the same data. Two-phase locking (2PL) is safe but limits throughput because readers block writers. MVCC (Multi-Version Concurrency Control) keeps multiple row versions so readers see a consistent snapshot without blocking writers. Most modern systems, from &lt;a href=&quot;https://www.postgresql.org/docs/current/mvcc.html&quot;&gt;PostgreSQL&lt;/a&gt; to Dremio and Snowflake, use MVCC or snapshot-based isolation.&lt;/p&gt;
&lt;h2&gt;The OLTP-OLAP Spectrum&lt;/h2&gt;
&lt;p&gt;All nine decisions converge on one fundamental axis: is this system built for transactions or analytics?&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/query-engine-optimization/oltp-olap-spectrum.png&quot; alt=&quot;Where real-world database systems land on the OLTP to OLAP design spectrum&quot;&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;OLTP Optimization&lt;/th&gt;
&lt;th&gt;OLAP Optimization&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;Row-oriented&lt;/td&gt;
&lt;td&gt;Column-oriented&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Indexing&lt;/td&gt;
&lt;td&gt;B-trees&lt;/td&gt;
&lt;td&gt;Zone maps, bloom filters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Access pattern&lt;/td&gt;
&lt;td&gt;Point lookups, small updates&lt;/td&gt;
&lt;td&gt;Full scans, aggregations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Execution&lt;/td&gt;
&lt;td&gt;Volcano (row-at-a-time)&lt;/td&gt;
&lt;td&gt;Vectorized / compiled&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Concurrency&lt;/td&gt;
&lt;td&gt;High (MVCC, fine-grained locks)&lt;/td&gt;
&lt;td&gt;Low (batch loads, snapshots)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distribution&lt;/td&gt;
&lt;td&gt;Sharding by primary key&lt;/td&gt;
&lt;td&gt;MPP with shuffle joins&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;No single system dominates both ends. Systems that attempt Hybrid Transactional/Analytical Processing (HTAP) make explicit compromises, typically maintaining two internal storage formats and routing queries to whichever is more appropriate.&lt;/p&gt;
&lt;h2&gt;Where This Series Goes Next&lt;/h2&gt;
&lt;p&gt;This overview is the map. The nine articles that follow are the territory:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Row vs. Column Storage&lt;/strong&gt;: how physical byte layout determines which queries are fast&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Organization on Disk&lt;/strong&gt;: pages, blocks, file formats, and metadata&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Indexing Strategies&lt;/strong&gt;: B-trees, LSM trees, bitmap indexes, and bloom filters&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Query Optimizer Internals&lt;/strong&gt;: cost-based planning, cardinality estimation, and adaptive execution&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Execution Models&lt;/strong&gt;: Volcano, vectorized, compiled, and morsel-driven parallelism&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory Management and Caching&lt;/strong&gt;: buffer pools, cache eviction, and spill strategies&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partitioning and Data Distribution&lt;/strong&gt;: hash, range, bucketing, and partition pruning&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Distributed Join Algorithms&lt;/strong&gt;: shuffle, broadcast, co-located, and the cost of data movement&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Concurrency and Isolation&lt;/strong&gt;: locks, MVCC, isolation levels, and optimistic concurrency&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each article covers one decision area with diagrams, real-world system examples, and the specific tradeoffs involved. The goal is not to pick winners but to understand how the engineers who build these systems think through the problems.&lt;/p&gt;
&lt;h3&gt;Books to Go Deeper&lt;/h3&gt;
&lt;p&gt;If you want to go further into the systems that implement these patterns, check out these resources:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/&quot;&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/&quot;&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/&quot;&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/&quot;&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/&quot;&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for Zed: A Complete Guide to the High-Performance AI Code Editor</title><link>https://iceberglakehouse.com/posts/2026-03-context-zed/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-zed/</guid><description>
Zed is a high-performance code editor built in Rust that prioritizes speed, simplicity, and real-time collaboration. Its AI integration is designed t...</description><pubDate>Sat, 07 Mar 2026 23:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Zed is a high-performance code editor built in Rust that prioritizes speed, simplicity, and real-time collaboration. Its AI integration is designed to be fast and unobtrusive, with context management built around an assistant panel, inline transformations, slash commands, and a flexible provider system that supports multiple AI services. What sets Zed apart from other AI editors is its focus on performance (everything runs natively, not in Electron) and its built-in multiplayer editing that extends to AI interactions.&lt;/p&gt;
&lt;p&gt;This guide covers how to manage context effectively in Zed&apos;s AI features to get the most from its lightweight but capable AI integration.&lt;/p&gt;
&lt;h2&gt;How Zed Manages Context&lt;/h2&gt;
&lt;p&gt;Zed builds AI context from several sources:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Assistant panel&lt;/strong&gt; - a dedicated panel for multi-turn conversations with persistent context threads&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inline transformations&lt;/strong&gt; - context-aware edits triggered in the editor&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Slash commands&lt;/strong&gt; - special commands that inject structured context into prompts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Active buffers&lt;/strong&gt; - files currently open in the editor&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Project structure&lt;/strong&gt; - the workspace file tree&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom prompts library&lt;/strong&gt; - saved, reusable prompt templates&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Language server data&lt;/strong&gt; - type information and diagnostics from LSPs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MCP servers&lt;/strong&gt; - external tool connections (supported in recent versions)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Zed takes a minimalist approach to context management: rather than automatically indexing your entire codebase (like Cursor or Windsurf), it gives you explicit control over what goes into context through slash commands and file references.&lt;/p&gt;
&lt;h2&gt;The Assistant Panel: Structured Conversations&lt;/h2&gt;
&lt;p&gt;Zed&apos;s Assistant Panel is the primary interface for AI interactions that require context beyond the current file. It operates as a structured conversation where you build context explicitly.&lt;/p&gt;
&lt;h3&gt;How the Panel Works&lt;/h3&gt;
&lt;p&gt;The panel displays a conversation thread where each message can include code blocks, file references, and slash command outputs. You compose messages, include context, and receive AI responses in a single, reviewable flow.&lt;/p&gt;
&lt;h3&gt;Persistent Context Threads&lt;/h3&gt;
&lt;p&gt;Each conversation in the panel is a persistent thread. You can name threads, save them, and return to them later. This means you can maintain ongoing conversations about specific features or architectural decisions without losing context between sessions.&lt;/p&gt;
&lt;h3&gt;Including Code from Open Buffers&lt;/h3&gt;
&lt;p&gt;You can drag files or code selections into the assistant panel to include them as context. This explicit inclusion model means you always know exactly what context the AI is working with, unlike tools that silently assemble context behind the scenes.&lt;/p&gt;
&lt;h2&gt;Zed&apos;s Explicit Context Philosophy&lt;/h2&gt;
&lt;p&gt;Zed&apos;s approach to context management is fundamentally different from editors like Cursor or Windsurf that automatically index and retrieve context. In Zed, you explicitly choose what context to provide through slash commands and file inclusions. This has important tradeoffs:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Advantages of explicit context:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You always know what the AI is working with&lt;/li&gt;
&lt;li&gt;No surprises from irrelevant code being included&lt;/li&gt;
&lt;li&gt;Works well with smaller model context windows (no wasted tokens)&lt;/li&gt;
&lt;li&gt;Context is reproducible: the same slash commands always produce the same context&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Tradeoffs:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Requires more manual effort to set up context&lt;/li&gt;
&lt;li&gt;You need to know which files are relevant before asking&lt;/li&gt;
&lt;li&gt;The AI cannot discover related code on its own (unlike @codebase in other editors)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Understanding this philosophy helps you use Zed&apos;s AI features effectively: invest time in selecting the right context rather than expecting the editor to figure it out for you.&lt;/p&gt;
&lt;h2&gt;Real-Time Collaboration and AI&lt;/h2&gt;
&lt;p&gt;Zed&apos;s built-in multiplayer editing extends to AI interactions. When collaborating in a shared workspace:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Multiple developers can contribute to the same assistant panel conversation&lt;/li&gt;
&lt;li&gt;One developer can set up the context while another frames the question&lt;/li&gt;
&lt;li&gt;AI suggestions can be reviewed and discussed collaboratively in real time&lt;/li&gt;
&lt;li&gt;The AI&apos;s output is visible to all participants simultaneously&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This makes Zed uniquely suited for pair programming and team code review workflows that incorporate AI assistance.&lt;/p&gt;
&lt;h2&gt;Slash Commands: Explicit Context Injection&lt;/h2&gt;
&lt;p&gt;Slash commands are Zed&apos;s primary mechanism for injecting specific types of context into AI conversations.&lt;/p&gt;
&lt;h3&gt;Available Slash Commands&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/file [path]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include a specific file&apos;s content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/tab&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include all currently open tabs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/diagnostics&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include current LSP errors and warnings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/search [query]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Search the project and include results&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/prompt [name]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Load a saved prompt template&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/now&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include the current date and time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/fetch [url]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fetch and include content from a URL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Using Slash Commands Effectively&lt;/h3&gt;
&lt;p&gt;The power of slash commands is precision. Instead of sending your entire codebase as context, you choose exactly which files and information are relevant:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/file src/auth/middleware.ts
/file src/auth/types.ts
/diagnostics

I need to fix the TypeScript errors in the auth middleware.
The types file defines the expected interfaces.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This focused approach produces better results than sending the AI a vague question against a massive context window. Each piece of context is intentional and relevant.&lt;/p&gt;
&lt;h3&gt;/diagnostics for Error-Driven Context&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;/diagnostics&lt;/code&gt; command is particularly powerful because it pulls language server errors and warnings directly into the AI conversation. Instead of manually copying error messages, one command gives the AI structured diagnostic information.&lt;/p&gt;
&lt;h3&gt;/fetch for External Documentation&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;/fetch&lt;/code&gt; command retrieves content from URLs, making it easy to include external documentation, API specifications, or reference material without manual copying:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/fetch https://docs.myframework.com/api/routing

How do I implement nested routing using this framework&apos;s API?
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Custom Prompts Library&lt;/h2&gt;
&lt;p&gt;Zed maintains a library of saved prompts that you can reuse across conversations and projects.&lt;/p&gt;
&lt;h3&gt;Creating Custom Prompts&lt;/h3&gt;
&lt;p&gt;Navigate to the prompts library and create templates for common tasks:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Code Review Template

Review the provided code for:

1. Security vulnerabilities (injection, XSS, CSRF)
2. Performance issues (N+1 queries, unnecessary allocations)
3. Error handling completeness
4. Type safety issues
5. Missing edge cases

For each issue found:

- Describe the problem
- Explain the risk
- Provide a fix
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Using Prompts&lt;/h3&gt;
&lt;p&gt;Load a saved prompt with the &lt;code&gt;/prompt&lt;/code&gt; slash command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/prompt code-review
/file src/api/users.ts
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This combines your predefined review criteria with the specific file, creating a structured, repeatable workflow.&lt;/p&gt;
&lt;h3&gt;When to Create Prompts&lt;/h3&gt;
&lt;p&gt;Create prompts for tasks you perform regularly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Code reviews with consistent criteria&lt;/li&gt;
&lt;li&gt;Documentation generation in a specific format&lt;/li&gt;
&lt;li&gt;Refactoring with specific patterns (extract function, apply interface)&lt;/li&gt;
&lt;li&gt;Test generation following your testing conventions&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;AI Provider Configuration&lt;/h2&gt;
&lt;p&gt;Zed supports multiple AI providers, giving you flexibility in model selection:&lt;/p&gt;
&lt;h3&gt;Supported Providers&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Anthropic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API key in settings&lt;/td&gt;
&lt;td&gt;Claude models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API key in settings&lt;/td&gt;
&lt;td&gt;GPT models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ollama&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Local endpoint&lt;/td&gt;
&lt;td&gt;Private, local models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API key in settings&lt;/td&gt;
&lt;td&gt;Gemini models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenRouter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API key in settings&lt;/td&gt;
&lt;td&gt;Multi-provider routing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Custom&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any OpenAI-compatible endpoint&lt;/td&gt;
&lt;td&gt;Self-hosted models&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Context Window Implications&lt;/h3&gt;
&lt;p&gt;Different providers offer different context window sizes. With Zed&apos;s explicit context management (where you choose what to include via slash commands), you have good visibility into how much context you are using. If you are working with a smaller model through Ollama, be more selective with your slash commands. With a large cloud model, you can include more files.&lt;/p&gt;
&lt;h3&gt;Configuring in settings.json&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;language_model&amp;quot;: {
    &amp;quot;provider&amp;quot;: &amp;quot;anthropic&amp;quot;,
    &amp;quot;model&amp;quot;: &amp;quot;claude-sonnet-4-20250514&amp;quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Inline Transformations&lt;/h2&gt;
&lt;p&gt;For quick edits that do not require a full conversation, Zed&apos;s inline transformation feature lets you select code and apply AI-powered changes directly in the editor.&lt;/p&gt;
&lt;h3&gt;How It Works&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Select code in the editor&lt;/li&gt;
&lt;li&gt;Trigger the inline transformation (keyboard shortcut)&lt;/li&gt;
&lt;li&gt;Type your instruction (&amp;quot;Add error handling&amp;quot; or &amp;quot;Convert to async/await&amp;quot;)&lt;/li&gt;
&lt;li&gt;Zed applies the change inline&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Context for Inline Transformations&lt;/h3&gt;
&lt;p&gt;Inline transformations use a focused context: the current file, the selection, and your instruction. They do not load your custom prompts or conversation history. This makes them fast and appropriate for small, self-contained changes.&lt;/p&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;Recent versions of Zed support MCP for connecting to external tools. The implementation follows the standard MCP pattern: configure servers in settings, and their tools become available within the assistant panel.&lt;/p&gt;
&lt;h3&gt;Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;context_servers&amp;quot;: {
    &amp;quot;postgres&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@anthropic/mcp-server-postgres&amp;quot;]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When to Use MCP in Zed&lt;/h3&gt;
&lt;p&gt;MCP is most useful when the assistant needs live data (database schemas, API responses, running service status) that cannot be obtained from static files. For code-only tasks, the slash commands and file references are sufficient.&lt;/p&gt;
&lt;h2&gt;External Documents: PDFs vs. Markdown&lt;/h2&gt;
&lt;h3&gt;Markdown for Everything You Control&lt;/h3&gt;
&lt;p&gt;Prompts, reference documents, and coding standards should be Markdown. Zed&apos;s prompt library and slash commands work natively with text-based formats.&lt;/p&gt;
&lt;h3&gt;PDFs&lt;/h3&gt;
&lt;p&gt;Zed does not have built-in PDF parsing. For reference material in PDF form, extract relevant sections into Markdown files in your project and reference them with &lt;code&gt;/file&lt;/code&gt;. Alternatively, use &lt;code&gt;/fetch&lt;/code&gt; if the content is available online.&lt;/p&gt;
&lt;h2&gt;Thinking About Context Levels in Zed&lt;/h2&gt;
&lt;h3&gt;Minimal Context (Inline Edits)&lt;/h3&gt;
&lt;p&gt;Select code, trigger inline transformation, describe the change. The current file and selection provide sufficient context for small changes.&lt;/p&gt;
&lt;h3&gt;Moderate Context (Feature Work)&lt;/h3&gt;
&lt;p&gt;Use the assistant panel with targeted slash commands: &lt;code&gt;/file&lt;/code&gt; for relevant files, &lt;code&gt;/diagnostics&lt;/code&gt; for current errors, &lt;code&gt;/prompt&lt;/code&gt; for your coding standards.&lt;/p&gt;
&lt;h3&gt;Comprehensive Context (Architecture)&lt;/h3&gt;
&lt;p&gt;Include multiple files via &lt;code&gt;/file&lt;/code&gt; or &lt;code&gt;/tab&lt;/code&gt;, load architecture documentation via &lt;code&gt;/fetch&lt;/code&gt;, and load your team&apos;s conventions via &lt;code&gt;/prompt&lt;/code&gt;. Build the context explicitly and review it before asking complex questions.&lt;/p&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Multi-File Context Pattern&lt;/h3&gt;
&lt;p&gt;For changes that span multiple files:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/file src/models/user.ts
/file src/services/userService.ts
/file src/routes/users.ts
/file tests/services/userService.test.ts

Add a &amp;quot;preferences&amp;quot; field to the User model and propagate it through the service layer, API routes, and tests.
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;The Diagnostic-Driven Fix Pattern&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Run your build or test suite&lt;/li&gt;
&lt;li&gt;Open the assistant panel&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/diagnostics&lt;/code&gt; to load all current errors&lt;/li&gt;
&lt;li&gt;Ask the AI to fix the errors systematically&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;The Collaborative AI Pattern&lt;/h3&gt;
&lt;p&gt;Zed&apos;s multiplayer features mean multiple developers can collaborate in real time while using AI. One developer can set up the context (load files, configure the prompt) while another reviews the AI&apos;s output. This collaborative workflow is unique to Zed and makes it particularly effective for pair programming with AI assistance.&lt;/p&gt;
&lt;h3&gt;The Speed-Focused Workflow&lt;/h3&gt;
&lt;p&gt;For developers who prioritize responsiveness:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use Ollama with a fast local model for inline transformations&lt;/li&gt;
&lt;li&gt;Use a cloud model for assistant panel conversations that need more capability&lt;/li&gt;
&lt;li&gt;Keep assistant conversations focused and short&lt;/li&gt;
&lt;li&gt;Use inline transformations for most edits, reserving the panel for complex tasks&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Over-including context.&lt;/strong&gt; Zed gives you explicit control over context. Use it wisely. Including every file in your project via &lt;code&gt;/tab&lt;/code&gt; when only 2 files are relevant dilutes the AI&apos;s focus.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not using saved prompts.&lt;/strong&gt; If you repeat the same instructions across conversations, save them as prompts. One &lt;code&gt;/prompt code-review&lt;/code&gt; is better than retyping your review criteria every time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring /diagnostics.&lt;/strong&gt; This command provides structured error context that is faster and more accurate than manually pasting error messages.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Using the assistant panel for simple edits.&lt;/strong&gt; Inline transformations are faster and require less context setup. Use the panel for complex, multi-file work.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not exploring provider options.&lt;/strong&gt; If response quality is not meeting expectations, try a different model. Zed&apos;s multi-provider support makes switching easy.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Forgetting /fetch for documentation.&lt;/strong&gt; External docs can be pulled directly into context without leaving the editor. This is faster and more reliable than manually copying content.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about AI-assisted development and context management strategies, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for Windsurf: A Complete Guide to the AI Flow IDE</title><link>https://iceberglakehouse.com/posts/2026-03-context-windsurf/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-windsurf/</guid><description>
Windsurf is an AI-powered IDE built on the VS Code foundation that introduces the concept of &quot;Flows,&quot; a paradigm where the AI maintains deep awarenes...</description><pubDate>Sat, 07 Mar 2026 22:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Windsurf is an AI-powered IDE built on the VS Code foundation that introduces the concept of &amp;quot;Flows,&amp;quot; a paradigm where the AI maintains deep awareness of your actions, codebase, and development patterns over time. Its context management differentiates from other editors through Cascade (its agentic coding assistant), persistent Rules files, Memories, and a sophisticated context engine that tracks not just what files you are editing, but how you work.&lt;/p&gt;
&lt;p&gt;This guide covers every context management mechanism in Windsurf and explains how to configure them for the most productive development experience.&lt;/p&gt;
&lt;h2&gt;How Windsurf Manages Context&lt;/h2&gt;
&lt;p&gt;Windsurf assembles context through multiple layers:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Cascade context engine&lt;/strong&gt; - tracks your edits, terminal commands, and navigation patterns in real time&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rules files&lt;/strong&gt; - project and global instructions that shape AI behavior&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memories&lt;/strong&gt; - persistent facts that carry across sessions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Workspace index&lt;/strong&gt; - semantic index of your codebase&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Conversation context&lt;/strong&gt; - the current chat session in Cascade&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Active editor state&lt;/strong&gt; - the file you are editing, your cursor position, selected text&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MCP server connections&lt;/strong&gt; - external tools and data sources&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The &amp;quot;Flows&amp;quot; concept means Windsurf&apos;s AI is not just responding to individual prompts. It maintains a continuous understanding of what you are doing, which enables more relevant suggestions and fewer context-setting instructions from you.&lt;/p&gt;
&lt;h2&gt;Rules Files: Persistent Project Instructions&lt;/h2&gt;
&lt;p&gt;Windsurf uses Rules files to define project-level and global instructions for the AI.&lt;/p&gt;
&lt;h3&gt;Global Rules&lt;/h3&gt;
&lt;p&gt;Set in Windsurf Settings under AI &amp;gt; Rules, global rules apply across all projects:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Global Rules

## My Preferences

- Always use TypeScript over JavaScript
- Prefer functional programming patterns
- Use descriptive variable names (no single-letter variables except in loops)
- Add JSDoc comments to all exported functions

## Communication Style

- Be direct and concise
- Show code changes as diffs when possible
- Explain non-obvious design decisions
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Project Rules (Workspace)&lt;/h3&gt;
&lt;p&gt;Create project-level rules in your workspace:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;.windsurfrules&lt;/code&gt; file in the project root&lt;/li&gt;
&lt;li&gt;Or &lt;code&gt;.windsurf/rules/&lt;/code&gt; directory with multiple rule files&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Project: E-Commerce Platform

## Stack

- Next.js 15 with App Router
- TypeScript 5.6
- PostgreSQL with Prisma ORM
- Tailwind CSS 4
- Vitest for testing

## Architecture

- app/ contains page routes and layouts
- lib/ contains shared utilities and API clients
- components/ contains UI components (Atomic Design: atoms, molecules, organisms)
- prisma/ contains schema and migrations

## Conventions

- Server Components by default, Client Components only when necessary
- Use Zod for all input validation
- API routes use the route handler pattern with error boundaries
- All database queries go through Prisma transactions for writes

## Testing

- Every new component needs a unit test
- API routes need integration tests with a test database
- Use MSW for mocking external API calls
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Rules Best Practices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Put rules files in version control so the entire team follows the same conventions&lt;/li&gt;
&lt;li&gt;Keep rules actionable and specific, not aspirational&lt;/li&gt;
&lt;li&gt;Include negative constraints (&amp;quot;Do not use inline styles&amp;quot;)&lt;/li&gt;
&lt;li&gt;Update rules when you change frameworks, libraries, or conventions&lt;/li&gt;
&lt;li&gt;Separate global preferences from project-specific rules&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Memories: Persistent Knowledge&lt;/h2&gt;
&lt;p&gt;Windsurf&apos;s Memory system stores facts that persist across conversations and sessions. Memories can be created automatically (when the AI identifies important information during a conversation) or manually.&lt;/p&gt;
&lt;h3&gt;How Memories Work&lt;/h3&gt;
&lt;p&gt;When you share something important in a conversation (&amp;quot;We decided to switch from REST to GraphQL for the new API&amp;quot;), Windsurf can save this as a Memory. In future sessions, the AI loads relevant Memories to maintain continuity.&lt;/p&gt;
&lt;h3&gt;Managing Memories&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;View all Memories in Windsurf Settings&lt;/li&gt;
&lt;li&gt;Delete outdated Memories that no longer apply&lt;/li&gt;
&lt;li&gt;Manually add Memories for important decisions the AI should always remember&lt;/li&gt;
&lt;li&gt;Review periodically to keep the memory store accurate&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Memories vs. Rules&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Rules&lt;/th&gt;
&lt;th&gt;Memories&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Creation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You write them explicitly&lt;/td&gt;
&lt;td&gt;Created during conversations or manually&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Global or project-level&lt;/td&gt;
&lt;td&gt;Cross-project&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Define conventions and constraints&lt;/td&gt;
&lt;td&gt;Store facts and decisions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Update frequency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;When conventions change&lt;/td&gt;
&lt;td&gt;As new decisions are made&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Use Rules for standards and conventions. Use Memories for facts and decisions.&lt;/p&gt;
&lt;h2&gt;Cascade: The Agentic AI Assistant&lt;/h2&gt;
&lt;p&gt;Cascade is Windsurf&apos;s agentic coding assistant. It operates in two modes with different context management implications:&lt;/p&gt;
&lt;h3&gt;Chat Mode&lt;/h3&gt;
&lt;p&gt;Standard conversational interaction where you ask questions and receive answers. Context includes the active file, conversation history, and any files you reference.&lt;/p&gt;
&lt;h3&gt;Agent Mode&lt;/h3&gt;
&lt;p&gt;Autonomous mode where Cascade plans and executes multi-step tasks. In Agent Mode, Cascade:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reads and writes files across your project&lt;/li&gt;
&lt;li&gt;Runs terminal commands&lt;/li&gt;
&lt;li&gt;Navigates and explores the codebase&lt;/li&gt;
&lt;li&gt;Creates and executes multi-file changes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Agent Mode benefits from more comprehensive context (Rules, Memories, workspace index) because it operates autonomously without constant guidance.&lt;/p&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;Windsurf supports MCP for connecting to external tools and data sources.&lt;/p&gt;
&lt;h3&gt;Configuration&lt;/h3&gt;
&lt;p&gt;Configure MCP servers through Windsurf Settings or in a configuration file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;database&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@anthropic/mcp-server-postgres&amp;quot;],
      &amp;quot;env&amp;quot;: {
        &amp;quot;DATABASE_URL&amp;quot;: &amp;quot;postgresql://dev@localhost:5432/mydb&amp;quot;
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When to Use MCP&lt;/h3&gt;
&lt;p&gt;Use MCP in Windsurf for the same scenarios as other IDE-based tools: database queries, GitHub integration, API testing, and browser automation. The integration is seamless because MCP tools become available within Cascade&apos;s agent mode.&lt;/p&gt;
&lt;h2&gt;Model Selection and Context Configuration&lt;/h2&gt;
&lt;p&gt;Windsurf supports multiple AI providers and models. Your model choice affects context management because different models handle different context window sizes and reasoning capabilities.&lt;/p&gt;
&lt;h3&gt;Configuring the AI Provider&lt;/h3&gt;
&lt;p&gt;In Windsurf Settings, you can select from multiple providers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Windsurf&apos;s own models&lt;/strong&gt; (optimized for the Windsurf context system)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Anthropic&lt;/strong&gt; (Claude Sonnet, Opus)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OpenAI&lt;/strong&gt; (GPT-4o, o3)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom endpoints&lt;/strong&gt; (any OpenAI-compatible API)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For complex refactoring that touches many files, choose a model with a larger context window. For quick completions and small edits, a faster model with a smaller window is more responsive.&lt;/p&gt;
&lt;h3&gt;Tab Completion Context&lt;/h3&gt;
&lt;p&gt;Windsurf&apos;s Tab completion (inline autocomplete) uses a separate context pipeline from Cascade. The completion context includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The current file content&lt;/li&gt;
&lt;li&gt;Recently edited files&lt;/li&gt;
&lt;li&gt;Import statements and type definitions&lt;/li&gt;
&lt;li&gt;Patterns from your codebase&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Understanding this separation matters because Tab completions are optimized for speed (low latency) while Cascade chat is optimized for depth (comprehensive reasoning). The context for each is assembled differently to match their respective use cases.&lt;/p&gt;
&lt;h2&gt;How Windsurf Assembles Context&lt;/h2&gt;
&lt;p&gt;When you interact with Cascade, Windsurf assembles context through this pipeline:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Load Rules&lt;/strong&gt;: Global rules first, then project rules from .windsurfrules&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Load Memories&lt;/strong&gt;: Retrieve relevant persistent facts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Include active editor state&lt;/strong&gt;: Current file, cursor position, selection&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Process @-commands&lt;/strong&gt;: Add referenced files, codebase search results, web results&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Add flow context&lt;/strong&gt;: Recent edits, terminal output, navigation patterns&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apply model constraints&lt;/strong&gt;: Trim to fit within the model&apos;s context window&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This pipeline runs automatically for every interaction. The more you invest in Rules and Memories, the more relevant the automatically assembled context becomes.&lt;/p&gt;
&lt;h2&gt;Onboarding a New Project to Windsurf&lt;/h2&gt;
&lt;p&gt;Here is a step-by-step process for setting up effective context management on a new project:&lt;/p&gt;
&lt;h3&gt;Day 1: Foundation&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Open the project in Windsurf and let the workspace indexing complete&lt;/li&gt;
&lt;li&gt;Create a &lt;code&gt;.windsurfrules&lt;/code&gt; file with your stack, architecture, and conventions&lt;/li&gt;
&lt;li&gt;Make a few small changes to verify Windsurf follows your conventions&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Day 2: Refinement&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Review what Memories Windsurf created from Day 1&lt;/li&gt;
&lt;li&gt;Add any important project facts as manual Memories&lt;/li&gt;
&lt;li&gt;Adjust Rules based on how Cascade behaved on Day 1&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Week 2: Advanced Setup&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Connect relevant MCP servers (database, GitHub)&lt;/li&gt;
&lt;li&gt;Index external documentation for @docs references&lt;/li&gt;
&lt;li&gt;Start using Agent Mode for multi-file changes&lt;/li&gt;
&lt;li&gt;Create directory-specific rules if different modules have different conventions&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Thinking About Context Levels&lt;/h2&gt;
&lt;h3&gt;Quick Edits (Minimal Context)&lt;/h3&gt;
&lt;p&gt;Use inline editing (Cmd+K / Ctrl+K) for small changes. Windsurf uses the current file and selection, plus applicable Rules, to generate edits. No additional context needed.&lt;/p&gt;
&lt;h3&gt;Feature Development (Moderate Context)&lt;/h3&gt;
&lt;p&gt;Use Cascade chat with explicit file references. The workspace index, Rules, and Memories combine to give Cascade project-aware responses.&lt;/p&gt;
&lt;h3&gt;Complex Architecture Work (Comprehensive Context)&lt;/h3&gt;
&lt;p&gt;Use Agent Mode with well-configured Rules, active Memories, and MCP connections. Let Cascade explore the codebase, run commands, and make changes across multiple files.&lt;/p&gt;
&lt;h2&gt;@ Commands for Context Injection&lt;/h2&gt;
&lt;p&gt;Windsurf supports @-commands similar to Cursor for injecting specific context:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;@file&lt;/strong&gt; - Reference a specific file&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;@codebase&lt;/strong&gt; - Search the indexed codebase&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;@web&lt;/strong&gt; - Search the web for current information&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;@docs&lt;/strong&gt; - Reference indexed documentation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;@terminal&lt;/strong&gt; - Include terminal output context&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These commands give you fine-grained control over what context Cascade receives for each prompt.&lt;/p&gt;
&lt;h2&gt;External Documents: PDFs vs. Markdown&lt;/h2&gt;
&lt;h3&gt;Markdown for Rules&lt;/h3&gt;
&lt;p&gt;All Rules files and project documentation should be Markdown. It is the native format for Windsurf&apos;s context system.&lt;/p&gt;
&lt;h3&gt;For Reference Material&lt;/h3&gt;
&lt;p&gt;For external specifications in PDF form, convert key sections to Markdown and include them in your project as reference documents. This makes them discoverable through @codebase searches.&lt;/p&gt;
&lt;h3&gt;Documentation Indexing&lt;/h3&gt;
&lt;p&gt;Like Cursor, Windsurf can index external documentation. Add framework and library docs to the indexed sources so @docs references return relevant, up-to-date information.&lt;/p&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Flow-Aware Development Pattern&lt;/h3&gt;
&lt;p&gt;Leverage Windsurf&apos;s flow tracking by working naturally:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Make edits in the editor (Windsurf tracks your changes)&lt;/li&gt;
&lt;li&gt;Run tests in the terminal (Windsurf observes the results)&lt;/li&gt;
&lt;li&gt;Ask Cascade a question (it already knows what you changed and what failed)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This removes the need to manually explain what you just did. Windsurf already knows.&lt;/p&gt;
&lt;h3&gt;The Rules-Layered Workflow&lt;/h3&gt;
&lt;p&gt;Combine global and project rules for comprehensive coverage:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Global rules: Your personal coding style and preferences&lt;/li&gt;
&lt;li&gt;Project rules: Team conventions and architecture decisions&lt;/li&gt;
&lt;li&gt;Directory-specific rules: Module-specific patterns&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;The Agent-Then-Review Pattern&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Describe the feature to Cascade in Agent Mode&lt;/li&gt;
&lt;li&gt;Let it plan and implement the changes&lt;/li&gt;
&lt;li&gt;Review each file change in the diff view&lt;/li&gt;
&lt;li&gt;Accept, reject, or modify individual changes&lt;/li&gt;
&lt;li&gt;Ask Cascade to adjust based on your feedback&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This uses Agent Mode for speed while maintaining human oversight through the review step.&lt;/p&gt;
&lt;h3&gt;The Memory-Driven Continuity Pattern&lt;/h3&gt;
&lt;p&gt;At the end of each working session, review what Windsurf has stored as Memories. Add any important decisions or discoveries that were not automatically captured. At the start of the next session, Cascade starts with a richer understanding of your project.&lt;/p&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not setting up Rules files.&lt;/strong&gt; Without them, Cascade applies generic conventions. Project-specific Rules are the highest-impact configuration.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring Memories.&lt;/strong&gt; Stale Memories mislead the AI. Review and clean them periodically.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Underusing Agent Mode.&lt;/strong&gt; For multi-file changes, Agent Mode is dramatically faster than chat-based interactions. Trust it for structural changes and review the results.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Over-specifying context in prompts.&lt;/strong&gt; If your Rules and Memories are well-configured, you do not need to re-explain your conventions in every prompt.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not leveraging flow awareness.&lt;/strong&gt; Windsurf tracks your actions. Instead of explaining what you just did, ask questions that build on your recent work.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Skipping @codebase for exploration.&lt;/strong&gt; When you are unsure which files are relevant, @codebase search is more efficient than manually navigating the project tree.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about AI-assisted development and context management strategies, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for Perplexity AI: A Complete Guide to Research-First AI Conversations</title><link>https://iceberglakehouse.com/posts/2026-03-context-perplexity/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-perplexity/</guid><description>
Perplexity AI occupies a unique position in the AI landscape: it is a research-first tool that combines conversational AI with real-time web search t...</description><pubDate>Sat, 07 Mar 2026 21:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Perplexity AI occupies a unique position in the AI landscape: it is a research-first tool that combines conversational AI with real-time web search to produce answers grounded in current sources. Unlike coding-focused tools or general chatbots, Perplexity is built for information retrieval, analysis, and synthesis. Its context management is designed around Spaces (persistent research workspaces), Focus Modes (search scope control), and an elastic context window that adapts to the complexity of your query.&lt;/p&gt;
&lt;p&gt;This guide covers how to manage context effectively in Perplexity for everything from quick fact-checking to sustained research projects.&lt;/p&gt;
&lt;h2&gt;How Perplexity Manages Context&lt;/h2&gt;
&lt;p&gt;Perplexity builds context from several sources:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Web search results&lt;/strong&gt; - real-time retrieval of current information&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spaces&lt;/strong&gt; - persistent workspaces with uploaded files and custom instructions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Focus Modes&lt;/strong&gt; - filters that control which sources are searched&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Conversation history&lt;/strong&gt; - the thread of questions and answers in the current session&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Uploaded files&lt;/strong&gt; - documents you provide for analysis&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory&lt;/strong&gt; - persistent facts the system remembers about you (enterprise plans)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The key difference from other AI tools is that Perplexity actively searches the web for every query by default. This means its context combines your instructions and uploaded files with fresh, real-time information from the internet, producing answers with citations that you can verify.&lt;/p&gt;
&lt;h2&gt;Spaces: Persistent Research Workspaces&lt;/h2&gt;
&lt;p&gt;Spaces are Perplexity&apos;s equivalent of Projects in other tools. A Space groups related conversations, files, and instructions into a persistent workspace.&lt;/p&gt;
&lt;h3&gt;Creating a Space&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Spaces&lt;/strong&gt; in the sidebar&lt;/li&gt;
&lt;li&gt;Create a new Space with a descriptive name&lt;/li&gt;
&lt;li&gt;Add &lt;strong&gt;Custom Instructions&lt;/strong&gt;: Guidelines that shape every response in this Space&lt;/li&gt;
&lt;li&gt;Upload &lt;strong&gt;files&lt;/strong&gt;: PDFs, documents, spreadsheets, and other reference material&lt;/li&gt;
&lt;li&gt;Set the &lt;strong&gt;default Focus Mode&lt;/strong&gt; for the Space&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Space Instructions&lt;/h3&gt;
&lt;p&gt;Instructions in a Space function like a system prompt for every conversation within it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Research Space: Renewable Energy Markets

## Role

You are a market research assistant focused on renewable energy.

## Requirements

- Cite all claims with sources less than 6 months old
- Include market size and growth rate data when available
- Compare data across geographic regions when relevant
- Flag any statistics from sources over 1 year old

## Format

- Use structured sections with clear headers
- Include a &amp;quot;Sources&amp;quot; section at the end of every response
- Present data in tables when comparing multiple items
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;File Uploads in Spaces&lt;/h3&gt;
&lt;p&gt;Spaces support various file types for persistent reference:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File Type&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PDF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Research papers, reports, whitepapers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Documents&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Analysis templates, style guides&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Spreadsheets&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data for analysis and comparison&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Text/Markdown&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Notes, outlines, custom context documents&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Files in a Space are available across all conversations in that Space. This means you upload a report once and can reference it in every subsequent conversation.&lt;/p&gt;
&lt;h3&gt;When to Create a Space&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;You are researching a topic across multiple sessions&lt;/li&gt;
&lt;li&gt;You have reference documents you want the AI to consult alongside web results&lt;/li&gt;
&lt;li&gt;You need consistent response formatting and focus&lt;/li&gt;
&lt;li&gt;You are working on a project that requires accumulating research over time&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Focus Modes: Controlling Search Scope&lt;/h2&gt;
&lt;p&gt;Focus Modes let you control where Perplexity searches for information:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Focus Mode&lt;/th&gt;
&lt;th&gt;Sources&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;All&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Entire web&lt;/td&gt;
&lt;td&gt;General research, broad questions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Academic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google Scholar, research databases&lt;/td&gt;
&lt;td&gt;Scientific research, literature reviews&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Writing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No web search (uses training data)&lt;/td&gt;
&lt;td&gt;Content creation, drafting, brainstorming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Math&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Computation-focused, mathematical sources&lt;/td&gt;
&lt;td&gt;Calculations, proofs, statistical analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Video&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;YouTube and video platforms&lt;/td&gt;
&lt;td&gt;Tutorial discovery, visual explanations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Social&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reddit, forums, social platforms&lt;/td&gt;
&lt;td&gt;Community opinions, user experiences, discussions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Using Focus Modes as Context Filters&lt;/h3&gt;
&lt;p&gt;Focus Modes are a form of context management because they determine what kind of information reaches the model. Choosing the right Focus Mode prevents irrelevant results from diluting the response:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Researching a technical specification?&lt;/strong&gt; Use &amp;quot;All&amp;quot; for comprehensive coverage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Writing a literature review?&lt;/strong&gt; Use &amp;quot;Academic&amp;quot; to prioritize peer-reviewed sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looking for real-world experiences?&lt;/strong&gt; Use &amp;quot;Social&amp;quot; to surface personal accounts and community discussions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Drafting text without needing web data?&lt;/strong&gt; Use &amp;quot;Writing&amp;quot; to focus on generation rather than retrieval&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Switching Focus Modes Mid-Research&lt;/h3&gt;
&lt;p&gt;You can switch Focus Modes within a Space. A common pattern:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Start with &amp;quot;Academic&amp;quot; to find foundational research&lt;/li&gt;
&lt;li&gt;Switch to &amp;quot;All&amp;quot; for industry reports and market data&lt;/li&gt;
&lt;li&gt;Use &amp;quot;Social&amp;quot; to gauge public perception and user experiences&lt;/li&gt;
&lt;li&gt;Switch to &amp;quot;Writing&amp;quot; to draft your synthesis&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each mode shapes the context differently, giving you control over the type of information the model works with.&lt;/p&gt;
&lt;h2&gt;Thinking About Context Levels&lt;/h2&gt;
&lt;h3&gt;Quick Questions (Minimal Context)&lt;/h3&gt;
&lt;p&gt;For factual questions with clear answers, just ask. Perplexity will search the web and return a sourced response:&lt;/p&gt;
&lt;p&gt;&amp;quot;What is the current market size of the global data analytics industry?&amp;quot;&lt;/p&gt;
&lt;p&gt;No Space, no file uploads, no special Focus Mode needed. Perplexity&apos;s default behavior handles this well.&lt;/p&gt;
&lt;h3&gt;Focused Research (Moderate Context)&lt;/h3&gt;
&lt;p&gt;For deeper exploration, create a Space with instructions and upload relevant reference material:&lt;/p&gt;
&lt;p&gt;&amp;quot;Based on the market report I uploaded and current web data, compare the growth trajectories of the three largest cloud providers in the data analytics space.&amp;quot;&lt;/p&gt;
&lt;p&gt;The combination of uploaded files (for baseline data) and web search (for current information) produces comprehensive analysis.&lt;/p&gt;
&lt;h3&gt;Extended Research Projects (Comprehensive Context)&lt;/h3&gt;
&lt;p&gt;For multi-week research projects, use a fully configured Space with detailed instructions, multiple uploaded documents, and strategic Focus Mode switching. Build on previous conversations by referencing insights from earlier threads.&lt;/p&gt;
&lt;h2&gt;Deep Research 2.0&lt;/h2&gt;
&lt;p&gt;Perplexity&apos;s Deep Research feature performs multi-step research autonomously. When you invoke Deep Research, the system:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Analyzes your question and creates a research plan&lt;/li&gt;
&lt;li&gt;Executes multiple web searches across diverse sources&lt;/li&gt;
&lt;li&gt;Reads and analyzes full articles (not just snippets)&lt;/li&gt;
&lt;li&gt;Synthesizes findings into a comprehensive report&lt;/li&gt;
&lt;li&gt;Provides structured output with citations for every claim&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Deep Research is available on Pro plans and uses significantly more compute than standard queries. The tradeoff is worth it for complex questions that require multi-source synthesis.&lt;/p&gt;
&lt;h3&gt;Context Management for Deep Research&lt;/h3&gt;
&lt;p&gt;Deep Research benefits from clear, specific prompts. Because the system executes autonomously, your initial prompt is the primary context it works from:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Less effective:&lt;/strong&gt; &amp;quot;Tell me about AI in healthcare&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;More effective:&lt;/strong&gt; &amp;quot;Research the current state of AI-powered diagnostic tools in radiology. Focus on: (1) FDA-approved systems as of 2026, (2) clinical accuracy compared to human radiologists, (3) adoption rates across US hospitals, and (4) barriers to wider adoption. Prioritize peer-reviewed sources and official regulatory data.&amp;quot;&lt;/p&gt;
&lt;p&gt;The specific prompt gives Deep Research a structured plan to follow, producing a more focused and useful report.&lt;/p&gt;
&lt;h2&gt;Structuring Prompts for Effective Context&lt;/h2&gt;
&lt;h3&gt;The Research Question Framework&lt;/h3&gt;
&lt;p&gt;Structure your prompts using this framework for best results:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Topic:&lt;/strong&gt; What are you researching?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scope:&lt;/strong&gt; What specific aspects matter?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sources:&lt;/strong&gt; What type of sources do you want?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Recency:&lt;/strong&gt; How current must the information be?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Format:&lt;/strong&gt; How should the response be structured?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Example: &amp;quot;Research [topic]. Focus on [scope]. Prioritize [source type] from [time period]. Present findings as [format].&amp;quot;&lt;/p&gt;
&lt;h3&gt;Follow-Up Strategies&lt;/h3&gt;
&lt;p&gt;Perplexity maintains conversation context within a thread. Use follow-ups strategically:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Drilling down:&lt;/strong&gt; &amp;quot;Tell me more about point 3 from your previous response&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pivoting:&lt;/strong&gt; &amp;quot;How does this compare to the European market?&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Validating:&lt;/strong&gt; &amp;quot;Find additional sources that support or contradict the statistics you cited&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Updating:&lt;/strong&gt; &amp;quot;What has changed on this topic in the last 3 months?&amp;quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each follow-up builds on the accumulated context of the conversation, producing progressively deeper analysis.&lt;/p&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;Perplexity supports local MCP (Model Context Protocol) servers on its macOS desktop application. This allows the AI to connect to external tools and data sources running on your local machine, extending its capabilities beyond web search.&lt;/p&gt;
&lt;h3&gt;How MCP Works in Perplexity&lt;/h3&gt;
&lt;p&gt;On the macOS app, you can configure local MCP servers that provide Perplexity with access to your file system, local databases, applications, and other services. This is configured through the app&apos;s settings. Remote MCP servers (cloud-based services) are planned for paid subscribers.&lt;/p&gt;
&lt;h3&gt;When MCP Adds Value&lt;/h3&gt;
&lt;p&gt;For most Perplexity use cases, web search is the primary context extension mechanism. MCP adds value when you need Perplexity to combine its web research capabilities with local data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Researching a topic while cross-referencing your local documents&lt;/li&gt;
&lt;li&gt;Analyzing data from a local database alongside web-sourced information&lt;/li&gt;
&lt;li&gt;Integrating with local development tools or APIs&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Web Search vs. MCP&lt;/h3&gt;
&lt;p&gt;Where other tools use MCP to reach databases or APIs, Perplexity&apos;s distinguishing feature is its web search capability. MCP complements this by adding local data access, but for most research workflows, Perplexity&apos;s web search provides the primary context extension. If you need extensive MCP functionality (writing code, managing databases, interacting with multiple external services), pair Perplexity with a coding-focused tool like Claude Desktop or Cursor.&lt;/p&gt;
&lt;h2&gt;External Documents: PDFs vs. Markdown&lt;/h2&gt;
&lt;h3&gt;PDFs in Perplexity&lt;/h3&gt;
&lt;p&gt;Perplexity handles PDFs well, especially for research papers and reports. Upload them to a Space for persistent reference. Perplexity can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Extract text and answer questions about the content&lt;/li&gt;
&lt;li&gt;Compare information across multiple uploaded PDFs&lt;/li&gt;
&lt;li&gt;Combine uploaded PDF data with web search results&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Markdown&lt;/h3&gt;
&lt;p&gt;For context documents you author (instructions, outlines, research frameworks), Markdown is cleaner and more precisely parsed. Use Markdown for structure-dependent content where formatting matters.&lt;/p&gt;
&lt;h3&gt;The Hybrid Approach&lt;/h3&gt;
&lt;p&gt;Use PDFs for received documents (research papers, reports, specifications). Use Markdown for documents you create (Space instructions, research frameworks, output templates).&lt;/p&gt;
&lt;h2&gt;Memory (Enterprise)&lt;/h2&gt;
&lt;p&gt;On enterprise plans, Perplexity supports persistent Memory that remembers facts about you across conversations. This is similar to ChatGPT&apos;s Memory feature and stores preferences, role information, and recurring context that you should not have to re-state every time.&lt;/p&gt;
&lt;p&gt;For individual users, Spaces serve a similar purpose by maintaining per-workspace instructions and files, even though the memory mechanism is different.&lt;/p&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Research Pipeline Pattern&lt;/h3&gt;
&lt;p&gt;Use Perplexity as the front end of a research pipeline:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Discovery:&lt;/strong&gt; Use Deep Research to survey a topic comprehensively&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Validation:&lt;/strong&gt; Switch to Academic Focus to verify key claims with peer-reviewed sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Community insight:&lt;/strong&gt; Switch to Social Focus to understand real-world adoption and reception&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Synthesis:&lt;/strong&gt; Switch to Writing Focus to draft your analysis based on the accumulated context&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Export:&lt;/strong&gt; Copy the synthesized research into your writing tool of choice&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;The Comparative Analysis Pattern&lt;/h3&gt;
&lt;p&gt;Use Spaces to compare multiple topics or options:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Upload comparison criteria as a Markdown file&lt;/li&gt;
&lt;li&gt;Research each option in a separate conversation within the Space&lt;/li&gt;
&lt;li&gt;Use a final conversation to synthesize the findings into a comparison table&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The Space maintains the criteria and accumulated research across all conversations.&lt;/p&gt;
&lt;h3&gt;The Source Quality Verification Pattern&lt;/h3&gt;
&lt;p&gt;Use Focus Mode switching to verify claims across different source types:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Find a claim in &amp;quot;All&amp;quot; mode&lt;/li&gt;
&lt;li&gt;Verify it in &amp;quot;Academic&amp;quot; mode (peer-reviewed backing)&lt;/li&gt;
&lt;li&gt;Check reception in &amp;quot;Social&amp;quot; mode (how practitioners view the claim)&lt;/li&gt;
&lt;li&gt;Check for retractions or updates in &amp;quot;All&amp;quot; mode with a date filter&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This multi-angle verification produces higher-confidence research than relying on a single source type.&lt;/p&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not using Spaces for project research.&lt;/strong&gt; Individual conversations lose context when you close them. Spaces maintain your instructions, files, and conversation history persistently.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring Focus Modes.&lt;/strong&gt; Using &amp;quot;All&amp;quot; for everything misses the specialized results that Academic, Social, and other modes provide. Match the mode to the question.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Vague Deep Research prompts.&lt;/strong&gt; Deep Research executes autonomously, so a vague prompt produces a vague report. Be specific about what you want investigated and how you want it structured.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Uploading too many unrelated files to one Space.&lt;/strong&gt; Keep Spaces focused on specific topics. A Space with 30 unrelated documents dilutes the context for any specific query.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not verifying citations.&lt;/strong&gt; Perplexity provides source citations for a reason. Click through and verify key claims, especially for high-stakes research.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Using Perplexity for tasks that need code execution or local tool access.&lt;/strong&gt; Perplexity is a research tool, not a coding agent. For tasks requiring code execution, terminal access, or database interaction, use a coding-focused tool instead.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about AI-assisted research, context management, and agentic workflows, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for Cursor: A Complete Guide to the AI-Native Code Editor</title><link>https://iceberglakehouse.com/posts/2026-03-context-cursor/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-cursor/</guid><description>
Cursor is an AI-native code editor built on the VS Code foundation that integrates AI deeply into every aspect of the development workflow. Its conte...</description><pubDate>Sat, 07 Mar 2026 20:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Cursor is an AI-native code editor built on the VS Code foundation that integrates AI deeply into every aspect of the development workflow. Its context management system is one of the most sophisticated among coding tools, combining workspace-level indexing, granular rules files, documentation integration, MCP server support, and intelligent context assembly that automatically determines which files and symbols are relevant to your current task.&lt;/p&gt;
&lt;p&gt;This guide covers every context management mechanism Cursor provides and explains how to configure them for productive, reliable AI-assisted development.&lt;/p&gt;
&lt;h2&gt;How Cursor Manages Context&lt;/h2&gt;
&lt;p&gt;Cursor assembles context from multiple sources, with intelligent prioritization:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Workspace index&lt;/strong&gt; - a semantic index of your entire codebase built on first open&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;.cursor/rules/ files&lt;/strong&gt; - project-specific instructions in MDC format&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;@-mentions&lt;/strong&gt; - explicit context you inject into prompts (@file, @codebase, @Docs)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MCP server connections&lt;/strong&gt; - external tools and data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Active file and selection&lt;/strong&gt; - the code you are currently looking at&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Conversation history&lt;/strong&gt; - recent messages in the current chat session&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Debug context&lt;/strong&gt; - error messages, stack traces, and terminal output&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The workspace index is what makes Cursor&apos;s context management stand out. Instead of relying on you to specify which files are relevant, Cursor semantically indexes your entire project and retrieves the most relevant code based on your query.&lt;/p&gt;
&lt;h2&gt;.cursor/rules/: Project-Level Instructions&lt;/h2&gt;
&lt;p&gt;Cursor uses &lt;code&gt;.cursor/rules/&lt;/code&gt; files in MDC (Markdown Configuration) format to provide project-level instructions. These files tell Cursor how to behave within your project.&lt;/p&gt;
&lt;h3&gt;Rule Types&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Always&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Loaded for every interaction&lt;/td&gt;
&lt;td&gt;Core conventions, style preferences&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auto&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Loaded when matched files are active&lt;/td&gt;
&lt;td&gt;File-type specific rules (e.g., Python vs. TypeScript)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Available to the agent for self-selection&lt;/td&gt;
&lt;td&gt;Specialized knowledge the agent invokes when needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Manual&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Only loaded when explicitly referenced&lt;/td&gt;
&lt;td&gt;Rarely used instructions you invoke for specific tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Creating Rules&lt;/h3&gt;
&lt;p&gt;Create &lt;code&gt;.mdc&lt;/code&gt; files in &lt;code&gt;.cursor/rules/&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
description: Python coding standards for this project
globs: [&amp;quot;**/*.py&amp;quot;]
alwaysApply: false
---

# Python Rules

## Style

- Use type hints for all function parameters and return values
- Use dataclasses or Pydantic models instead of plain dicts
- Prefer f-strings over .format() or %-formatting
- Maximum line length is 88 characters (Black default)

## Testing

- Use pytest, not unittest
- Test files mirror the source tree: src/services/auth.py -&amp;gt; tests/services/test_auth.py
- Use factories for test data, not fixtures
- Mock external services at the client boundary

## Architecture

- Business logic lives in src/services/
- Database access goes through src/repositories/
- API routes are thin: validate input, call service, return response
- Never import from internal modules; use the package&apos;s public API
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Rules Best Practices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Use globs to target rules.&lt;/strong&gt; Auto rules with specific glob patterns (like &lt;code&gt;**/*.py&lt;/code&gt;) keep Python conventions separate from JavaScript conventions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Keep rules actionable.&lt;/strong&gt; Every rule should describe a specific behavior the agent should follow. Vague guidance like &amp;quot;write clean code&amp;quot; wastes tokens.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Document your architecture.&lt;/strong&gt; Tell Cursor where things live. Understanding your project structure prevents the agent from putting code in the wrong place.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Include negative constraints.&lt;/strong&gt; &amp;quot;Do NOT use class-based views&amp;quot; is often more effective than a long description of what to use instead.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;@-Mentions: Explicit Context Injection&lt;/h2&gt;
&lt;p&gt;Cursor&apos;s @-mention system lets you add specific context to any prompt.&lt;/p&gt;
&lt;h3&gt;Available @-Mentions&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mention&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@file&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reference a specific file by name&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@codebase&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Search the entire indexed codebase for relevant context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@Docs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Search indexed documentation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@web&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Search the web for current information&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@git&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reference Git history (diffs, commits, branches)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@definitions&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include symbol definitions referenced in your selection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@folders&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include directory structure context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Using @codebase Effectively&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;@codebase&lt;/code&gt; is the most powerful @-mention because it triggers semantic search across your entire project. When you type:&lt;/p&gt;
&lt;p&gt;&amp;quot;@codebase How is authentication implemented in this project?&amp;quot;&lt;/p&gt;
&lt;p&gt;Cursor searches its semantic index, retrieves the most relevant files and symbols, and includes them in the context. This is far more efficient than manually specifying each file.&lt;/p&gt;
&lt;h3&gt;@Docs: Documentation-Aware Context&lt;/h3&gt;
&lt;p&gt;You can index external documentation sources so Cursor can reference them:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Cursor Settings &amp;gt; Features &amp;gt; Docs&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Add documentation URLs (framework docs, API references, internal wikis)&lt;/li&gt;
&lt;li&gt;Cursor crawls and indexes the documentation&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;@Docs&lt;/code&gt; in prompts to reference the indexed content&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Example: &amp;quot;Using @Docs for React 19, refactor this component to use the new use() hook.&amp;quot;&lt;/p&gt;
&lt;p&gt;This is particularly valuable for newer libraries where the AI&apos;s training data may be outdated.&lt;/p&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;Cursor supports MCP for connecting to external tools and services.&lt;/p&gt;
&lt;h3&gt;Configuration&lt;/h3&gt;
&lt;p&gt;MCP servers are configured in Cursor&apos;s settings:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;postgres&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@anthropic/mcp-server-postgres&amp;quot;],
      &amp;quot;env&amp;quot;: {
        &amp;quot;DATABASE_URL&amp;quot;: &amp;quot;postgresql://dev@localhost:5432/mydb&amp;quot;
      }
    },
    &amp;quot;github&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@anthropic/mcp-server-github&amp;quot;]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When to Use MCP in Cursor&lt;/h3&gt;
&lt;p&gt;MCP is most valuable when the task requires live data from outside the codebase:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Querying a development database to understand schema or verify data&lt;/li&gt;
&lt;li&gt;Interacting with GitHub for PR reviews or CI status&lt;/li&gt;
&lt;li&gt;Accessing internal APIs to verify integration behavior&lt;/li&gt;
&lt;li&gt;Running browser automation to test frontend changes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For code-only tasks (refactoring, writing tests, fixing bugs), Cursor&apos;s built-in codebase index is sufficient.&lt;/p&gt;
&lt;h2&gt;Debug Mode and Error Context&lt;/h2&gt;
&lt;p&gt;Cursor offers a Debug Mode that automatically provides error context to the AI:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;When you encounter an error in the terminal or running application&lt;/li&gt;
&lt;li&gt;Cursor captures the error message, stack trace, and relevant file context&lt;/li&gt;
&lt;li&gt;You can ask the AI to diagnose and fix the issue with full context&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This automatic error context gathering is a significant context management feature because it eliminates the manual process of copying error messages and stack traces into prompts.&lt;/p&gt;
&lt;h2&gt;Thinking About Context Levels&lt;/h2&gt;
&lt;h3&gt;Minimal Context (Quick Fixes)&lt;/h3&gt;
&lt;p&gt;For small edits, select code in the editor and use inline editing (Cmd+K / Ctrl+K). Cursor uses the current file and selection as context. No additional setup needed.&lt;/p&gt;
&lt;h3&gt;Moderate Context (Feature Development)&lt;/h3&gt;
&lt;p&gt;Use the chat panel with @-mentions. Reference the relevant files with @file, use @codebase for broader understanding, and include @Docs for framework-specific guidance.&lt;/p&gt;
&lt;h3&gt;Comprehensive Context (Architecture Work)&lt;/h3&gt;
&lt;p&gt;Combine .cursor/rules/ with @codebase and MCP servers. The rules provide your conventions, @codebase provides structural understanding, and MCP provides live system context.&lt;/p&gt;
&lt;h2&gt;External Documents: PDFs vs Markdown&lt;/h2&gt;
&lt;h3&gt;Markdown for Rules&lt;/h3&gt;
&lt;p&gt;All .cursor/rules/ files use MDC (Markdown-based) format. Your coding standards, style guides, and architectural documentation should be in this format.&lt;/p&gt;
&lt;h3&gt;Documentation Indexing&lt;/h3&gt;
&lt;p&gt;For external documentation, use the @Docs system to index web-based docs directly. This is more effective than converting PDFs to Markdown because Cursor handles the indexing and retrieval automatically.&lt;/p&gt;
&lt;h3&gt;For Reference Material&lt;/h3&gt;
&lt;p&gt;If you have specifications or design documents in PDF form, the most practical approach is to extract key sections into .mdc rule files or Markdown documents in your repository. This makes them searchable through @codebase.&lt;/p&gt;
&lt;h2&gt;Model Selection and Context Windows&lt;/h2&gt;
&lt;p&gt;Cursor supports multiple AI providers and models. Your model choice affects context management because different models have different context window sizes and capabilities.&lt;/p&gt;
&lt;h3&gt;Context Window Considerations&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Context Window&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Sonnet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;200K tokens&lt;/td&gt;
&lt;td&gt;Large codebase analysis, complex refactoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-4o&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;128K tokens&lt;/td&gt;
&lt;td&gt;Feature development, code generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cursor Small&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;td&gt;Quick edits, inline completions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For large projects, choose a model with a bigger context window so Cursor can include more codebase context without hitting limits. For simple edits, a smaller, faster model is more responsive.&lt;/p&gt;
&lt;h3&gt;How Cursor Assembles Context&lt;/h3&gt;
&lt;p&gt;When you send a message in Cursor&apos;s chat, the editor automatically assembles context by:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Including the active file&lt;/strong&gt; and your cursor position&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Including any @-mentioned files&lt;/strong&gt; or resources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Searching the workspace index&lt;/strong&gt; if @codebase is used&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Loading applicable rules&lt;/strong&gt; from .cursor/rules/ based on the active file type&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Including recent conversation history&lt;/strong&gt; for continuity&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Adding any MCP server tool descriptions&lt;/strong&gt; for agent mode&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This automatic assembly is why Cursor often produces better results than manually pasting code into a generic chatbot. The context is structured and relevant, not random.&lt;/p&gt;
&lt;h3&gt;Context Budget Management&lt;/h3&gt;
&lt;p&gt;Each prompt has a context budget limited by the model&apos;s context window. When the budget is tight:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Be selective with @file mentions (reference only files directly relevant to the task)&lt;/li&gt;
&lt;li&gt;Use @codebase instead of @file for exploratory questions (it retrieves only relevant snippets)&lt;/li&gt;
&lt;li&gt;Keep rules files concise and targeted&lt;/li&gt;
&lt;li&gt;Start new chat sessions when switching topics&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Workspace Indexing Deep Dive&lt;/h2&gt;
&lt;p&gt;The workspace index is Cursor&apos;s most powerful context feature. It creates a semantic understanding of your entire codebase that powers @codebase searches and the agent&apos;s ability to navigate your project.&lt;/p&gt;
&lt;h3&gt;How Indexing Works&lt;/h3&gt;
&lt;p&gt;When you open a project in Cursor:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Cursor scans all files (respecting .gitignore)&lt;/li&gt;
&lt;li&gt;It creates embeddings (semantic representations) of code symbols, functions, and classes&lt;/li&gt;
&lt;li&gt;These embeddings are stored in a local index&lt;/li&gt;
&lt;li&gt;When you ask questions, Cursor searches this index for the most relevant code&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Indexing Best Practices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Let the index complete before starting work.&lt;/strong&gt; Look for the indexing indicator in the status bar.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Re-index after major changes.&lt;/strong&gt; If you merge a large branch or restructure directories, trigger a re-index.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trust the index.&lt;/strong&gt; @codebase search often finds more relevant code than you would think to include manually.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Practical Workflow Recommendations&lt;/h2&gt;
&lt;h3&gt;For New Projects&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Open the project in Cursor and let it index&lt;/li&gt;
&lt;li&gt;Create .cursor/rules/ with your core coding standards&lt;/li&gt;
&lt;li&gt;Add @Docs entries for the frameworks you are using&lt;/li&gt;
&lt;li&gt;Start with small tasks to verify Cursor understands your conventions&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;For Team Adoption&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Check .cursor/rules/ into version control&lt;/li&gt;
&lt;li&gt;Agree on shared rule categories: Always rules for team-wide standards, Auto rules for language-specific patterns&lt;/li&gt;
&lt;li&gt;Add team documentation to @Docs&lt;/li&gt;
&lt;li&gt;Create Agent rules for specialized knowledge (deployment, database conventions)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;For Complex Features&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Start with @codebase to understand the existing implementation&lt;/li&gt;
&lt;li&gt;Use Composer for multi-file changes&lt;/li&gt;
&lt;li&gt;Reference @Docs for framework-specific guidance&lt;/li&gt;
&lt;li&gt;Use Debug Mode to quickly resolve implementation issues&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Notepads Pattern&lt;/h3&gt;
&lt;p&gt;Cursor&apos;s Notepads feature lets you create persistent context documents within the editor. Unlike .cursor/rules/ (which are loaded automatically), Notepads are reference documents you can @-mention when needed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Architecture decision records&lt;/li&gt;
&lt;li&gt;API specifications&lt;/li&gt;
&lt;li&gt;Design system documentation&lt;/li&gt;
&lt;li&gt;Onboarding guides for new team members&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;The Composer Pattern&lt;/h3&gt;
&lt;p&gt;Use Cursor&apos;s Composer (multi-file agent mode) for changes that span multiple files:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Describe the feature or change you want&lt;/li&gt;
&lt;li&gt;Composer plans modifications across relevant files&lt;/li&gt;
&lt;li&gt;Review the proposed changes&lt;/li&gt;
&lt;li&gt;Apply or reject each file modification individually&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Composer automatically assembles context from the workspace index, making it effective for cross-cutting changes.&lt;/p&gt;
&lt;h3&gt;The Rules Layering Strategy&lt;/h3&gt;
&lt;p&gt;Combine different rule types for comprehensive coverage:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Always rules:&lt;/strong&gt; Universal team conventions (style, testing, documentation)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Auto rules:&lt;/strong&gt; Language-specific standards (Python patterns, TypeScript patterns)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agent rules:&lt;/strong&gt; Specialized knowledge (deployment procedures, database conventions)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This layering ensures the right context is active for the right task without overloading every interaction.&lt;/p&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not creating .cursor/rules/.&lt;/strong&gt; Without rules, Cursor applies generic conventions that may not match your project. The rules are the single highest-impact configuration you can make.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring @codebase.&lt;/strong&gt; Many users manually specify files when @codebase would find the relevant code automatically. Trust the semantic search.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not indexing documentation.&lt;/strong&gt; If you are using a newer framework, @Docs with indexed documentation prevents the AI from relying on outdated training data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Over-specifying context.&lt;/strong&gt; If you include 20 files via @file when only 3 are relevant, you dilute the AI&apos;s attention. Use @codebase to let Cursor find the right files, or be selective with @file mentions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Skipping the workspace indexing.&lt;/strong&gt; Let Cursor finish indexing your workspace on first open. The index powers @codebase and context assembly. Without it, context quality degrades significantly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not using Debug Mode.&lt;/strong&gt; When errors occur, Debug Mode provides structured error context that significantly improves the AI&apos;s diagnostic accuracy compared to manually pasting error messages.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about AI-assisted development and context management strategies, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for OpenWork: A Complete Guide to the Desktop AI Agent Framework</title><link>https://iceberglakehouse.com/posts/2026-03-context-openwork/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-openwork/</guid><description>
OpenWork is a desktop-native AI agent framework designed for local, multi-step task execution on your computer. Unlike browser-based AI tools or term...</description><pubDate>Sat, 07 Mar 2026 19:00:00 GMT</pubDate><content:encoded>&lt;p&gt;OpenWork is a desktop-native AI agent framework designed for local, multi-step task execution on your computer. Unlike browser-based AI tools or terminal agents, OpenWork operates as a desktop application that can interact with your file system, manage long-running sessions, and execute complex workflows autonomously. Its context management centers on Skills, session persistence, direct file system access, and a plugin architecture that extends its capabilities.&lt;/p&gt;
&lt;p&gt;This guide explains how to manage context effectively in OpenWork to delegate complex tasks, maintain continuity across sessions, and build reusable automation workflows.&lt;/p&gt;
&lt;h2&gt;How OpenWork Manages Context&lt;/h2&gt;
&lt;p&gt;OpenWork builds its context from several layers:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Skills&lt;/strong&gt; - predefined capability packages that define what the agent can do&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Session state&lt;/strong&gt; - persistent history and progress tracking across interactions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File system access&lt;/strong&gt; - direct read/write access to your local files&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Plugin extensions&lt;/strong&gt; - additional capabilities including MCP server connections&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Task definitions&lt;/strong&gt; - structured descriptions of multi-step workflows&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Your instructions&lt;/strong&gt; - natural language guidance provided at task creation&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The key difference between OpenWork and other tools is its desktop-native design. It is built to interact with your operating system, not just with text in a terminal or browser. This means it can manage files, organize folders, process documents, and perform tasks that span multiple applications.&lt;/p&gt;
&lt;h2&gt;Skills: The Foundation of OpenWork&apos;s Capabilities&lt;/h2&gt;
&lt;p&gt;Skills in OpenWork define focused areas of expertise. Each Skill packages instructions, tools, and workflows into a reusable unit.&lt;/p&gt;
&lt;h3&gt;Built-In Skills&lt;/h3&gt;
&lt;p&gt;OpenWork ships with core Skills for common tasks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;File Management:&lt;/strong&gt; Organizing, renaming, moving, and transforming files&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Document Processing:&lt;/strong&gt; Reading, summarizing, and creating documents&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Analysis:&lt;/strong&gt; Processing spreadsheets, CSVs, and structured data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Web Research:&lt;/strong&gt; Gathering information from web sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code Assistance:&lt;/strong&gt; Writing, reviewing, and refactoring code&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Creating Custom Skills&lt;/h3&gt;
&lt;p&gt;Define custom Skills that match your specific workflows:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Skill: Monthly Report Generator

## Purpose

Generate monthly departmental reports by combining data from multiple sources.

## Inputs Required

- Sales data CSV from /data/sales/
- Customer feedback file from /data/feedback/
- Team metrics from /data/team/

## Process

1. Read and validate all input files
2. Calculate key metrics (revenue, growth, satisfaction scores)
3. Generate narrative summary for each section
4. Format the report using the template in /templates/monthly-report.md
5. Save to /reports/YYYY-MM-monthly-report.md

## Quality Checks

- All numerical values must be sourced from the input data
- The report must include year-over-year comparisons
- Format all currency values with two decimal places
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Skill Selection and Context&lt;/h3&gt;
&lt;p&gt;When you assign a task, OpenWork selects the relevant Skills based on the task description. The selected Skills become part of the active context, giving the agent the specific instructions it needs for that type of work. This means well-defined Skills reduce the amount of context you need to provide in each task description.&lt;/p&gt;
&lt;h2&gt;Session Management and Persistence&lt;/h2&gt;
&lt;p&gt;OpenWork maintains persistent sessions that carry context across interactions. This is critical for multi-step tasks that span hours or days. Unlike web-based AI tools where closing the browser tab loses your conversation state, OpenWork sessions are durably stored on your local machine and survive application restarts.&lt;/p&gt;
&lt;h3&gt;Session State&lt;/h3&gt;
&lt;p&gt;Each session tracks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Conversation history:&lt;/strong&gt; Every instruction and response&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File operations:&lt;/strong&gt; What files were read, created, or modified&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Task progress:&lt;/strong&gt; Current step in multi-step workflows&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agent decisions:&lt;/strong&gt; Why specific actions were taken (for auditability)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Resuming Sessions&lt;/h3&gt;
&lt;p&gt;When you return to OpenWork after closing it, your sessions are preserved. You can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Continue where you left off on an interrupted task&lt;/li&gt;
&lt;li&gt;Review what the agent did while you were away (for scheduled tasks)&lt;/li&gt;
&lt;li&gt;Provide additional instructions based on completed work&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Starting Fresh&lt;/h3&gt;
&lt;p&gt;For unrelated work, start a new session. Carrying over context from a previous project creates noise that degrades the agent&apos;s focus.&lt;/p&gt;
&lt;h2&gt;File System Access: Direct Local Interaction&lt;/h2&gt;
&lt;p&gt;OpenWork&apos;s direct file system access is one of its primary context advantages. The agent reads files in real time (not from uploaded snapshots) and writes output directly to your file system.&lt;/p&gt;
&lt;h3&gt;Context from Your File System&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Project structures:&lt;/strong&gt; The agent can browse directories to understand organization&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Document contents:&lt;/strong&gt; Read any text-based file without manual copying&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data files:&lt;/strong&gt; Process CSVs, JSON files, and other structured data in place&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configuration:&lt;/strong&gt; Read settings files to understand tool configurations&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Best Practices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Organize files before delegating.&lt;/strong&gt; A well-structured file system gives OpenWork better context than a messy one.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use descriptive file names.&lt;/strong&gt; &lt;code&gt;q3-revenue-analysis.csv&lt;/code&gt; gives the agent more context than &lt;code&gt;data2.csv&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Create a dedicated working directory&lt;/strong&gt; for each project or task category.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Store templates&lt;/strong&gt; in a consistent location so Skills can reference them.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;MCP Support Through Plugins&lt;/h2&gt;
&lt;p&gt;OpenWork supports MCP servers through its plugin architecture, enabling connections to external data sources and tools.&lt;/p&gt;
&lt;h3&gt;When MCP Adds Value&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Database integration:&lt;/strong&gt; Let OpenWork query databases for report generation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cloud storage:&lt;/strong&gt; Access files in Google Drive, OneDrive, or S3&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;API integration:&lt;/strong&gt; Connect to internal services for data retrieval&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Communication tools:&lt;/strong&gt; Draft messages or pull context from Slack, email, or other platforms&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Configuration&lt;/h3&gt;
&lt;p&gt;MCP servers are configured through OpenWork&apos;s settings panel. Each server connection becomes available as a tool that Skills can utilize.&lt;/p&gt;
&lt;h2&gt;Thinking About Context Levels&lt;/h2&gt;
&lt;h3&gt;Simple Tasks (Minimal Context)&lt;/h3&gt;
&lt;p&gt;For straightforward file operations (&amp;quot;Rename all files in /downloads/ to include today&apos;s date&amp;quot;), the task description and file system access provide sufficient context.&lt;/p&gt;
&lt;h3&gt;Moderate Tasks&lt;/h3&gt;
&lt;p&gt;For tasks requiring judgment (&amp;quot;Review the documents in /contracts/ and flag any that expire within 30 days&amp;quot;), provide the criteria and desired output format. OpenWork will use its Skills and file access to execute.&lt;/p&gt;
&lt;h3&gt;Complex Tasks&lt;/h3&gt;
&lt;p&gt;For multi-step workflows (&amp;quot;Create a quarterly business review presentation from data in three different folders, following the template in /templates/&amp;quot;), invest in a detailed task definition and ensure the relevant Skills are configured.&lt;/p&gt;
&lt;h2&gt;Structuring Context for Effective Delegation&lt;/h2&gt;
&lt;h3&gt;The Briefing Document Approach&lt;/h3&gt;
&lt;p&gt;For complex tasks, create a briefing document that OpenWork reads before starting:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Task Briefing: Q3 Performance Analysis

## Objective

Create a comprehensive performance analysis comparing Q3 results
against Q2 and the same quarter last year.

## Data Sources

- /data/revenue/q3-2026.csv (primary revenue data)
- /data/revenue/q2-2026.csv (previous quarter)
- /data/revenue/q3-2025.csv (year-over-year comparison)
- /data/kpis/team-metrics.json (operational metrics)

## Required Sections

1. Executive Summary (250 words max)
2. Revenue Analysis with charts
3. Year-over-Year Comparison
4. Team Performance Metrics
5. Recommendations

## Formatting

- Use the template at /templates/quarterly-analysis.md
- All percentages to one decimal place
- Currency in USD with comma separators
- Charts as ASCII/text-based tables

## Quality Standards

- Every claim must reference a specific data point
- Include both absolute and percentage change figures
- Flag any anomalies or data gaps
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This structured briefing gives OpenWork comprehensive context without relying on interactive conversation.&lt;/p&gt;
&lt;h3&gt;The Progressive Detail Pattern&lt;/h3&gt;
&lt;p&gt;Provide context in layers, starting broad and getting specific:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;High-level goal:&lt;/strong&gt; &amp;quot;Create a monthly financial report&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Specific requirements:&lt;/strong&gt; &amp;quot;Include revenue, costs, and margin analysis&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data locations:&lt;/strong&gt; &amp;quot;Source data is in /finance/monthly/&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quality criteria:&lt;/strong&gt; &amp;quot;All numbers must reconcile with the source data&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Output format:&lt;/strong&gt; &amp;quot;Follow the template in /templates/&amp;quot;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each layer adds specificity without contradicting previous layers.&lt;/p&gt;
&lt;h2&gt;Multi-Agent Coordination&lt;/h2&gt;
&lt;p&gt;OpenWork can coordinate multiple agents working on related but independent tasks:&lt;/p&gt;
&lt;h3&gt;Parallel Execution&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Agent 1:&lt;/strong&gt; Processes financial data and creates charts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agent 2:&lt;/strong&gt; Summarizes customer feedback from text files&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agent 3:&lt;/strong&gt; Compiles operational metrics from log files&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each agent works with its own focused context, and the results are combined into a final deliverable.&lt;/p&gt;
&lt;h3&gt;Sequential Handoffs&lt;/h3&gt;
&lt;p&gt;For workflows where each step depends on the previous one:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Agent A produces raw analysis&lt;/li&gt;
&lt;li&gt;Agent B reviews and refines the analysis&lt;/li&gt;
&lt;li&gt;Agent C formats the final output&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The context from each step flows to the next, creating a pipeline of increasingly refined output.&lt;/p&gt;
&lt;h2&gt;External Documents: PDFs vs. Markdown&lt;/h2&gt;
&lt;h3&gt;PDFs&lt;/h3&gt;
&lt;p&gt;OpenWork can read PDFs directly from your file system. Use PDFs for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Existing reports and documents that are already in PDF format&lt;/li&gt;
&lt;li&gt;External specifications or contracts received from others&lt;/li&gt;
&lt;li&gt;Formatted documents where layout matters&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Markdown&lt;/h3&gt;
&lt;p&gt;For documents you create specifically for OpenWork (templates, instructions, style guides), use Markdown. It parses more reliably and is easier for the agent to reference precisely.&lt;/p&gt;
&lt;h3&gt;The File-Based Advantage&lt;/h3&gt;
&lt;p&gt;Because OpenWork accesses files directly (not through uploads), the format matters less than it does for web-based tools. Both PDFs and Markdown are readable from the file system. Choose based on the source: use the original format for received documents, and Markdown for documents you author.&lt;/p&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Scheduled Workflow Pattern&lt;/h3&gt;
&lt;p&gt;Set up recurring tasks that OpenWork executes on a schedule:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Define the task with clear inputs, processes, and outputs&lt;/li&gt;
&lt;li&gt;Schedule it to run at a specific time (daily, weekly, monthly)&lt;/li&gt;
&lt;li&gt;OpenWork executes the task autonomously and saves the results&lt;/li&gt;
&lt;li&gt;Review the output when convenient&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is ideal for report generation, data processing, file organization, and routine maintenance tasks.&lt;/p&gt;
&lt;h3&gt;The Multi-Step Pipeline Pattern&lt;/h3&gt;
&lt;p&gt;Chain multiple Skills into a pipeline:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Step 1 (Data Collection):&lt;/strong&gt; Gather data from multiple sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Step 2 (Processing):&lt;/strong&gt; Clean, transform, and analyze the data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Step 3 (Generation):&lt;/strong&gt; Create the output document or presentation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Step 4 (Verification):&lt;/strong&gt; Check the output against quality criteria&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each step builds on the context from previous steps, creating a coherent end-to-end workflow.&lt;/p&gt;
&lt;h3&gt;The Delegation Escalation Pattern&lt;/h3&gt;
&lt;p&gt;Start with simple delegations and gradually increase complexity:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Week 1:&lt;/strong&gt; File organization and simple document creation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Week 2:&lt;/strong&gt; Data processing and report generation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Week 3:&lt;/strong&gt; Multi-source research and synthesis&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Week 4:&lt;/strong&gt; Fully automated recurring workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This builds your confidence in OpenWork&apos;s handling of context while gradually training the agent (through Skills and session history) on your specific needs.&lt;/p&gt;
&lt;h2&gt;When to Use OpenWork vs. Other Tools&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Use OpenWork when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your tasks involve desktop-level file management&lt;/li&gt;
&lt;li&gt;You need multi-step autonomous execution&lt;/li&gt;
&lt;li&gt;You want scheduled, recurring task automation&lt;/li&gt;
&lt;li&gt;Your work is document-centric (reports, presentations, data processing)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use a terminal agent (Claude Code, Gemini CLI, OpenCode) when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your work is code-centric&lt;/li&gt;
&lt;li&gt;You need direct terminal command execution&lt;/li&gt;
&lt;li&gt;You want inline access to compilers, test runners, and build tools&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use a web-based tool (ChatGPT, Claude Web) when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You need interactive conversation and brainstorming&lt;/li&gt;
&lt;li&gt;The task is primarily knowledge-based&lt;/li&gt;
&lt;li&gt;You do not need local file system access&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Vague task descriptions.&lt;/strong&gt; &amp;quot;Work on my files&amp;quot; gives OpenWork nothing to execute. Specify what files, what action, and what output you expect.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Skipping Skills for repeatable work.&lt;/strong&gt; If you delegate the same type of task more than twice, create a Skill for it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not reviewing autonomous output.&lt;/strong&gt; Scheduled tasks run without supervision. Always review the results, especially during the first few runs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Disorganized file systems.&lt;/strong&gt; OpenWork&apos;s effectiveness depends on finding and understanding your files. Messy directories produce messy results.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Over-scoping single tasks.&lt;/strong&gt; Break large projects into multiple tasks with clear handoff points. OpenWork handles focused, well-defined tasks better than vague, sweeping ones.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not leveraging session persistence.&lt;/strong&gt; If a task is partially complete, resume the session rather than starting over. The carried context improves continuity.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about working effectively with AI agents and context management strategies, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for OpenCode: A Complete Guide to the Open-Source Terminal AI Agent</title><link>https://iceberglakehouse.com/posts/2026-03-context-opencode/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-opencode/</guid><description>
OpenCode is an open-source terminal-based AI coding agent that prioritizes privacy, local-first operation, and broad model provider support. Built as...</description><pubDate>Sat, 07 Mar 2026 18:00:00 GMT</pubDate><content:encoded>&lt;p&gt;OpenCode is an open-source terminal-based AI coding agent that prioritizes privacy, local-first operation, and broad model provider support. Built as a TUI (terminal user interface) application, it runs entirely in your terminal and supports dozens of LLM providers from OpenAI and Anthropic to local models through Ollama. Its context management system is built around configuration files, session persistence, MCP integration, and a dual-agent architecture that separates planning from code generation.&lt;/p&gt;
&lt;p&gt;This guide covers every context management mechanism OpenCode offers and explains how to configure them for effective development workflows, regardless of which model provider you choose.&lt;/p&gt;
&lt;h2&gt;The TUI Advantage for Context Management&lt;/h2&gt;
&lt;p&gt;OpenCode&apos;s TUI (Terminal User Interface) provides a structured visual interface within your terminal. Unlike bare CLI tools where you interact through plain text, the TUI offers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A conversation panel showing the full history with syntax-highlighted code blocks&lt;/li&gt;
&lt;li&gt;A file browser for navigating your project structure&lt;/li&gt;
&lt;li&gt;A status bar showing the active model, session state, and token usage&lt;/li&gt;
&lt;li&gt;Visual indicators for agent mode (Plan vs. Build)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The TUI makes context management more tangible because you can see what the agent is working with. Token usage indicators help you understand when you are approaching context limits, and the session panel lets you manage conversation history visually.&lt;/p&gt;
&lt;h2&gt;How OpenCode Manages Context&lt;/h2&gt;
&lt;p&gt;OpenCode assembles its context from several sources:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;opencode.json&lt;/strong&gt; - project-level configuration and instructions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Session history&lt;/strong&gt; - SQLite-backed persistent sessions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MCP server connections&lt;/strong&gt; - external tools and data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LSP (Language Server Protocol)&lt;/strong&gt; integration - real-time code intelligence&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The codebase&lt;/strong&gt; - files, directories, and project structure&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom commands&lt;/strong&gt; - user-defined reusable operations&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;What distinguishes OpenCode from other terminal agents is its architectural separation between &amp;quot;Build&amp;quot; and &amp;quot;Plan&amp;quot; agents. The Build agent writes code and makes changes. The Plan agent reasons about architecture and strategy without modifying files. This separation affects how you structure context: planning tasks need architectural context, while building tasks need implementation detail.&lt;/p&gt;
&lt;h2&gt;opencode.json: Project Configuration&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;opencode.json&lt;/code&gt; file in your project root is the primary configuration mechanism. It defines provider settings, model selection, and project-specific context.&lt;/p&gt;
&lt;h3&gt;Basic Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;$schema&amp;quot;: &amp;quot;https://opencode.ai/config.schema.json&amp;quot;,
  &amp;quot;provider&amp;quot;: {
    &amp;quot;name&amp;quot;: &amp;quot;anthropic&amp;quot;,
    &amp;quot;model&amp;quot;: &amp;quot;claude-sonnet-4.5&amp;quot;
  },
  &amp;quot;context&amp;quot;: {
    &amp;quot;instructions&amp;quot;: &amp;quot;This is a Python FastAPI application with PostgreSQL. Use Ruff for linting and pytest for testing. Follow PEP 8 strictly.&amp;quot;,
    &amp;quot;include&amp;quot;: [&amp;quot;src/&amp;quot;, &amp;quot;tests/&amp;quot;, &amp;quot;docs/&amp;quot;],
    &amp;quot;exclude&amp;quot;: [&amp;quot;*.pyc&amp;quot;, &amp;quot;__pycache__/&amp;quot;, &amp;quot;.venv/&amp;quot;]
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Context Instructions&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;context.instructions&lt;/code&gt; field functions like CLAUDE.md or GEMINI.md for other tools. Include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your technology stack and versions&lt;/li&gt;
&lt;li&gt;Coding conventions and style preferences&lt;/li&gt;
&lt;li&gt;Testing strategy and framework&lt;/li&gt;
&lt;li&gt;Architecture decisions and patterns&lt;/li&gt;
&lt;li&gt;Build and deployment commands&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Include and Exclude Patterns&lt;/h3&gt;
&lt;p&gt;Control what OpenCode sees by specifying include and exclude patterns. This focuses the agent&apos;s attention on relevant code and prevents it from wasting context on generated files, dependencies, or build artifacts.&lt;/p&gt;
&lt;h3&gt;Provider Flexibility&lt;/h3&gt;
&lt;p&gt;OpenCode supports a wide range of providers:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Models&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GPT-4o, o3, etc.&lt;/td&gt;
&lt;td&gt;Cloud-hosted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Anthropic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude Sonnet, Opus&lt;/td&gt;
&lt;td&gt;Cloud-hosted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemini Pro, Flash&lt;/td&gt;
&lt;td&gt;Cloud-hosted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ollama&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Llama, Mistral, etc.&lt;/td&gt;
&lt;td&gt;Local, private&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenRouter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Many models&lt;/td&gt;
&lt;td&gt;Multi-provider routing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Custom endpoints&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any OpenAI-compatible API&lt;/td&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This flexibility means you can choose the right model for your context needs. Local models through Ollama keep all context on your machine. Cloud models provide more capability but send your context to external servers.&lt;/p&gt;
&lt;h2&gt;The Dual-Agent Architecture: Build vs. Plan&lt;/h2&gt;
&lt;p&gt;OpenCode&apos;s most distinctive context management feature is its separation of planning and execution into two independent agents.&lt;/p&gt;
&lt;h3&gt;The Plan Agent&lt;/h3&gt;
&lt;p&gt;The Plan agent reasons about architecture, strategy, and design without making any file changes. Use it for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Analyzing a codebase before making changes&lt;/li&gt;
&lt;li&gt;Designing an implementation approach&lt;/li&gt;
&lt;li&gt;Evaluating tradeoffs between different solutions&lt;/li&gt;
&lt;li&gt;Understanding unfamiliar code&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Plan agent receives the same project context (opencode.json, codebase, MCP) but operates in a read-only mode. This is valuable because it means you can explore and discuss ideas without risk of unintended changes.&lt;/p&gt;
&lt;h3&gt;The Build Agent&lt;/h3&gt;
&lt;p&gt;The Build agent writes code, creates files, runs commands, and makes changes to your project. It uses the planning context plus implementation-specific details:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The specific files that need modification&lt;/li&gt;
&lt;li&gt;Test commands to verify changes&lt;/li&gt;
&lt;li&gt;Style and formatting requirements&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Switching Between Agents&lt;/h3&gt;
&lt;p&gt;Switch between Plan and Build during a session to match the current need:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Start with Plan:&lt;/strong&gt; &amp;quot;Analyze the authentication module and suggest how to add OAuth support&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Review the plan:&lt;/strong&gt; Evaluate the agent&apos;s architectural proposal&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Switch to Build:&lt;/strong&gt; &amp;quot;Implement the OAuth integration following the approach you described&amp;quot;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This two-phase approach prevents the common problem of AI agents diving into implementation before understanding the architecture.&lt;/p&gt;
&lt;h2&gt;Session Persistence&lt;/h2&gt;
&lt;p&gt;OpenCode uses SQLite to persist session data across terminal sessions. This means you can close your terminal, come back later, and pick up where you left off.&lt;/p&gt;
&lt;h3&gt;What Gets Persisted&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Conversation history (messages and responses)&lt;/li&gt;
&lt;li&gt;File changes made during the session&lt;/li&gt;
&lt;li&gt;Agent state (Plan vs. Build mode)&lt;/li&gt;
&lt;li&gt;Active context (which files were being discussed)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Session Management&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Start a new session for unrelated work&lt;/li&gt;
&lt;li&gt;Continue an existing session when resuming previous work&lt;/li&gt;
&lt;li&gt;Clear session history when accumulated context becomes counterproductive&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Context Compaction&lt;/h3&gt;
&lt;p&gt;For long sessions, OpenCode supports context compaction. This summarizes older conversation history to free up context window space while retaining the essential information. Compaction is automatic and configurable: you can control how aggressively it summarizes based on your model&apos;s context window size.&lt;/p&gt;
&lt;p&gt;This is particularly important when using models with smaller context windows (like local Ollama models with 8K or 32K contexts) where every token counts. Cloud models with 128K or 200K windows have much more room, but even they benefit from compaction during extended sessions.&lt;/p&gt;
&lt;h3&gt;Context Window Management Across Providers&lt;/h3&gt;
&lt;p&gt;Different providers offer different context window sizes, and your strategy should adapt:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider Tier&lt;/th&gt;
&lt;th&gt;Context Size&lt;/th&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Small (8K-32K)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ollama local models&lt;/td&gt;
&lt;td&gt;Aggressive compaction, focused sessions, minimal background context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Medium (64K-128K)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GPT-4o, Claude Sonnet&lt;/td&gt;
&lt;td&gt;Standard compaction, moderate session length, room for codebase context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Large (200K+)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude Opus, Gemini Pro&lt;/td&gt;
&lt;td&gt;Minimal compaction needed, can handle long sessions with extensive context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Understanding your working model&apos;s context limit helps you decide how much context to load via &lt;code&gt;opencode.json&lt;/code&gt; versus providing interactively. With a small local model, lean heavily on precise &lt;code&gt;include&lt;/code&gt; patterns to keep only the most relevant files in context. With a large cloud model, you can afford broader context.&lt;/p&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;OpenCode supports MCP through the &lt;code&gt;opencode mcp&lt;/code&gt; command, providing integration with external tools and data.&lt;/p&gt;
&lt;h3&gt;Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Add an MCP server
opencode mcp add my-db-server -- npx @my-org/db-mcp-server

# List configured servers
opencode mcp list

# Remove a server
opencode mcp remove my-db-server
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;MCP servers can also be configured in &lt;code&gt;opencode.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcp&amp;quot;: {
    &amp;quot;servers&amp;quot;: {
      &amp;quot;filesystem&amp;quot;: {
        &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
        &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@anthropic/mcp-server-filesystem&amp;quot;, &amp;quot;./&amp;quot;]
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When to Use MCP with OpenCode&lt;/h3&gt;
&lt;p&gt;The same principles apply as with other terminal agents: use MCP when the task requires data from outside the codebase (databases, APIs, external services). For code-only work, OpenCode&apos;s built-in file access is sufficient.&lt;/p&gt;
&lt;p&gt;One consideration specific to OpenCode: if you are using a local model through Ollama, MCP adds server-side processing that runs locally. There is no additional privacy concern since everything stays on your machine.&lt;/p&gt;
&lt;h2&gt;LSP Integration: Real-Time Code Intelligence&lt;/h2&gt;
&lt;p&gt;OpenCode integrates with Language Server Protocol services to provide richer code context. LSP gives the agent:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Type information and function signatures&lt;/li&gt;
&lt;li&gt;Import resolution and dependency tracking&lt;/li&gt;
&lt;li&gt;Error and warning diagnostics from your language&apos;s toolchain&lt;/li&gt;
&lt;li&gt;Symbol navigation and reference finding&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This means OpenCode understands your code at a deeper level than simple text analysis. When you ask about a function, the agent knows its type signature, where it is called from, and what it depends on.&lt;/p&gt;
&lt;h3&gt;Why LSP Matters for Context&lt;/h3&gt;
&lt;p&gt;LSP provides structured context that would otherwise require the agent to infer from raw code. Knowing that a variable is of type &lt;code&gt;List[UserModel]&lt;/code&gt; is more precise than the agent guessing from how the variable is used. This structured understanding reduces errors and produces more accurate code generation.&lt;/p&gt;
&lt;h2&gt;Custom Commands&lt;/h2&gt;
&lt;p&gt;OpenCode supports user-defined custom commands that encapsulate common operations with predefined context:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;commands&amp;quot;: {
    &amp;quot;review&amp;quot;: {
      &amp;quot;description&amp;quot;: &amp;quot;Review the current branch for issues&amp;quot;,
      &amp;quot;prompt&amp;quot;: &amp;quot;Review all changes in the current branch compared to main. Check for: security issues, performance problems, missing error handling, and test coverage gaps.&amp;quot;
    },
    &amp;quot;test-all&amp;quot;: {
      &amp;quot;description&amp;quot;: &amp;quot;Run and analyze the full test suite&amp;quot;,
      &amp;quot;prompt&amp;quot;: &amp;quot;Run the complete test suite. Report any failures, flaky tests, or tests that take unusually long. Suggest fixes for any failures.&amp;quot;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Custom commands combine a descriptive name with a predefined prompt, creating reusable context bundles for common workflows.&lt;/p&gt;
&lt;h2&gt;Thinking About Context Levels in OpenCode&lt;/h2&gt;
&lt;h3&gt;Minimal Context&lt;/h3&gt;
&lt;p&gt;For quick questions about the codebase, just ask. OpenCode will explore files as needed.&lt;/p&gt;
&lt;h3&gt;Moderate Context&lt;/h3&gt;
&lt;p&gt;For feature work, set up your &lt;code&gt;opencode.json&lt;/code&gt; with clear instructions and use the Plan agent first to establish understanding before switching to Build.&lt;/p&gt;
&lt;h3&gt;Heavy Context&lt;/h3&gt;
&lt;p&gt;For complex refactoring or architectural changes, combine: detailed &lt;code&gt;opencode.json&lt;/code&gt; instructions, the Plan agent for architecture analysis, MCP servers for database or service context, and custom commands for verification steps.&lt;/p&gt;
&lt;h2&gt;External Documents: PDFs vs. Markdown&lt;/h2&gt;
&lt;h3&gt;Markdown Is Preferred&lt;/h3&gt;
&lt;p&gt;OpenCode works with text-based formats. Project context documents, architecture decision records, and coding standards should be Markdown files in your repository.&lt;/p&gt;
&lt;h3&gt;PDFs&lt;/h3&gt;
&lt;p&gt;If you have reference material in PDF format, convert the relevant sections to Markdown. OpenCode does not have built-in PDF parsing, so text-based formats are more reliable.&lt;/p&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Privacy-First Development Pattern&lt;/h3&gt;
&lt;p&gt;Use Ollama with a local model for sensitive codebases:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Install Ollama and download a capable model (Llama 3.1, Mistral Large, etc.)&lt;/li&gt;
&lt;li&gt;Configure &lt;code&gt;opencode.json&lt;/code&gt; to use the local Ollama endpoint&lt;/li&gt;
&lt;li&gt;All context stays on your machine with zero network calls&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is particularly valuable for proprietary code, pre-launch features, or security-sensitive applications.&lt;/p&gt;
&lt;h3&gt;The Plan-Then-Build Pattern&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Start with the Plan agent to analyze the codebase&lt;/li&gt;
&lt;li&gt;Discuss the architecture and design approach&lt;/li&gt;
&lt;li&gt;Switch to Build once you agree on the plan&lt;/li&gt;
&lt;li&gt;Use custom commands to verify the implementation&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;The Multi-Provider Context Strategy&lt;/h3&gt;
&lt;p&gt;Use different providers for different context needs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A large cloud model (GPT-4o, Claude Opus) for complex architectural planning&lt;/li&gt;
&lt;li&gt;A fast, small model for quick edits and simple tasks&lt;/li&gt;
&lt;li&gt;A local model for sensitive code that should not leave your machine&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Switch providers in &lt;code&gt;opencode.json&lt;/code&gt; based on the current task.&lt;/p&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not configuring opencode.json.&lt;/strong&gt; Without it, OpenCode has no project context beyond what it can infer from file exploration.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Using Build when you should Plan.&lt;/strong&gt; Jumping to code changes without planning leads to rework. Use the Plan agent first for anything non-trivial.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring context compaction.&lt;/strong&gt; With smaller model context windows, long sessions degrade quality. Let compaction do its job, or start fresh sessions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not leveraging LSP.&lt;/strong&gt; Ensure your language&apos;s LSP server is installed and running. The structured code intelligence significantly improves agent accuracy.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Skipping custom commands for repeated tasks.&lt;/strong&gt; If you run the same kind of review or test analysis frequently, create a custom command.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Using cloud models for sensitive code without consideration.&lt;/strong&gt; If code privacy matters, use Ollama with local models. The trade-off is sometimes reduced capability, but the privacy guarantee is absolute.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about AI-assisted development and context management strategies, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for Google Antigravity: A Complete Guide to the Agent-First IDE</title><link>https://iceberglakehouse.com/posts/2026-03-context-google-antigravity/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-google-antigravity/</guid><description>
Google Antigravity is an agent-first IDE built by Google DeepMind&apos;s Advanced Agentic Coding team. It approaches context management differently from o...</description><pubDate>Sat, 07 Mar 2026 17:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Google Antigravity is an agent-first IDE built by Google DeepMind&apos;s Advanced Agentic Coding team. It approaches context management differently from other AI coding tools because it is designed from the ground up around agentic workflows, where the AI is not just an assistant responding to prompts, but an autonomous agent that plans, executes, tracks progress, and retains knowledge across sessions. Its context management system centers on three pillars: Skills for reusable capability, Knowledge Items for persistent memory, and Artifacts for transparent documentation of its work.&lt;/p&gt;
&lt;p&gt;This guide covers how to structure and manage context in Antigravity to get the most from its agentic capabilities.&lt;/p&gt;
&lt;h2&gt;How Antigravity Manages Context&lt;/h2&gt;
&lt;p&gt;Antigravity assembles its working context from multiple sources, layered by persistence:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Knowledge Items (KIs)&lt;/strong&gt; - persistent, distilled knowledge from past conversations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Skills&lt;/strong&gt; (SKILL.md files) - reusable instruction sets for specific capabilities&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Workflows&lt;/strong&gt; - step-by-step guides in the &lt;code&gt;.agents/workflows/&lt;/code&gt; directory&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Conversation history&lt;/strong&gt; - the current and past interactions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The codebase&lt;/strong&gt; - files, directories, and project structure&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MCP servers&lt;/strong&gt; - external tools and data sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Task artifacts&lt;/strong&gt; - implementation plans, walkthroughs, and checklists the AI creates&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;What makes Antigravity distinctive is that it actively generates and maintains its own context artifacts. The AI creates task checklists, implementation plans, and walkthroughs as it works, and these become part of the persistent context for future sessions.&lt;/p&gt;
&lt;h2&gt;Skills: Reusable Capability Packages&lt;/h2&gt;
&lt;p&gt;Skills are Antigravity&apos;s primary mechanism for defining reusable capabilities. Each Skill is a folder containing a &lt;code&gt;SKILL.md&lt;/code&gt; file with YAML frontmatter and detailed Markdown instructions.&lt;/p&gt;
&lt;h3&gt;Skill Structure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;.agents/skills/
  my-skill/
    SKILL.md          # Required: instructions with YAML frontmatter
    scripts/          # Optional: helper scripts
    examples/         # Optional: reference implementations
    resources/        # Optional: templates or assets
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;SKILL.md Format&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
name: deploy-to-staging
description: Deploy the application to the staging environment
---

## Prerequisites

- Docker must be installed and running
- AWS CLI must be configured with staging credentials
- The current branch must have passing CI

## Steps

1. Build the Docker image with the staging configuration
2. Push the image to ECR
3. Update the ECS task definition
4. Trigger the deployment
5. Verify the health check endpoint responds

## Verification

- Check that the /health endpoint returns 200
- Verify the deployed version matches the expected Git SHA
- Run the smoke test suite against staging
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When to Create Skills&lt;/h3&gt;
&lt;p&gt;Create a Skill when you have a workflow that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You perform more than once&lt;/li&gt;
&lt;li&gt;Requires specific steps in a specific order&lt;/li&gt;
&lt;li&gt;Benefits from consistent execution across team members&lt;/li&gt;
&lt;li&gt;Involves domain knowledge that is not obvious from the codebase alone&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Skills vs. Other Context Mechanisms&lt;/h3&gt;
&lt;p&gt;Skills are for procedural knowledge (&amp;quot;how to do X&amp;quot;). They differ from:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Knowledge Items&lt;/strong&gt; which store factual knowledge (&amp;quot;what is X&amp;quot;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GEMINI.md or CLAUDE.md style files&lt;/strong&gt; which provide ambient project context&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Artifacts&lt;/strong&gt; which document specific work done in a specific session&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Knowledge Items: Persistent Memory Across Conversations&lt;/h2&gt;
&lt;p&gt;Knowledge Items (KIs) are Antigravity&apos;s mechanism for retaining knowledge across conversations. Unlike conversation history (which is session-bound), KIs are distilled, curated facts that persist indefinitely.&lt;/p&gt;
&lt;h3&gt;How KIs Work&lt;/h3&gt;
&lt;p&gt;At the end of each conversation, a separate Knowledge Subagent analyzes the conversation and extracts key information into KIs. Each KI has:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;metadata.json&lt;/strong&gt;: summary, timestamps, references to original conversations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;artifacts/&lt;/strong&gt;: related files, documentation, and analysis&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;KIs are stored in the Knowledge directory and are automatically loaded when starting new conversations. Antigravity checks KI summaries at the beginning of every session to avoid redundant work.&lt;/p&gt;
&lt;h3&gt;What Gets Stored as KIs&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Architecture decisions and their rationale&lt;/li&gt;
&lt;li&gt;Troubleshooting discoveries and resolutions&lt;/li&gt;
&lt;li&gt;Implementation patterns specific to your project&lt;/li&gt;
&lt;li&gt;Configuration details and their implications&lt;/li&gt;
&lt;li&gt;Integration specifics for external services&lt;/li&gt;
&lt;li&gt;Performance characteristics and optimization strategies&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Using KIs Effectively&lt;/h3&gt;
&lt;p&gt;The most important rule for KIs is: &lt;strong&gt;always check them before starting research.&lt;/strong&gt; If you are about to analyze a codebase module, check whether a KI already covers that analysis. This prevents redundant work and ensures continuity across sessions.&lt;/p&gt;
&lt;p&gt;You can also reference specific KIs in conversations by pointing Antigravity at the KI&apos;s artifact files. This is especially useful when building on previous work or when onboarding new team members who can review the accumulated KIs.&lt;/p&gt;
&lt;h2&gt;Artifacts: The Transparency System&lt;/h2&gt;
&lt;p&gt;Antigravity creates artifacts as structured Markdown documents that make the agent&apos;s work transparent and reviewable. Key artifact types include:&lt;/p&gt;
&lt;h3&gt;task.md&lt;/h3&gt;
&lt;p&gt;A checklist that tracks progress on the current task. Antigravity creates this at the start of complex work and updates it as it progresses:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Feature: User Authentication

- [x] Research existing auth patterns
- [x] Create implementation plan
- [/] Implement JWT token generation
- [ ] Add refresh token support
- [ ] Write integration tests
- [ ] Update API documentation
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;implementation_plan.md&lt;/h3&gt;
&lt;p&gt;Created during the PLANNING phase, this documents the proposed changes, file modifications, and verification strategy before any code is written. You review and approve (or modify) this plan before Antigravity proceeds to execution.&lt;/p&gt;
&lt;h3&gt;walkthrough.md&lt;/h3&gt;
&lt;p&gt;Created after completing work, this documents what was accomplished, what was tested, and the results. It serves as a record of the work and can be reviewed by team members.&lt;/p&gt;
&lt;h3&gt;Why Artifacts Matter for Context&lt;/h3&gt;
&lt;p&gt;Artifacts create a structured record that Antigravity can reference in future sessions. When you return to a project, the agent can read the previous implementation plan and walkthrough to understand what was done and why. This is far more efficient than re-analyzing the codebase from scratch.&lt;/p&gt;
&lt;h2&gt;Thinking About Context Levels in Antigravity&lt;/h2&gt;
&lt;h3&gt;Minimal Context (Quick Tasks)&lt;/h3&gt;
&lt;p&gt;For simple questions or small fixes, just ask. Antigravity can explore the codebase, read relevant files, and provide answers without additional setup. Its file exploration tools are fast and respect &lt;code&gt;.gitignore&lt;/code&gt; patterns.&lt;/p&gt;
&lt;h3&gt;Moderate Context (Feature Work)&lt;/h3&gt;
&lt;p&gt;For typical feature development, let Antigravity&apos;s Planning phase do the heavy lifting. It will:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Analyze the codebase to understand the current architecture&lt;/li&gt;
&lt;li&gt;Create an implementation plan for your review&lt;/li&gt;
&lt;li&gt;Execute the plan once approved&lt;/li&gt;
&lt;li&gt;Verify the changes&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The PLANNING &amp;gt; EXECUTION &amp;gt; VERIFICATION workflow is built into Antigravity&apos;s DNA, and each phase generates artifacts that carry context forward.&lt;/p&gt;
&lt;h3&gt;Comprehensive Context (Ongoing Projects)&lt;/h3&gt;
&lt;p&gt;For sustained work across multiple sessions, invest in Skills and ensure KIs are accumulating properly. Over time, Antigravity builds a rich knowledge base about your project that makes each subsequent session more productive.&lt;/p&gt;
&lt;h2&gt;Multi-Model Support and Context Routing&lt;/h2&gt;
&lt;p&gt;Antigravity supports multiple AI models and can use different models for different subtasks. This means context management extends to model selection: some tasks benefit from larger context windows, while others benefit from faster inference.&lt;/p&gt;
&lt;p&gt;The agent handles this transparently, but being aware of it helps you understand why some responses might take longer (larger model processing more context) while others are faster (smaller model handling a focused subtask).&lt;/p&gt;
&lt;h2&gt;Browser Recording and Visual Context&lt;/h2&gt;
&lt;p&gt;Antigravity includes a built-in browser interaction system that records all browser actions as WebP videos. This creates a unique form of context: visual proof of work that can be reviewed later.&lt;/p&gt;
&lt;p&gt;For frontend development, this means Antigravity can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Navigate to web applications and interact with UI elements&lt;/li&gt;
&lt;li&gt;Take screenshots to verify visual changes&lt;/li&gt;
&lt;li&gt;Record step-by-step interactions for documentation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These recordings become part of the walkthrough artifact, providing visual evidence that changes work as intended.&lt;/p&gt;
&lt;h2&gt;Conversation History and Context Summaries&lt;/h2&gt;
&lt;p&gt;Antigravity maintains conversation logs and summaries that persist across sessions. When you start a new conversation, the system provides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Summaries of recent conversations&lt;/li&gt;
&lt;li&gt;KI summaries with artifact paths&lt;/li&gt;
&lt;li&gt;Information about previously edited and viewed files&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This means Antigravity starts each session with awareness of what happened in recent sessions, reducing the need to re-explain context that was covered before.&lt;/p&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;Antigravity supports MCP servers for connecting to external tools and data sources. Configuration follows the standard MCP pattern familiar from other tools.&lt;/p&gt;
&lt;h3&gt;Practical Use Cases&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Database access:&lt;/strong&gt; Let Antigravity query your development database to understand schema and data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Browser automation:&lt;/strong&gt; Verify frontend changes visually&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Git hosting:&lt;/strong&gt; Interact with GitHub or GitLab for PR management&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documentation systems:&lt;/strong&gt; Access internal wikis or knowledge bases&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;When to Use MCP&lt;/h3&gt;
&lt;p&gt;Use MCP when the task requires information from outside the codebase. For code-only work, Antigravity&apos;s built-in file system tools are sufficient. MCP adds the most value for tasks that span multiple systems (for example, updating both code and documentation, or verifying a code change against a running application).&lt;/p&gt;
&lt;h2&gt;External Documents: PDFs vs. Markdown&lt;/h2&gt;
&lt;h3&gt;Markdown Is the Native Format&lt;/h3&gt;
&lt;p&gt;Skills, KIs, and artifacts are all Markdown. If you are creating context documents for Antigravity, use Markdown.&lt;/p&gt;
&lt;h3&gt;For External References&lt;/h3&gt;
&lt;p&gt;PDF documents can be provided as context through conversation uploads. However, for persistent reference material, converting to Markdown and placing it in a project directory (or as a Skill resource) provides better integration with Antigravity&apos;s context system.&lt;/p&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Skill-Driven Development Pattern&lt;/h3&gt;
&lt;p&gt;Create Skills for every major workflow in your development process:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;deploy-staging&lt;/code&gt; for deployment&lt;/li&gt;
&lt;li&gt;&lt;code&gt;create-api-endpoint&lt;/code&gt; for new endpoints&lt;/li&gt;
&lt;li&gt;&lt;code&gt;database-migration&lt;/code&gt; for schema changes&lt;/li&gt;
&lt;li&gt;&lt;code&gt;security-audit&lt;/code&gt; for security reviews&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When you need to perform one of these tasks, point Antigravity at the relevant Skill. This ensures consistent execution regardless of which team member is working.&lt;/p&gt;
&lt;h3&gt;The Knowledge Accumulation Pattern&lt;/h3&gt;
&lt;p&gt;Treat KIs as a growing knowledge base about your project:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;First session: Antigravity learns the basic architecture&lt;/li&gt;
&lt;li&gt;Subsequent sessions: KIs accumulate details about specific modules, patterns, and decisions&lt;/li&gt;
&lt;li&gt;Over time: Antigravity starts with a deep understanding of your project every session&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This compounds over weeks and months, making the AI increasingly effective.&lt;/p&gt;
&lt;h3&gt;The Paired Review Pattern&lt;/h3&gt;
&lt;p&gt;Use Antigravity&apos;s PLANNING phase as a design review:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Describe the feature or change you want&lt;/li&gt;
&lt;li&gt;Review the implementation plan Antigravity creates&lt;/li&gt;
&lt;li&gt;Provide feedback and iterate on the plan&lt;/li&gt;
&lt;li&gt;Only approve execution once the plan meets your standards&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This catches design issues before code is written, saving significant time.&lt;/p&gt;
&lt;h3&gt;The Task Decomposition Pattern&lt;/h3&gt;
&lt;p&gt;For large features, let Antigravity break the work into multiple task boundary segments:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Tell Antigravity the overall goal&lt;/li&gt;
&lt;li&gt;It creates a task.md with subtasks&lt;/li&gt;
&lt;li&gt;Each subtask gets its own PLANNING &amp;gt; EXECUTION &amp;gt; VERIFICATION cycle&lt;/li&gt;
&lt;li&gt;The walkthrough artifact captures the full story for future reference&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring KI summaries.&lt;/strong&gt; Antigravity provides KI summaries at the start of each conversation. Skipping them leads to redundant work and missed context.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not creating Skills for repeatable work.&lt;/strong&gt; If you find yourself explaining the same workflow multiple times, it should be a Skill.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Skipping the PLANNING phase.&lt;/strong&gt; Jumping straight to execution means no implementation plan to review. The PLANNING phase is where Antigravity aligns with your intent.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not reviewing artifacts.&lt;/strong&gt; Implementation plans and walkthroughs are designed for human review. Skipping them defeats the purpose of Antigravity&apos;s transparency system.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Over-relying on conversation context.&lt;/strong&gt; Conversation history is ephemeral. For information that should persist, ensure it gets captured in Skills or KIs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not building Workflows for common tasks.&lt;/strong&gt; The &lt;code&gt;.agents/workflows/&lt;/code&gt; directory supports step-by-step guides that Antigravity follows precisely. These are particularly useful for onboarding, deployment, and maintenance tasks.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about working effectively with AI coding agents and managing context across development workflows, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for Gemini CLI: A Complete Guide to Terminal-Native AI Development</title><link>https://iceberglakehouse.com/posts/2026-03-context-gemini-cli/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-gemini-cli/</guid><description>
Gemini CLI is an open-source terminal agent powered by Gemini models that operates directly in your command line. It brings Google&apos;s AI capabilities ...</description><pubDate>Sat, 07 Mar 2026 16:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Gemini CLI is an open-source terminal agent powered by Gemini models that operates directly in your command line. It brings Google&apos;s AI capabilities into the environment where many developers already live, with a context management system built around hierarchical configuration files, persistent memory, MCP server integration, and direct codebase interaction. Unlike web-based tools where context is managed through uploads and conversation, Gemini CLI assembles its context from your project structure, your instruction files, and the tools you connect to it.&lt;/p&gt;
&lt;p&gt;This guide covers every context management mechanism in Gemini CLI and explains how to configure them for productive development workflows.&lt;/p&gt;
&lt;h2&gt;How Gemini CLI Assembles Context&lt;/h2&gt;
&lt;p&gt;Gemini CLI builds its working context from multiple sources, loaded in a specific hierarchy:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Global GEMINI.md&lt;/strong&gt; (&lt;code&gt;~/.gemini/GEMINI.md&lt;/code&gt;) - personal preferences that apply everywhere&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Project GEMINI.md&lt;/strong&gt; (in your project directory, walking up to the root) - project conventions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Subdirectory GEMINI.md files&lt;/strong&gt; - component-specific instructions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory entries&lt;/strong&gt; - facts you have told the CLI to remember&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MCP server tools&lt;/strong&gt; - external data sources and capabilities&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The current codebase&lt;/strong&gt; - files, dependencies, project structure&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The conversation&lt;/strong&gt; - your prompts and responses in the current session&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;More specific sources take precedence over general ones. A subdirectory GEMINI.md instruction overrides a project-level GEMINI.md instruction on the same topic.&lt;/p&gt;
&lt;h2&gt;GEMINI.md: Persistent Project Context&lt;/h2&gt;
&lt;p&gt;GEMINI.md is the foundational context mechanism. It is a Markdown file that Gemini CLI loads automatically before every interaction.&lt;/p&gt;
&lt;h3&gt;File Hierarchy&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Location&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;~/.gemini/GEMINI.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;All projects&lt;/td&gt;
&lt;td&gt;Personal coding style, universal preferences&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;./GEMINI.md&lt;/code&gt; (project root)&lt;/td&gt;
&lt;td&gt;Current project&lt;/td&gt;
&lt;td&gt;Architecture, stack, conventions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;./src/GEMINI.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Specific directory&lt;/td&gt;
&lt;td&gt;Module-specific patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;What to Include&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# GEMINI.md

## Project: E-Commerce API

- Framework: Express.js on Node 22
- Database: PostgreSQL 16 with Drizzle ORM
- Testing: Vitest with supertest for API tests
- Deployment: Docker containers on Cloud Run

## Code Conventions

- Use ESM imports (no CommonJS require)
- All route handlers are async functions
- Error handling uses a centralized error middleware
- SQL migrations use Drizzle Kit

## Architecture

- Routes: src/routes/
- Services: src/services/ (business logic)
- Models: src/models/ (Drizzle schema)
- Middleware: src/middleware/
- Tests: tests/ (mirrors src/ structure)

## Do Not

- Do not use default exports
- Do not install packages without noting them
- Do not modify migration files after they have been applied
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Modular GEMINI.md Files&lt;/h3&gt;
&lt;p&gt;For complex projects, GEMINI.md files can import other Markdown files. This keeps individual files focused while allowing the CLI to assemble comprehensive context:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# GEMINI.md

See also:

- @docs/coding-standards.md
- @docs/api-conventions.md
- @docs/testing-strategy.md
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;The /init Command&lt;/h3&gt;
&lt;p&gt;If you are starting a new project or onboarding Gemini CLI to an existing one, run &lt;code&gt;/init&lt;/code&gt;. This command analyzes your project structure and generates a starting GEMINI.md file that captures:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Detected frameworks and languages&lt;/li&gt;
&lt;li&gt;Project structure&lt;/li&gt;
&lt;li&gt;Build and test commands&lt;/li&gt;
&lt;li&gt;Basic conventions inferred from the code&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Review and edit the generated file. The auto-detection is a starting point, not a finished product. Add your team&apos;s conventions, architectural decisions, and quality standards to make it comprehensive. The value of /init is that it saves you from writing the boilerplate sections (project type, folder structure, detected dependencies) so you can focus on the human-knowledge sections.&lt;/p&gt;
&lt;h2&gt;Memory: Persistent Facts Across Sessions&lt;/h2&gt;
&lt;p&gt;Gemini CLI&apos;s memory system stores persistent facts that apply across all sessions and projects (when stored globally).&lt;/p&gt;
&lt;h3&gt;Adding Memories&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;/memory add We use the Google Python Style Guide for all Python code
/memory add Our PostgreSQL database runs on port 5433, not the default 5432
/memory add Always use UTC timestamps in database columns
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Viewing Memories&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;/memory show
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This displays all active memories, including those from GEMINI.md files and explicit memory entries.&lt;/p&gt;
&lt;h3&gt;Refreshing Context&lt;/h3&gt;
&lt;p&gt;If you update GEMINI.md files outside of the current session, use:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/memory refresh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This reloads all context files without restarting the CLI.&lt;/p&gt;
&lt;h3&gt;Memory Best Practices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Use memory for facts that are true across projects (your personal conventions)&lt;/li&gt;
&lt;li&gt;Use GEMINI.md for project-specific context&lt;/li&gt;
&lt;li&gt;Keep memories concise: &amp;quot;Use Ruff for Python linting&amp;quot; rather than a paragraph explaining why&lt;/li&gt;
&lt;li&gt;Review memories periodically with &lt;code&gt;/memory show&lt;/code&gt; and remove outdated entries&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Direct Context Injection with @&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;@&lt;/code&gt; command lets you inject specific files or directories directly into a prompt:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@src/models/user.ts How should I add a preferences field to this model?
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;@src/routes/ Review all route handlers for consistent error handling
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the most direct way to give Gemini CLI context about specific files. Unlike other tools that require uploads, the @ command reads from your local file system in real time.&lt;/p&gt;
&lt;h3&gt;When to Use @&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;When your question relates to specific files that Gemini CLI might not automatically discover&lt;/li&gt;
&lt;li&gt;When you want to ensure the agent reads the latest version of a file&lt;/li&gt;
&lt;li&gt;When you want to focus the agent on a particular section of the codebase&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;Gemini CLI supports MCP through its &lt;code&gt;settings.json&lt;/code&gt; configuration. MCP servers extend the CLI&apos;s capabilities by connecting it to external tools and data sources.&lt;/p&gt;
&lt;h3&gt;Configuration&lt;/h3&gt;
&lt;p&gt;MCP servers are configured in &lt;code&gt;settings.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;github&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@anthropic/mcp-server-github&amp;quot;]
    },
    &amp;quot;postgres&amp;quot;: {
      &amp;quot;httpUrl&amp;quot;: &amp;quot;http://localhost:3001/mcp&amp;quot;
    },
    &amp;quot;custom-tool&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;python&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;./scripts/my-mcp-server.py&amp;quot;],
      &amp;quot;env&amp;quot;: {
        &amp;quot;API_KEY&amp;quot;: &amp;quot;${MY_API_KEY}&amp;quot;
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note the environment variable expansion (&lt;code&gt;${MY_API_KEY}&lt;/code&gt;), which lets you keep credentials out of configuration files.&lt;/p&gt;
&lt;h3&gt;Transport Options&lt;/h3&gt;
&lt;p&gt;Gemini CLI supports three MCP transport mechanisms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;stdio:&lt;/strong&gt; The server runs as a local process (most common for development)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SSE (Server-Sent Events):&lt;/strong&gt; For remote servers using the &lt;code&gt;url&lt;/code&gt; property&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;HTTP Streaming:&lt;/strong&gt; For modern HTTP-based servers using the &lt;code&gt;httpUrl&lt;/code&gt; property&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;MCP Prompts as Slash Commands&lt;/h3&gt;
&lt;p&gt;MCP servers can expose predefined prompts as slash commands. If a connected server exposes a prompt named &amp;quot;analyze-performance,&amp;quot; you can invoke it with &lt;code&gt;/analyze-performance&lt;/code&gt; directly in the CLI.&lt;/p&gt;
&lt;h3&gt;When to Use MCP&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Use MCP for:&lt;/strong&gt; Database access, GitHub integration, browser automation, accessing internal APIs, connecting to project management tools&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Skip MCP when:&lt;/strong&gt; The task is code-only and the files are already on your local system. Gemini CLI can read files and run terminal commands directly without MCP.&lt;/p&gt;
&lt;h2&gt;Dynamic Shell Context&lt;/h2&gt;
&lt;p&gt;One of Gemini CLI&apos;s unique strengths is its ability to execute shell commands to gather real-time context. This means the agent can check the actual state of your system rather than relying on static descriptions.&lt;/p&gt;
&lt;h3&gt;Practical Use Cases&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Check current Git state:&lt;/strong&gt; The agent can run &lt;code&gt;git status&lt;/code&gt; or &lt;code&gt;git log&lt;/code&gt; to understand what has changed recently, which branch you are on, and what commits are pending&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inspect running services:&lt;/strong&gt; Commands like &lt;code&gt;docker ps&lt;/code&gt; or &lt;code&gt;kubectl get pods&lt;/code&gt; give the agent visibility into your running infrastructure&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Read live configuration:&lt;/strong&gt; The agent can check environment variables, read &lt;code&gt;.env&lt;/code&gt; files, or inspect running process configurations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verify test results:&lt;/strong&gt; Running your test suite and analyzing the output gives the agent concrete data about what is passing and what is failing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This dynamic context is especially valuable for debugging workflows, where the agent needs to understand both the code and the runtime environment.&lt;/p&gt;
&lt;h2&gt;Automatic Codebase Exploration&lt;/h2&gt;
&lt;p&gt;Gemini CLI automatically explores your project structure using tools that respect &lt;code&gt;.gitignore&lt;/code&gt; patterns. It will not waste context on &lt;code&gt;node_modules/&lt;/code&gt;, &lt;code&gt;__pycache__/&lt;/code&gt;, or build output. It also detects project types from configuration files (for example, finding &lt;code&gt;package.json&lt;/code&gt; tells it this is a Node.js project).&lt;/p&gt;
&lt;p&gt;This automatic exploration means you can ask broad questions like &amp;quot;What database does this project use?&amp;quot; and the agent will find the answer by scanning relevant configuration files. However, GEMINI.md files significantly improve results by providing context that cannot be inferred from code alone: team decisions, architectural rationale, and development philosophy.&lt;/p&gt;
&lt;h2&gt;Choosing Gemini CLI vs. Other Terminal Agents&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Choose Gemini CLI over Claude Code when:&lt;/strong&gt; You prefer Google&apos;s Gemini models, need the hierarchical GEMINI.md system, or want MCP prompts exposed as slash commands.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Choose Gemini CLI over OpenCode when:&lt;/strong&gt; You want a simpler, more focused tool without OpenCode&apos;s TUI interface, or you are already invested in the Google ecosystem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Choose Gemini CLI over Codex CLI when:&lt;/strong&gt; You want an open-source tool you can inspect and modify, or you prefer interactive terminal sessions over Codex&apos;s sandbox model.&lt;/p&gt;
&lt;h2&gt;Thinking About the Right Level of Context&lt;/h2&gt;
&lt;h3&gt;For Quick Questions&lt;/h3&gt;
&lt;p&gt;Just ask. Gemini CLI can explore your codebase on its own:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;What database ORM does this project use?
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The CLI will scan your project files, find the relevant configuration, and answer.&lt;/p&gt;
&lt;h3&gt;For Targeted Changes&lt;/h3&gt;
&lt;p&gt;Provide file references and constraints:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@src/services/auth.ts Add rate limiting to the login function.
Use express-rate-limit with a 100-request-per-minute window.
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;For Large Features&lt;/h3&gt;
&lt;p&gt;Invest in GEMINI.md, set up relevant MCP servers, and use the multi-step approach: plan first, then implement.&lt;/p&gt;
&lt;h2&gt;External Documents: PDFs vs. Markdown&lt;/h2&gt;
&lt;h3&gt;Markdown Is Native&lt;/h3&gt;
&lt;p&gt;GEMINI.md files, memory entries, and context documents should all be Markdown. The format is native to Gemini CLI&apos;s context system.&lt;/p&gt;
&lt;h3&gt;PDFs Need Conversion&lt;/h3&gt;
&lt;p&gt;Gemini CLI primarily works with text-based formats. If you have reference material in PDF form, extract the relevant sections into Markdown files and place them in your project directory. This makes them accessible via @ references and GEMINI.md imports.&lt;/p&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Context-Aware Shell Script&lt;/h3&gt;
&lt;p&gt;Create shell scripts that set up project context before launching Gemini CLI:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;#!/bin/bash
# Start Gemini CLI with project-specific context
cd ~/projects/my-api
export DB_URL=&amp;quot;postgresql://dev@localhost:5433/mydb&amp;quot;
gemini
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This ensures the CLI starts in the right directory with the right environment variables, reducing context-switching overhead.&lt;/p&gt;
&lt;h3&gt;The Exploration-First Pattern&lt;/h3&gt;
&lt;p&gt;Before starting a new feature:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Analyze the current authentication system.
Describe the flow from login to token validation.
Do not make any changes.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Review the analysis, correct any misunderstandings, and then proceed with the implementation task.&lt;/p&gt;
&lt;h3&gt;The Automated Context Generation Pattern&lt;/h3&gt;
&lt;p&gt;Use the &lt;code&gt;/init&lt;/code&gt; command periodically (or a custom script) to regenerate your GEMINI.md file as the project evolves. This keeps the context file synchronized with the actual state of the codebase.&lt;/p&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;No GEMINI.md.&lt;/strong&gt; Without it, the CLI starts with no project context. It can still explore your codebase, but it will make assumptions that may not match your conventions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stale GEMINI.md.&lt;/strong&gt; A GEMINI.md that references frameworks or patterns you no longer use creates confusion. Update it when you make significant changes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Overloading memory.&lt;/strong&gt; Memory is for brief, stable facts. Do not try to store entire documents as memory entries.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Adding unnecessary MCP servers.&lt;/strong&gt; Each connected server adds tools that the CLI must evaluate. Only connect servers you actively use.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not using @ for targeted questions.&lt;/strong&gt; Pointing Gemini CLI at specific files with @ produces more focused results than letting it search the entire project.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring /init.&lt;/strong&gt; For new projects, /init generates a solid starting GEMINI.md in seconds. Review and refine it rather than writing from scratch.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Forgetting to refresh after external edits.&lt;/strong&gt; If you edit GEMINI.md files in your text editor, run &lt;code&gt;/memory refresh&lt;/code&gt; so the CLI picks up the changes immediately.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Writing overly long GEMINI.md files.&lt;/strong&gt; GEMINI.md should be focused and scannable. If it exceeds 500 lines, consider splitting it into modular imported files. A concise GEMINI.md with clear sections is more effective than a sprawling document.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about working effectively with AI coding agents and context management strategies, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for Gemini Web and NotebookLM: A Complete Guide to Google&apos;s AI Knowledge Ecosystem</title><link>https://iceberglakehouse.com/posts/2026-03-context-gemini-web-notebooklm/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-gemini-web-notebooklm/</guid><description>
Google&apos;s AI ecosystem for knowledge work consists of two deeply integrated tools: Gemini (the conversational AI at gemini.google.com) and NotebookLM ...</description><pubDate>Sat, 07 Mar 2026 15:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Google&apos;s AI ecosystem for knowledge work consists of two deeply integrated tools: Gemini (the conversational AI at gemini.google.com) and NotebookLM (the research-focused assistant at notebooklm.google.com). In early 2026, these two platforms became interoperable, allowing Gemini to access information stored in NotebookLM notebooks. This integration creates something unique in the AI landscape: a persistent knowledge infrastructure where documents you upload once become available across both conversational and research interfaces.&lt;/p&gt;
&lt;p&gt;This guide covers context management strategies for both Gemini Web and NotebookLM, with a focus on how to use them together for maximum effectiveness.&lt;/p&gt;
&lt;h2&gt;Gemini Web: Context Management Fundamentals&lt;/h2&gt;
&lt;h3&gt;The Context Window&lt;/h3&gt;
&lt;p&gt;Gemini supports one of the largest context windows available, with models like Gemini 3 Pro and Gemini 2.5 Pro offering up to 2 million tokens. This is approximately 1.5 million words of input capacity, enough to process entire books, large codebases, or years of financial data in a single conversation.&lt;/p&gt;
&lt;p&gt;The context window includes everything: your system instructions, conversation history, uploaded files, and Gemini&apos;s responses. While 2 million tokens is enormous, strategic context management still matters because relevance, not volume, determines response quality.&lt;/p&gt;
&lt;h3&gt;Custom Instructions&lt;/h3&gt;
&lt;p&gt;Gemini supports custom instructions that shape how it responds across conversations. Access these through Gemini&apos;s settings. Effective custom instructions include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your professional background and expertise level&lt;/li&gt;
&lt;li&gt;Preferred response style (concise vs. detailed, formal vs. conversational)&lt;/li&gt;
&lt;li&gt;Output format preferences (bullet points, structured sections, code formatting)&lt;/li&gt;
&lt;li&gt;Domain-specific terminology or constraints&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Gems: Specialized AI Assistants&lt;/h3&gt;
&lt;p&gt;Gems are custom AI mini-apps within Gemini. You create a Gem by defining its purpose, instructions, and behavior. Unlike custom instructions (which apply globally), each Gem operates with its own specialized context.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use Gems for repeatable workflows:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &amp;quot;Technical Writer&amp;quot; Gem with your style guide and terminology baked in&lt;/li&gt;
&lt;li&gt;A &amp;quot;Data Analyst&amp;quot; Gem that knows your preferred visualization tools and analysis frameworks&lt;/li&gt;
&lt;li&gt;A &amp;quot;Meeting Prep&amp;quot; Gem that generates agendas and briefing documents in your format&lt;/li&gt;
&lt;li&gt;A &amp;quot;Code Reviewer&amp;quot; Gem that applies your team&apos;s coding standards consistently&lt;/li&gt;
&lt;li&gt;A &amp;quot;Content Editor&amp;quot; Gem that checks for brand voice compliance&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To create a Gem, navigate to the Gems section in Gemini, define its instructions, and optionally upload knowledge files. Once created, you can invoke the Gem anytime without re-establishing its context.&lt;/p&gt;
&lt;h3&gt;Building Effective Gems&lt;/h3&gt;
&lt;p&gt;The quality of a Gem depends entirely on the quality of its instructions. Write Gem instructions as if you are onboarding a new team member to a specific role:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Define the Gem&apos;s role&lt;/strong&gt; (&amp;quot;You are a technical documentation editor for a developer tools company&amp;quot;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Specify the audience&lt;/strong&gt; (&amp;quot;Write for senior developers who know the basics but need guidance on advanced topics&amp;quot;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Set quality standards&lt;/strong&gt; (&amp;quot;Every section must include at least one code example, use active voice, and stay under 300 words per subsection&amp;quot;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Include anti-patterns&lt;/strong&gt; (&amp;quot;Never use jargon without defining it first, never assume the reader has used this tool before&amp;quot;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Provide examples&lt;/strong&gt; of the desired output style when possible&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Notebooks (Projects)&lt;/h3&gt;
&lt;p&gt;Gemini is rolling out &amp;quot;Notebooks&amp;quot; (an evolution of its Projects feature) that let you group conversations by topic and set per-notebook custom instructions. This mirrors the Project concept in other AI tools: a workspace where context persists across conversations.&lt;/p&gt;
&lt;p&gt;Within a Notebook:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Set instructions specific to the topic or project&lt;/li&gt;
&lt;li&gt;Upload files that Gemini can reference in every conversation&lt;/li&gt;
&lt;li&gt;Maintain a collection of related conversations without losing context between them&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;File Uploads&lt;/h3&gt;
&lt;p&gt;Gemini Web supports direct file uploads in conversations:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File Type&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PDF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Research papers, specifications, reports&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Documents&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google Docs, Word files for editing or analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Spreadsheets&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data analysis, financial modeling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Images&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Visual context, screenshots, diagrams&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Transcription and analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Video&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Visual content analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Google Workspace Integration&lt;/h3&gt;
&lt;p&gt;A distinctive Gemini feature is its integration with Google Workspace. With &amp;quot;Personal Intelligence&amp;quot; (available in 2026), Gemini can securely access your Gmail, Drive, Docs, and Calendar to provide context-aware responses grounded in your actual work data. This means Gemini can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Search your email history for relevant communications&lt;/li&gt;
&lt;li&gt;Reference documents in your Google Drive&lt;/li&gt;
&lt;li&gt;Check your calendar when you ask about scheduling&lt;/li&gt;
&lt;li&gt;Pull data from your spreadsheets for analysis&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This integration effectively makes your entire Google Workspace a context source, something no other AI platform currently matches.&lt;/p&gt;
&lt;h2&gt;NotebookLM: Deep Research Context Management&lt;/h2&gt;
&lt;p&gt;NotebookLM is purpose-built for research and knowledge work. Its context management is centered around &amp;quot;notebooks,&amp;quot; each of which contains sources (your uploaded documents) and a conversation interface grounded in those sources.&lt;/p&gt;
&lt;h3&gt;How NotebookLM Handles Context&lt;/h3&gt;
&lt;p&gt;Unlike Gemini (which can draw on its entire training data), NotebookLM responses are grounded exclusively in the sources you upload. This is a feature, not a limitation. When you need answers based specifically on your documents (not the model&apos;s general knowledge), NotebookLM provides citation-backed responses that reference specific sections of your sources.&lt;/p&gt;
&lt;h3&gt;Source Types&lt;/h3&gt;
&lt;p&gt;NotebookLM supports a wide range of source types:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;PDFs:&lt;/strong&gt; Research papers, reports, legal documents&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Google Docs:&lt;/strong&gt; Your own writing, notes, and drafts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Google Slides:&lt;/strong&gt; Presentation content&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Web URLs:&lt;/strong&gt; Articles, documentation, and reference pages&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;YouTube videos:&lt;/strong&gt; Automatic transcription and analysis&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Audio files:&lt;/strong&gt; Podcast episodes, interviews, lectures&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Text files:&lt;/strong&gt; Any plaintext content&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Free tier:&lt;/strong&gt; Up to 50 sources per notebook (500,000 words or 200MB per source)
&lt;strong&gt;NotebookLM Pro:&lt;/strong&gt; Up to 300 sources per notebook&lt;/p&gt;
&lt;h3&gt;Custom Instructions in NotebookLM&lt;/h3&gt;
&lt;p&gt;NotebookLM supports per-notebook custom instructions. You can set:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Response style (&amp;quot;Explain like I am new to this field&amp;quot;)&lt;/li&gt;
&lt;li&gt;Response length preferences&lt;/li&gt;
&lt;li&gt;Tone (academic, conversational, technical)&lt;/li&gt;
&lt;li&gt;Specific focus areas within your sources&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Audio Overviews&lt;/h3&gt;
&lt;p&gt;NotebookLM&apos;s Audio Overview feature generates podcast-style discussions of your uploaded sources. This is a unique context consumption approach: instead of reading AI-generated summaries, you listen to a natural conversation about your documents. Audio Overviews are useful for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Getting a high-level understanding of dense material before deep-reading&lt;/li&gt;
&lt;li&gt;Reviewing research while multitasking&lt;/li&gt;
&lt;li&gt;Sharing knowledge with colleagues who prefer audio formats&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Using Gemini and NotebookLM Together&lt;/h2&gt;
&lt;p&gt;The integration between Gemini and NotebookLM is where the real power emerges.&lt;/p&gt;
&lt;h3&gt;The Knowledge Flow&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Upload sources to NotebookLM:&lt;/strong&gt; Research papers, reports, specifications&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Let NotebookLM build a grounded knowledge base:&lt;/strong&gt; Ask questions, generate summaries, create Audio Overviews&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Import that notebook into Gemini:&lt;/strong&gt; Gemini gains access to all your NotebookLM sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Gemini for broader analysis:&lt;/strong&gt; Gemini combines your specific sources with its general knowledge and web search&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This workflow gives you both grounded, citation-backed analysis (NotebookLM) and broader contextual understanding (Gemini) from the same source material.&lt;/p&gt;
&lt;h3&gt;When to Use Each&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Need&lt;/th&gt;
&lt;th&gt;Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Answers grounded strictly in your documents&lt;/td&gt;
&lt;td&gt;NotebookLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Broad research with web search integration&lt;/td&gt;
&lt;td&gt;Gemini Web&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Citation-backed analysis of specific papers&lt;/td&gt;
&lt;td&gt;NotebookLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Creative ideation and brainstorming&lt;/td&gt;
&lt;td&gt;Gemini Web&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audio summaries of research material&lt;/td&gt;
&lt;td&gt;NotebookLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Integration with Google Workspace data&lt;/td&gt;
&lt;td&gt;Gemini Web&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Side-by-side comparison of source documents&lt;/td&gt;
&lt;td&gt;NotebookLM&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Gems as Auto-Syncing Brains&lt;/h3&gt;
&lt;p&gt;When you create a Gem that is linked to a NotebookLM notebook, the Gem automatically stays in sync with the notebook&apos;s sources. Add a new document to the notebook, and the Gem can reference it immediately. This creates a &amp;quot;specialized brain&amp;quot; that continuously learns from your latest research without requiring you to re-upload files or restate context.&lt;/p&gt;
&lt;h2&gt;External Documents: PDFs vs. Markdown&lt;/h2&gt;
&lt;h3&gt;In NotebookLM&lt;/h3&gt;
&lt;p&gt;NotebookLM works well with PDFs because it extracts and indexes the content for citation-backed retrieval. Since NotebookLM&apos;s primary job is to ground responses in specific documents, PDFs are perfectly suited for this use case.&lt;/p&gt;
&lt;p&gt;However, for your own notes, outlines, and structured reference material, Google Docs or Markdown files (uploaded as text) provide cleaner parsing and are easier to update.&lt;/p&gt;
&lt;h3&gt;In Gemini Web&lt;/h3&gt;
&lt;p&gt;Gemini handles both PDFs and text-based formats well, but the same general rule applies: Markdown and plaintext provide the cleanest AI-parseable context. Use PDFs for published documents you received from others, and Markdown or Google Docs for context you author yourself.&lt;/p&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;As of early 2026, Gemini Web and NotebookLM do not support MCP (Model Context Protocol) server connections. MCP support is available in the Gemini CLI, which is covered in a separate guide.&lt;/p&gt;
&lt;p&gt;For web-based Gemini usage, the Google Workspace integration provides similar benefits to MCP for many use cases: live access to your email, documents, spreadsheets, and calendar. If you need connections to non-Google services (databases, third-party APIs), use the Gemini CLI instead.&lt;/p&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Research Pipeline&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Collect sources in NotebookLM&lt;/strong&gt; (upload papers, articles, reports)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generate an Audio Overview&lt;/strong&gt; for high-level understanding&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ask targeted questions in NotebookLM&lt;/strong&gt; for citation-backed answers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Import the notebook to Gemini&lt;/strong&gt; for broader analysis&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Gemini with web search&lt;/strong&gt; to find related work not in your sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Draft your output in Gemini&lt;/strong&gt; using both grounded sources and general knowledge&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;The Knowledge Base Strategy&lt;/h3&gt;
&lt;p&gt;Use NotebookLM notebooks as persistent knowledge bases for different domains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&amp;quot;Industry Research&amp;quot;&lt;/strong&gt; notebook with market reports and analyst papers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&amp;quot;Technical Reference&amp;quot;&lt;/strong&gt; notebook with API docs and architecture papers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&amp;quot;Competitive Intelligence&amp;quot;&lt;/strong&gt; notebook with competitor materials&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each notebook becomes a specialized resource that you can query independently or combine with Gemini for cross-domain analysis.&lt;/p&gt;
&lt;h3&gt;The Document Synthesis Pattern&lt;/h3&gt;
&lt;p&gt;When you need to synthesize multiple long documents:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Upload all documents to a single NotebookLM notebook&lt;/li&gt;
&lt;li&gt;Ask NotebookLM to summarize each document individually&lt;/li&gt;
&lt;li&gt;Ask it to identify common themes across all documents&lt;/li&gt;
&lt;li&gt;Ask it to highlight contradictions or disagreements between documents&lt;/li&gt;
&lt;li&gt;Use the results in Gemini for a final synthesized analysis&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This approach leverages NotebookLM&apos;s grounding capability for accurate summarization and Gemini&apos;s broader intelligence for synthesis.&lt;/p&gt;
&lt;h2&gt;Structuring Context for Gemini and NotebookLM&lt;/h2&gt;
&lt;h3&gt;In Gemini: Lead with Purpose&lt;/h3&gt;
&lt;p&gt;Because Gemini has such a large context window, it is tempting to dump everything in and hope for the best. Resist this. Structure your inputs:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;State your goal first&lt;/strong&gt; (&amp;quot;I need a comparison table of three database solutions&amp;quot;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Provide the relevant data&lt;/strong&gt; (paste or reference uploaded files)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Specify the output format&lt;/strong&gt; (&amp;quot;Create a markdown table with columns for Feature, Solution A, Solution B, Solution C&amp;quot;)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This pattern works because Gemini prioritizes recent and explicit instructions over ambient context.&lt;/p&gt;
&lt;h3&gt;In NotebookLM: Trust the Grounding&lt;/h3&gt;
&lt;p&gt;NotebookLM is designed to answer from your sources. You do not need to paste content into the chat because the sources are already indexed. Instead, ask specific questions that require the AI to synthesize across your documents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;quot;Compare how Document A and Document B define the term &apos;data mesh&apos;&amp;quot;&lt;/li&gt;
&lt;li&gt;&amp;quot;What evidence in my sources supports the claim that real-time processing reduces costs?&amp;quot;&lt;/li&gt;
&lt;li&gt;&amp;quot;Identify contradictions between the 2024 and 2025 reports on this topic&amp;quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Using Gemini when you need citations.&lt;/strong&gt; If you need responses backed by specific sources, use NotebookLM. Gemini&apos;s general knowledge is powerful but cannot provide page-level citations from your documents.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Overloading a single NotebookLM notebook.&lt;/strong&gt; While Pro supports 300 sources, having too many unrelated documents in one notebook dilutes the AI&apos;s focus. Create separate notebooks for distinct topics.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not using Audio Overviews.&lt;/strong&gt; Audio Overviews are one of NotebookLM&apos;s most underused features. They provide an efficient way to internalize complex material, especially before you start asking detailed questions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring the Gemini-NotebookLM integration.&lt;/strong&gt; Using these tools in isolation means you miss the most powerful workflow: grounded research in NotebookLM feeding into broader analysis in Gemini.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Skipping custom instructions.&lt;/strong&gt; Both Gemini and NotebookLM support per-workspace custom instructions. Setting these up takes minutes and saves hours of course-correcting the AI.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not using Gems for repeatable tasks.&lt;/strong&gt; If you find yourself giving Gemini the same instructions repeatedly, create a Gem and save that context permanently.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about context management strategies for AI tools, research workflows, and agentic systems, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for Claude Code: A Complete Guide for Developers</title><link>https://iceberglakehouse.com/posts/2026-03-context-claude-code/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-claude-code/</guid><description>
Claude Code is a terminal-native agentic coding assistant that lives in your command line and operates directly on your codebase. Unlike chat-based i...</description><pubDate>Sat, 07 Mar 2026 14:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Claude Code is a terminal-native agentic coding assistant that lives in your command line and operates directly on your codebase. Unlike chat-based interfaces where you copy and paste code snippets, Claude Code reads your files, explores your project structure, runs commands, executes tests, and commits changes. Context management in Claude Code is about configuring the agent&apos;s persistent knowledge of your project so it can operate effectively without constant direction.&lt;/p&gt;
&lt;p&gt;This guide covers every context management mechanism in Claude Code, from the foundational CLAUDE.md file to MCP integrations and multi-agent orchestration.&lt;/p&gt;
&lt;h2&gt;How Claude Code Manages Context&lt;/h2&gt;
&lt;p&gt;Claude Code builds its context from several sources, layered from most persistent to most ephemeral:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;CLAUDE.md files&lt;/strong&gt; (permanent project instructions)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MEMORY.md&lt;/strong&gt; (automatically maintained session memory)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MCP server connections&lt;/strong&gt; (live external data)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The codebase itself&lt;/strong&gt; (files, dependencies, project structure)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The current conversation&lt;/strong&gt; (your commands and the agent&apos;s responses)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Command output&lt;/strong&gt; (terminal results, test output, error messages)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The agent combines all of these into a working context that informs how it approaches tasks. The most effective Claude Code users invest time in the persistent layers (CLAUDE.md, MEMORY.md, MCP) so that every conversation starts with a solid foundation.&lt;/p&gt;
&lt;h2&gt;CLAUDE.md: Your Project&apos;s Instruction Manual&lt;/h2&gt;
&lt;p&gt;CLAUDE.md is the primary mechanism for giving Claude Code persistent context about your project. It is a Markdown file that Claude reads at the start of every session.&lt;/p&gt;
&lt;h3&gt;File Locations and Hierarchy&lt;/h3&gt;
&lt;p&gt;Claude Code loads CLAUDE.md files from multiple locations, combining them into a single instruction set:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Location&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;Use For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Global (all projects)&lt;/td&gt;
&lt;td&gt;Personal preferences, universal standards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;./CLAUDE.md&lt;/code&gt; (project root)&lt;/td&gt;
&lt;td&gt;Project-wide&lt;/td&gt;
&lt;td&gt;Architecture, coding standards, testing strategy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;./src/CLAUDE.md&lt;/code&gt; (subdirectory)&lt;/td&gt;
&lt;td&gt;Component-specific&lt;/td&gt;
&lt;td&gt;Module-specific patterns, API conventions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;More specific files supplement more general ones. If your global CLAUDE.md says &amp;quot;use 2-space indentation&amp;quot; but your project CLAUDE.md says &amp;quot;use 4-space indentation,&amp;quot; the project-level instruction takes precedence.&lt;/p&gt;
&lt;h3&gt;What to Include in CLAUDE.md&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# CLAUDE.md

## Project Overview

This is a Python FastAPI application with a React frontend.
Backend: Python 3.12, FastAPI, SQLAlchemy, PostgreSQL 16
Frontend: TypeScript, React 19, Vite 6, Zustand
Testing: pytest (backend), Vitest (frontend)

## Build and Run Commands

- Backend: `uvicorn app.main:app --reload`
- Frontend: `npm run dev`
- Tests: `pytest` (backend), `npm test` (frontend)
- Lint: `ruff check .` (backend), `npm run lint` (frontend)

## Code Conventions

- Use type hints for all function signatures
- Use Pydantic models for API request/response schemas
- Use async functions for all database operations
- Prefer composition over inheritance
- Keep functions under 30 lines; extract helpers for longer logic

## Testing Requirements

- Every new endpoint needs integration tests
- Every utility function needs unit tests
- Mock external services; never hit real APIs in tests
- Use factories (not fixtures) for test data creation

## Architecture Decisions

- We use the repository pattern for database access
- All business logic lives in the service layer, not in route handlers
- Frontend state is managed exclusively through Zustand stores
- API responses follow the JSON:API specification
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;CLAUDE.md Best Practices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Be specific and actionable.&lt;/strong&gt; &amp;quot;Write clean code&amp;quot; is useless. &amp;quot;Functions should have a single responsibility and no side effects&amp;quot; is useful.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Include build and test commands.&lt;/strong&gt; Claude Code will run these commands to verify its work. If it does not know your test command, it cannot validate changes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Document your architecture.&lt;/strong&gt; Tell Claude Code where things live. &amp;quot;Database models are in &lt;code&gt;app/models/&lt;/code&gt;&amp;quot; saves the agent from exploring your entire project structure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use negative constraints.&lt;/strong&gt; &amp;quot;Do not use class-based views&amp;quot; and &amp;quot;Never import directly from internal modules; use the public API&amp;quot; prevent common mistakes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Keep it current.&lt;/strong&gt; An outdated CLAUDE.md with references to deprecated patterns causes more harm than having no CLAUDE.md at all.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;MEMORY.md: Automatic Session Memory&lt;/h2&gt;
&lt;p&gt;MEMORY.md is a file that Claude Code creates and maintains automatically to persist important context across sessions. When you share information that Claude determines is worth remembering (project decisions, your preferences, issue resolutions), it writes that information to MEMORY.md.&lt;/p&gt;
&lt;h3&gt;How MEMORY.md Works&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Claude Code creates &lt;code&gt;~/.claude/MEMORY.md&lt;/code&gt; automatically&lt;/li&gt;
&lt;li&gt;During conversations, when you share important context, Claude offers to save it&lt;/li&gt;
&lt;li&gt;In subsequent sessions, Claude reads MEMORY.md before starting work&lt;/li&gt;
&lt;li&gt;You can also manually edit MEMORY.md to add or remove memories&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;What Gets Stored&lt;/h3&gt;
&lt;p&gt;Typical MEMORY.md entries include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Project preferences you have stated (&amp;quot;I prefer named exports over default exports&amp;quot;)&lt;/li&gt;
&lt;li&gt;Decisions you have made (&amp;quot;We chose Redis for session storage because of its TTL support&amp;quot;)&lt;/li&gt;
&lt;li&gt;Debugging discoveries (&amp;quot;The auth middleware requires the Authorization header in lowercase&amp;quot;)&lt;/li&gt;
&lt;li&gt;Workflow notes (&amp;quot;Always run migrations before testing database changes&amp;quot;)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Managing MEMORY.md&lt;/h3&gt;
&lt;p&gt;Review MEMORY.md periodically. Like any persistent context, stale entries can lead the agent astray. Remove entries that no longer apply and update ones that have changed.&lt;/p&gt;
&lt;p&gt;You can also use the &lt;code&gt;/memory&lt;/code&gt; slash command during a session to view what Claude currently remembers.&lt;/p&gt;
&lt;h2&gt;Slash Commands: Real-Time Context Control&lt;/h2&gt;
&lt;p&gt;Claude Code provides several slash commands for managing context during a session:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/context&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Show all active context sources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/clear&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Clear conversation history (keeps CLAUDE.md and MEMORY.md)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/agent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Spawn a sub-agent for a specific task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/memory&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;View and manage session memories&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/help&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List available commands&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Using /clear Strategically&lt;/h3&gt;
&lt;p&gt;Long sessions accumulate irrelevant context that can degrade Claude Code&apos;s focus. Use &lt;code&gt;/clear&lt;/code&gt; when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You are switching to a different part of the codebase&lt;/li&gt;
&lt;li&gt;The conversation has gotten long and the agent seems confused&lt;/li&gt;
&lt;li&gt;You want to start a focused task without the baggage of previous exchanges&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Note that &lt;code&gt;/clear&lt;/code&gt; preserves your CLAUDE.md and MEMORY.md context. Only the conversation history is reset.&lt;/p&gt;
&lt;h3&gt;Using /agent for Sub-Tasks&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;/agent&lt;/code&gt; command spawns a sub-agent that operates independently with its own context. This is useful for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Exploring a part of the codebase without polluting your main conversation&lt;/li&gt;
&lt;li&gt;Running a time-consuming task (like a full test suite analysis) in parallel&lt;/li&gt;
&lt;li&gt;Dividing a large feature into independent pieces&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;Claude Code supports MCP through the &lt;code&gt;claude mcp&lt;/code&gt; command, allowing you to connect external tools and data sources.&lt;/p&gt;
&lt;h3&gt;Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Add a database MCP server
claude mcp add postgres -- npx @anthropic/mcp-server-postgres

# Add a filesystem MCP server
claude mcp add files -- npx @anthropic/mcp-server-filesystem /path/to/project

# List active MCP servers
claude mcp list

# Remove an MCP server
claude mcp remove postgres
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Practical MCP Use Cases for Developers&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Development database:&lt;/strong&gt; Let Claude Code query your dev database to understand schema, check data state, and verify migrations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Browser testing:&lt;/strong&gt; Connect a Playwright MCP server so Claude Code can verify frontend changes by interacting with a running application.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Git hosting:&lt;/strong&gt; Connect a GitHub or GitLab MCP server for creating pull requests, checking CI status, and reviewing code.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Documentation systems:&lt;/strong&gt; Access internal docs or wikis that provide context not in the codebase.&lt;/p&gt;
&lt;h3&gt;When to Use MCP vs. Direct Commands&lt;/h3&gt;
&lt;p&gt;Claude Code can already run terminal commands. If you just need to see &lt;code&gt;git log&lt;/code&gt; or &lt;code&gt;psql -c &amp;quot;SELECT * FROM users&amp;quot;&lt;/code&gt;, Claude Code can run those directly. MCP is more useful when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The interaction is structured and repeatable (not ad-hoc commands)&lt;/li&gt;
&lt;li&gt;You want Claude to have persistent access to a service across the entire session&lt;/li&gt;
&lt;li&gt;The MCP server provides tools that are safer or more convenient than raw commands&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;External Documents: When to Use PDFs vs. Markdown&lt;/h2&gt;
&lt;h3&gt;For Codebase Context: Always Markdown&lt;/h3&gt;
&lt;p&gt;CLAUDE.md, MEMORY.md, and any reference documents you create for Claude Code should be Markdown. The format is native to Claude Code&apos;s context system, version-controllable, and parses without ambiguity.&lt;/p&gt;
&lt;h3&gt;For External Specifications: Convert When Possible&lt;/h3&gt;
&lt;p&gt;If you have API specifications, design documents, or architecture diagrams in PDF form, consider extracting the relevant sections into Markdown and placing them in your repository. This way Claude Code can access them through normal file reading rather than requiring file upload.&lt;/p&gt;
&lt;h3&gt;For One-Off References&lt;/h3&gt;
&lt;p&gt;If you need Claude Code to reference a specific document during a session, paste the relevant content directly into the conversation. Claude Code&apos;s context window is large enough to handle substantial text inclusions.&lt;/p&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Test-Driven Context Pattern&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Write failing tests that describe the behavior you want&lt;/li&gt;
&lt;li&gt;Tell Claude Code: &amp;quot;Make these tests pass&amp;quot;&lt;/li&gt;
&lt;li&gt;The tests themselves become the context for the implementation&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is one of the most effective strategies because tests are unambiguous specifications. Claude Code does not need to interpret your prose when it has concrete pass/fail criteria.&lt;/p&gt;
&lt;h3&gt;The Progressive Codebase Understanding Pattern&lt;/h3&gt;
&lt;p&gt;When onboarding Claude Code to a new project:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Start with CLAUDE.md covering the basics (stack, structure, commands)&lt;/li&gt;
&lt;li&gt;Ask Claude to explore the codebase and describe what it finds&lt;/li&gt;
&lt;li&gt;Correct any misunderstandings and add clarifications to CLAUDE.md&lt;/li&gt;
&lt;li&gt;Gradually delegate more complex tasks as the context matures&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This iterative approach builds a robust CLAUDE.md faster than trying to write everything from scratch.&lt;/p&gt;
&lt;h3&gt;The Multi-Agent Feature Pattern&lt;/h3&gt;
&lt;p&gt;For large features with independent components:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use &lt;code&gt;/agent&lt;/code&gt; to spawn a sub-agent for each component&lt;/li&gt;
&lt;li&gt;Main agent: coordinates the overall architecture&lt;/li&gt;
&lt;li&gt;Sub-agent 1: implements the database layer&lt;/li&gt;
&lt;li&gt;Sub-agent 2: implements the API endpoints&lt;/li&gt;
&lt;li&gt;Sub-agent 3: implements the frontend components&lt;/li&gt;
&lt;li&gt;Main agent: integrates the results and runs full tests&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each sub-agent operates with focused context, producing better results than one agent trying to build everything sequentially.&lt;/p&gt;
&lt;h3&gt;The Code Review Pattern&lt;/h3&gt;
&lt;p&gt;Use Claude Code as a reviewer before submitting your own PRs:&lt;/p&gt;
&lt;p&gt;&amp;quot;Review the changes in the current branch compared to main. Check for: security issues, performance problems, missing error handling, test coverage gaps, and style guide violations from CLAUDE.md.&amp;quot;&lt;/p&gt;
&lt;p&gt;The persistent CLAUDE.md context means the review applies your project&apos;s specific standards, not generic best practices.&lt;/p&gt;
&lt;h2&gt;When to Choose Claude Code Over Other Tools&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Choose Claude Code over Claude Web/Desktop when:&lt;/strong&gt; Your task is code-centric and benefits from direct file system access, terminal command execution, and test running.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Choose Claude Code over OpenAI Codex when:&lt;/strong&gt; You prefer a terminal-native interactive workflow over Codex&apos;s sandbox-and-PR model, or your project uses the Claude model family.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Choose Claude Code over Cursor or Windsurf when:&lt;/strong&gt; You want a lightweight terminal agent without the overhead of a full IDE, or you work primarily in the terminal.&lt;/p&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;No CLAUDE.md.&lt;/strong&gt; Claude Code still works without one, but it will make assumptions about your project that may not match reality. Ten minutes spent writing CLAUDE.md saves hours of corrections.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stale CLAUDE.md.&lt;/strong&gt; A CLAUDE.md that references a framework you migrated away from six months ago actively misleads the agent. Keep it current.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not using /clear.&lt;/strong&gt; Long sessions accumulate noise. Clear the conversation when switching tasks or when the agent seems to be losing focus.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Over-relying on MCP.&lt;/strong&gt; If Claude Code can accomplish a task through direct file access and terminal commands, adding an MCP server is unnecessary overhead.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring MEMORY.md.&lt;/strong&gt; Review it periodically. Claude Code&apos;s auto-generated memories are usually accurate, but occasionally they capture outdated or incorrect information.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Micro-managing the agent.&lt;/strong&gt; Claude Code is designed for autonomous task execution. Give it a clear objective, ensure the context is correct, and let it work. Interrupting with constant corrections breaks the agent&apos;s flow.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about working effectively with AI coding agents and context management strategies, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for Claude CoWork: A Complete Guide for Knowledge Workers</title><link>https://iceberglakehouse.com/posts/2026-03-context-claude-cowork/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-claude-cowork/</guid><description>
Claude CoWork represents a fundamentally different approach to AI context management. Unlike chat interfaces where you send messages and receive resp...</description><pubDate>Sat, 07 Mar 2026 13:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Claude CoWork represents a fundamentally different approach to AI context management. Unlike chat interfaces where you send messages and receive responses, CoWork is an autonomous agent that works on your local machine, reads and writes files directly, and executes multi-step tasks with minimal supervision. For knowledge workers who spend their days in documents, spreadsheets, and presentations, CoWork replaces the constant back-and-forth of copy-paste workflows with direct delegation.&lt;/p&gt;
&lt;p&gt;This guide covers how to manage context effectively in CoWork, from setting up folder-level instructions to creating reusable workflows that run on schedule.&lt;/p&gt;
&lt;h2&gt;How CoWork Differs from Other Claude Interfaces&lt;/h2&gt;
&lt;p&gt;CoWork runs as part of the Claude Desktop application but operates in a distinct mode. The differences matter for context management:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Claude Web/Desktop Chat&lt;/th&gt;
&lt;th&gt;Claude CoWork&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Interaction model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Conversational (you send, it responds)&lt;/td&gt;
&lt;td&gt;Autonomous (you delegate, it executes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;File access&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Upload or MCP server&lt;/td&gt;
&lt;td&gt;Direct local read/write&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output location&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;In the chat window&lt;/td&gt;
&lt;td&gt;On your file system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Task duration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Minutes (conversational)&lt;/td&gt;
&lt;td&gt;Minutes to hours (autonomous)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scheduling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual only&lt;/td&gt;
&lt;td&gt;Scheduled or on-demand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sub-agents&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (parallel task decomposition)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Because CoWork operates autonomously on your local files, context management is less about what you say in a conversation and more about how you structure your file system, instructions, and task definitions.&lt;/p&gt;
&lt;h2&gt;Thinking About Context for Autonomous Tasks&lt;/h2&gt;
&lt;p&gt;When delegating to CoWork, the context equation changes. In a chat, you can course-correct in real time. With CoWork, you define the context upfront and the agent executes on its own. This means your context needs to be more complete and more explicit than in conversational interfaces.&lt;/p&gt;
&lt;h3&gt;Before Delegating, Ask:&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Does the task have a clear, verifiable end state?&lt;/strong&gt; &amp;quot;Organize these files by date&amp;quot; is clear. &amp;quot;Make these files better&amp;quot; is not.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Can I describe the success criteria in writing?&lt;/strong&gt; If you cannot articulate what &amp;quot;done&amp;quot; looks like, CoWork will struggle too.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Does CoWork have access to everything it needs?&lt;/strong&gt; Files, folders, reference material, and formatting instructions should all be accessible before you start.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;The Delegation Spectrum&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Simple delegation (minimal context):&lt;/strong&gt; &amp;quot;Create a summary of every PDF in the /reports folder and save it as summary.md&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Moderate delegation:&lt;/strong&gt; &amp;quot;Generate a weekly status report using the data in /projects/metrics.csv. Follow the format in /templates/weekly-report.md. Save the output to /reports/week-12-report.md&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Complex delegation:&lt;/strong&gt; &amp;quot;Research the competitive landscape for product X by reading the documents in /research/competitors/. Create a presentation in PowerPoint format that covers market positioning, pricing comparison, and feature gaps. Use the company brand guidelines in /brand/style-guide.pdf for formatting.&amp;quot;&lt;/p&gt;
&lt;p&gt;Each level requires progressively more context, but all of it is provided through file access and instructions rather than conversation.&lt;/p&gt;
&lt;h2&gt;Global and Folder Instructions&lt;/h2&gt;
&lt;p&gt;CoWork uses a layered instruction system that lets you set context at different scopes.&lt;/p&gt;
&lt;h3&gt;Global Instructions&lt;/h3&gt;
&lt;p&gt;Global instructions apply across all CoWork tasks regardless of which folder or project you are working in. Set these for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your preferred writing style and tone&lt;/li&gt;
&lt;li&gt;Output format preferences (bullet points vs. prose, heading structure)&lt;/li&gt;
&lt;li&gt;General constraints (&amp;quot;Always use metric units,&amp;quot; &amp;quot;Write in American English&amp;quot;)&lt;/li&gt;
&lt;li&gt;Your role and expertise level&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These function similarly to Custom Instructions in ChatGPT but are specific to CoWork&apos;s autonomous execution mode.&lt;/p&gt;
&lt;h3&gt;Folder Instructions&lt;/h3&gt;
&lt;p&gt;Folder-level instructions apply when CoWork operates within a specific directory. This is where context management gets powerful. You can create different instruction sets for different projects:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/work/project-alpha/&lt;/code&gt; might have instructions about project-specific terminology and formatting&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/work/blog-drafts/&lt;/code&gt; might have instructions about your blog style guide and target audience&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/work/financial-reports/&lt;/code&gt; might have instructions about compliance requirements and number formatting&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Folder instructions override global instructions when they conflict, giving you precise control over CoWork&apos;s behavior in each context.&lt;/p&gt;
&lt;h3&gt;Writing Effective Instructions&lt;/h3&gt;
&lt;p&gt;Focus on what CoWork needs to know to complete tasks autonomously:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;## Project Context

This folder contains marketing materials for Product X.
Target audience: enterprise IT decision-makers.
Tone: professional, authoritative, not salesy.

## File Organization

- /drafts/ contains work-in-progress documents
- /final/ contains approved, publication-ready content
- /assets/ contains images, charts, and data files
- /templates/ contains formatting templates

## Quality Standards

- All claims must be supported by data from the /assets/ folder
- Final documents must follow the template in /templates/standard.docx
- Run a readability check: target Flesch-Kincaid grade 10-12
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;MCP Server Integration&lt;/h2&gt;
&lt;p&gt;CoWork supports MCP (Model Context Protocol) through the Claude Desktop application&apos;s MCP configuration. MCP servers expand what CoWork can access beyond the local file system.&lt;/p&gt;
&lt;h3&gt;Useful MCP Servers for Knowledge Workers&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Google Drive or OneDrive:&lt;/strong&gt; Access cloud-stored documents without downloading them first&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Notion or Confluence:&lt;/strong&gt; Read from and write to your team&apos;s knowledge base&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Slack:&lt;/strong&gt; Pull conversation context or post updates about completed tasks&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Calendar:&lt;/strong&gt; Check scheduling context when preparing meeting materials&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Email:&lt;/strong&gt; Draft responses based on incoming email content&lt;/p&gt;
&lt;h3&gt;When MCP Adds Value for CoWork&lt;/h3&gt;
&lt;p&gt;MCP is most valuable when CoWork needs information from systems outside your local file system. If you are creating a report that combines local data with information from your company wiki, an MCP server for that wiki lets CoWork access both sources in a single task.&lt;/p&gt;
&lt;p&gt;However, for purely local tasks (organizing files, generating documents from local data, processing spreadsheets), MCP adds unnecessary complexity. If the data is already on your machine, direct file access is simpler and faster.&lt;/p&gt;
&lt;h2&gt;Scheduled Tasks: Context That Runs Automatically&lt;/h2&gt;
&lt;p&gt;One of CoWork&apos;s distinctive features is task scheduling. You can define tasks that run at specific intervals (daily, weekly, monthly), and CoWork executes them with the same context every time.&lt;/p&gt;
&lt;h3&gt;Use Cases for Scheduled Tasks&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Weekly report generation:&lt;/strong&gt; Compile data from multiple sources into a formatted report every Monday&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Daily email drafts:&lt;/strong&gt; Prepare responses to routine communications based on templates&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monthly file organization:&lt;/strong&gt; Sort and archive documents that have accumulated in download or inbox folders&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data processing:&lt;/strong&gt; Transform incoming CSV exports into formatted spreadsheets at regular intervals&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Context for Scheduled Tasks&lt;/h3&gt;
&lt;p&gt;Scheduled tasks need to be fully self-contained. The context must include:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Where to find inputs&lt;/strong&gt; (file paths, folders to scan)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What to do with them&lt;/strong&gt; (the processing logic)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Where to put outputs&lt;/strong&gt; (destination paths)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What quality checks to apply&lt;/strong&gt; (validation rules)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What to do when something unexpected happens&lt;/strong&gt; (error handling)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Because you are not present during execution, the instructions must anticipate edge cases. For example: &amp;quot;If no new files are found in /inbox/, skip processing and do not create an empty report.&amp;quot;&lt;/p&gt;
&lt;h2&gt;Sub-Agent Delegation&lt;/h2&gt;
&lt;p&gt;CoWork can decompose complex tasks into subtasks and execute them in parallel using sub-agents. This is particularly useful for tasks that involve independent workstreams.&lt;/p&gt;
&lt;h3&gt;How Sub-Agents Improve Context Management&lt;/h3&gt;
&lt;p&gt;Instead of providing one massive context for a complex task, CoWork breaks it into smaller, focused contexts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Sub-agent 1:&lt;/strong&gt; &amp;quot;Summarize the financial data in /data/q3-financials.csv&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sub-agent 2:&lt;/strong&gt; &amp;quot;Extract key quotes from the customer interviews in /research/interviews/&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sub-agent 3:&lt;/strong&gt; &amp;quot;Create a chart comparing year-over-year growth using the data in /data/growth.csv&amp;quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each sub-agent gets a focused context, which typically produces better results than one agent trying to handle everything.&lt;/p&gt;
&lt;h3&gt;Monitoring Sub-Agent Progress&lt;/h3&gt;
&lt;p&gt;CoWork surfaces its reasoning and progress as it works. You can observe the plan, see which sub-agents are active, and intervene if something goes off track. This transparency is a context management feature itself because it lets you assess whether the agent&apos;s understanding matches your intent before it completes the task.&lt;/p&gt;
&lt;h2&gt;Working with External Documents&lt;/h2&gt;
&lt;h3&gt;PDFs&lt;/h3&gt;
&lt;p&gt;CoWork can read PDFs directly from your file system. Use PDFs for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Published specifications and standards&lt;/li&gt;
&lt;li&gt;Research papers and reports from external sources&lt;/li&gt;
&lt;li&gt;Contracts, legal documents, or compliance materials&lt;/li&gt;
&lt;li&gt;Documents you received from others in PDF format&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Markdown Files&lt;/h3&gt;
&lt;p&gt;CoWork excels with Markdown because the structure is unambiguous. Use Markdown for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your own notes, outlines, and instructions&lt;/li&gt;
&lt;li&gt;Style guides and formatting templates&lt;/li&gt;
&lt;li&gt;Context documents you create specifically for CoWork&lt;/li&gt;
&lt;li&gt;Any document you plan to update frequently&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;The Hybrid Strategy&lt;/h3&gt;
&lt;p&gt;Keep critical reference material as Markdown in well-organized project folders. Use PDFs for external documents you cannot control. This gives CoWork the cleanest possible context for the documents you author and reasonable access to everything else.&lt;/p&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Template-Driven Workflow&lt;/h3&gt;
&lt;p&gt;Create a template folder with examples of your desired output format. In your folder instructions, reference these templates. CoWork will pattern-match against them when generating new content.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/project/
  /templates/
    blog-post-template.md
    report-template.md
    email-template.md
  /instructions.md (folder instructions referencing templates)
  /output/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This approach gives CoWork concrete examples of &amp;quot;what good looks like&amp;quot; for every type of output it might produce.&lt;/p&gt;
&lt;h3&gt;The Progressive Delegation Pattern&lt;/h3&gt;
&lt;p&gt;Start with simple tasks to build confidence in CoWork&apos;s understanding of your context:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Week 1:&lt;/strong&gt; File organization and simple summaries&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Week 2:&lt;/strong&gt; Document generation from templates&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Week 3:&lt;/strong&gt; Multi-source research and synthesis&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Week 4:&lt;/strong&gt; Complex deliverables with scheduled execution&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each phase lets you refine your instructions based on how CoWork interprets them.&lt;/p&gt;
&lt;h3&gt;The Quality Gate Pattern&lt;/h3&gt;
&lt;p&gt;For high-stakes outputs, set up a two-stage workflow:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Stage 1:&lt;/strong&gt; CoWork generates a draft and saves it to &lt;code&gt;/drafts/&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stage 2:&lt;/strong&gt; You review the draft and provide feedback&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stage 3:&lt;/strong&gt; CoWork revises based on your feedback and saves to &lt;code&gt;/final/&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This pattern combines autonomous execution with human review, giving you the efficiency of delegation without sacrificing quality control.&lt;/p&gt;
&lt;h2&gt;When to Use CoWork vs. Other Claude Interfaces&lt;/h2&gt;
&lt;p&gt;CoWork is not always the right choice. Here is how it compares for different scenarios:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use CoWork when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The task involves creating, transforming, or organizing files on your local machine&lt;/li&gt;
&lt;li&gt;The work can be defined upfront with clear success criteria&lt;/li&gt;
&lt;li&gt;You want to delegate entirely and come back to a finished result&lt;/li&gt;
&lt;li&gt;The task is repeatable and benefits from scheduling&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use Claude Web when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You want an interactive conversation to explore ideas or get feedback&lt;/li&gt;
&lt;li&gt;The task is primarily knowledge-based (brainstorming, research questions, analysis)&lt;/li&gt;
&lt;li&gt;You need artifacts like code demos or documents that persist in a conversation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use Claude Desktop chat when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You need MCP access to external services during an interactive conversation&lt;/li&gt;
&lt;li&gt;You want Computer Use to interact with desktop applications&lt;/li&gt;
&lt;li&gt;You need the conversational interaction model with live external data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use Claude Code when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You are working on a software codebase&lt;/li&gt;
&lt;li&gt;You need the agent to navigate code, run tests, and make pull requests&lt;/li&gt;
&lt;li&gt;You want terminal-level interaction with coding-specific tools&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Vague task definitions.&lt;/strong&gt; &amp;quot;Make these documents better&amp;quot; gives CoWork nothing to work with. Specify what &amp;quot;better&amp;quot; means: more concise, better formatted, restructured for a different audience, updated with new data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Skipping folder instructions.&lt;/strong&gt; Without instructions, CoWork uses only global context and its general training. Folder instructions are what make CoWork effective for your specific workflow.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Over-scoping tasks.&lt;/strong&gt; A single task that says &amp;quot;create an entire marketing strategy&amp;quot; is too broad. Break it into research, analysis, drafting, and formatting phases.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not reviewing outputs.&lt;/strong&gt; CoWork runs autonomously, but that does not mean blindly accepting its output. Always review, especially for scheduled tasks that run without your active oversight.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring the file system.&lt;/strong&gt; CoWork works with files. If your files are disorganized, CoWork&apos;s output will be disorganized. Invest in clean folder structures before delegating.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Underusing sub-agents.&lt;/strong&gt; If a task has independent workstreams, let CoWork decompose it. Trying to force everything into a single linear execution path is slower and produces worse results.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about working effectively with AI agents and context management strategies, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Agentic Analytics on the Apache Lakehouse</title><link>https://iceberglakehouse.com/posts/2026-03-07-agentic-analytics/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-07-agentic-analytics/</guid><description>
_Read the complete Open Source and the Lakehouse series:_

- [Part 1: Apache Software Foundation: History, Purpose, and Process](/posts/2026-03-07-ap...</description><pubDate>Sat, 07 Mar 2026 12:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;em&gt;Read the complete Open Source and the Lakehouse series:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-software-foundation/&quot;&gt;Part 1: Apache Software Foundation: History, Purpose, and Process&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-parquet/&quot;&gt;Part 2: What is Apache Parquet?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-iceberg/&quot;&gt;Part 3: What is Apache Iceberg?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-polaris/&quot;&gt;Part 4: What is Apache Polaris?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-arrow/&quot;&gt;Part 5: What is Apache Arrow?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-assembling-apache-lakehouse/&quot;&gt;Part 6: Assembling the Apache Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-agentic-analytics/&quot;&gt;Part 7: Agentic Analytics on the Apache Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you grant a Large Language Model direct access to a raw Amazon S3 bucket filled with Parquet files, it will fail to answer your business questions. AI agents possess immense processing power, but they lack inherent business knowledge.&lt;/p&gt;
&lt;p&gt;To execute agentic analytics safely and accurately, an AI agent requires three things: deep business context, universal governed access, and interactive speed. The Apache open-source data lakehouse stack provides the foundation for those requirements, but you must bridge the gap between raw data and machine intelligence.&lt;/p&gt;
&lt;h2&gt;The Hallucination Trap&lt;/h2&gt;
&lt;p&gt;Consider a raw data table containing a column named &lt;code&gt;cst_act_flg&lt;/code&gt;. A human analyst working at the company for five years knows this stands for &amp;quot;Customer Account Flag.&amp;quot; An AI agent does not. If a user asks the agent to &amp;quot;Show me active customers,&amp;quot; the agent guesses meaning from the abbreviation. Guessing leads directly to hallucinations.&lt;/p&gt;
&lt;p&gt;Raw data lakes optimize for machine storage, not semantic understanding. To prevent hallucinations, you must teach the AI your specific business language.&lt;/p&gt;
&lt;h2&gt;Teaching AI with the Semantic Layer&lt;/h2&gt;
&lt;p&gt;The semantic layer acts as a translation layer between technical schemas and business logic. It provides the context that transforms a generic LLM into an accurate agentic analyst.&lt;/p&gt;
&lt;p&gt;In the Dremio platform, the Semantic Layer is built through Virtual Datasets. Engineers create logical views that rename &lt;code&gt;cst_act_flg&lt;/code&gt; to &lt;code&gt;Active_Customer_Status&lt;/code&gt;. Dremio takes this a step further by using generative AI to automatically document these datasets. By sampling table data and analyzing schemas, Dremio generates detailed Wikis and Tags for your Apache Iceberg tables.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-lakehouse/semantic-layer-translation.png&quot; alt=&quot;The Semantic layer translating raw Iceberg datasets into AI-ready business context&quot;&gt;&lt;/p&gt;
&lt;p&gt;When an AI agent receives a user prompt, it first reads these semantic Wikis. The documentation effectively teaches the AI agent the definitions of your specific business metrics before it attempts to write SQL, ensuring remarkably high accuracy.&lt;/p&gt;
&lt;h2&gt;Autonomous Reflections: AI Accelerating AI&lt;/h2&gt;
&lt;p&gt;Agentic analytics creates a massive new compute burden. When executives and business lines can ask natural language questions, the volume of unpredictable SQL queries skyrockets. Human database administrators cannot manually tune indexes or write materialized views fast enough to support this scale.&lt;/p&gt;
&lt;p&gt;You need AI to accelerate AI. Dremio tackles this with Autonomous Reflections. The platform continuously monitors query patterns: originating from both humans and AI agents, over a seven-day rolling window.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-lakehouse/autonomous-reflections.png&quot; alt=&quot;Autonomous Reflections lifecycle showing query monitoring, background creation, and query acceleration&quot;&gt;&lt;/p&gt;
&lt;p&gt;When Dremio identifies a bottleneck, it automatically acts. It creates, maintains, and drops &amp;quot;Reflections&amp;quot; (pre-computed, highly optimized Iceberg materializations of the data) entirely in the background. Performance becomes an automated byproduct of the architecture, rather than a manual engineering chore.&lt;/p&gt;
&lt;h2&gt;Text-to-SQL and Native AI Functions&lt;/h2&gt;
&lt;p&gt;With context and speed resolved, users can interact directly with the agentic interfaces. Dremio includes a built-in AI Agent capable of discovering datasets, exploring relationships, and visualizing answers. Because the agent is grounded in the AI Semantic Layer and the open Apache Polaris catalog, Text-to-SQL translations actually hit the right tables.&lt;/p&gt;
&lt;p&gt;But agentic analytics is not limited to text-to-SQL. Dremio exposes LLM capabilities directly inside the SQL engine itself.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-lakehouse/ai-sql-functions.png&quot; alt=&quot;AI SQL Function executing inside a Dremio query against Parquet data&quot;&gt;&lt;/p&gt;
&lt;p&gt;Using native AI SQL functions like &lt;code&gt;AI_CLASSIFY&lt;/code&gt; or &lt;code&gt;AI_GENERATE&lt;/code&gt;, analysts can run sentiment analysis on unstructured product reviews directly within a standard &lt;code&gt;SELECT&lt;/code&gt; statement. This eliminates the need to export data into external Python pipelines just to leverage modern generative AI models.&lt;/p&gt;
&lt;h2&gt;The Fully Realized Agentic Lakehouse&lt;/h2&gt;
&lt;p&gt;This 7-part series mapped the evolution of the modern data architecture.&lt;/p&gt;
&lt;p&gt;It starts with the strict vendor-neutral governance of the Apache Software Foundation. You store data highly compressed using Apache Parquet. You map those files into relational, transactional tables using Apache Iceberg. You expose those tables to multiple engines securely using Apache Polaris. You execute queries with zero-copy, in-memory speed using Apache Arrow.&lt;/p&gt;
&lt;p&gt;Finally, you layer the semantic context and Autonomous Reflections over that stack to create the Agentic Lakehouse.&lt;/p&gt;
&lt;p&gt;You can build this stack yourself, or you can use a unified platform. Deploy agentic analytics directly on Apache Iceberg data with no pipelines and no added overhead. &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>What is Apache Arrow? Erasing the Serialization Tax</title><link>https://iceberglakehouse.com/posts/2026-03-07-apache-arrow/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-07-apache-arrow/</guid><description>
_Read the complete Open Source and the Lakehouse series:_

- [Part 1: Apache Software Foundation: History, Purpose, and Process](/posts/2026-03-07-ap...</description><pubDate>Sat, 07 Mar 2026 12:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;em&gt;Read the complete Open Source and the Lakehouse series:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-software-foundation/&quot;&gt;Part 1: Apache Software Foundation: History, Purpose, and Process&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-parquet/&quot;&gt;Part 2: What is Apache Parquet?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-iceberg/&quot;&gt;Part 3: What is Apache Iceberg?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-polaris/&quot;&gt;Part 4: What is Apache Polaris?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-arrow/&quot;&gt;Part 5: What is Apache Arrow?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-assembling-apache-lakehouse/&quot;&gt;Part 6: Assembling the Apache Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-agentic-analytics/&quot;&gt;Part 7: Agentic Analytics on the Apache Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you pull a million records from a database into a Python notebook, the query runs instantly, but the transfer feels endlessly slow. Your compute engine wastes the majority of that time quietly translating data layouts.&lt;/p&gt;
&lt;p&gt;Historically, moving data between two analytical systems required paying a massive &amp;quot;serialization tax.&amp;quot; Apache Arrow eliminates that tax by establishing a universal, open-source standard for how computer memory holds columnar data.&lt;/p&gt;
&lt;h2&gt;The Hidden Cost of Moving Data&lt;/h2&gt;
&lt;p&gt;When an analytical system queries legacy architectures via JDBC or ODBC, it encounters a severe bottleneck. The database holds data in its own proprietary layout. To send the data over a network, the database must serialize it - converting it into a generic row-based format like a JSON array or a proprietary buffer stream.&lt;/p&gt;
&lt;p&gt;When the receiving system (like a pandas DataFrame or a Spark cluster) catches the stream, it must deserialize the rows. It reads the row, pulls out the individual strings and integers, and places them into its own internal columnar arrays for processing.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-lakehouse/serialization-tax.png&quot; alt=&quot;Diagram showing the serialization tax burning CPU cycles while translating data between languages&quot;&gt;&lt;/p&gt;
&lt;p&gt;This cycle of formatting, converting, and parsing consumes up to 80% of the CPU time in data workflows. It slows down queries, burns compute credits, and bottlenecks machine learning pipelines.&lt;/p&gt;
&lt;h2&gt;The Standardized In-Memory Format&lt;/h2&gt;
&lt;p&gt;Apache Arrow changes the physics of data movement. While Apache Parquet defines how to store columnar data on a slow hard drive, Arrow defines how to structure columnar data inside high-speed RAM.&lt;/p&gt;
&lt;p&gt;Arrow provides a standardized, language-agnostic, in-memory columnar format. Whether your system uses Java, Python, C++, or Rust, it structures the data identically in memory. Because the format is columnar, it natively supports vectorization. Modern CPUs can use Single Instruction, Multiple Data (SIMD) hardware acceleration to process entire chunks of Arrow arrays in a single clock cycle.&lt;/p&gt;
&lt;h2&gt;Zero-Copy Sharing&lt;/h2&gt;
&lt;p&gt;Standardizing the memory layout unlocks Arrow&apos;s most powerful trait: Zero-Copy data sharing.&lt;/p&gt;
&lt;p&gt;Imagine a Java-based query engine and a Python-based data science tool running on the same machine. In a pre-Arrow world, the Java tool translates its data to a middle format, hands it to Python, and Python copies it into a new memory space. It doubles the memory footprint and wastes time.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-lakehouse/zero-copy-arrow.png&quot; alt=&quot;Zero-Copy architecture showing two different languages pointing to the exact same memory buffer&quot;&gt;&lt;/p&gt;
&lt;p&gt;With Apache Arrow, both tools understand the exact same memory layout. The Java engine creates an Arrow buffer in RAM. When Python asks for the data, Java simply hands Python the memory address pointer. Python begins reading the data instantly. Zero serialization. Zero copying.&lt;/p&gt;
&lt;h2&gt;Taking Flight: Arrow over the Network&lt;/h2&gt;
&lt;p&gt;Arrow&apos;s speed is not restricted to single machines. The project introduced Arrow Flight, a high-performance Remote Procedure Call (RPC) protocol for transmitting large datasets across networks.&lt;/p&gt;
&lt;p&gt;Instead of converting data to REST or row-based streams, Arrow Flight transports the native Arrow memory buffers directly over the wire. The receiving client gets the buffer and immediately begins executing analytics on it.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-lakehouse/arrow-flight-rpc.png&quot; alt=&quot;Arrow Flight RPC versus traditional REST/ODBC protocols over a network&quot;&gt;&lt;/p&gt;
&lt;p&gt;To finalize the death of the serialization tax, the Apache Arrow community created ADBC (Arrow Database Connectivity). ADBC replaces legacy JDBC and ODBC drivers with an API standard explicitly designed for columnar analytics. ADBC allows databases to deliver native Arrow streams directly to clients, bypassing row-conversion entirely.&lt;/p&gt;
&lt;h2&gt;Arrow on the Lakehouse&lt;/h2&gt;
&lt;p&gt;Apache Arrow is the execution memory moving through the central nervous system of the lakehouse.&lt;/p&gt;
&lt;p&gt;By stacking Parquet for storage, Iceberg for tables, Polaris for metadata routing, and Arrow for memory processing, you create an open data architecture capable of outperforming expensive proprietary data warehouses.&lt;/p&gt;
&lt;p&gt;Dremio co-created Apache Arrow. It uses Arrow natively as its internal execution engine to eliminate the serialization tax that slows down traditional platforms. &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; to query your object storage with zero-copy analytics.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>What is Apache Iceberg? The Table Format Revolution</title><link>https://iceberglakehouse.com/posts/2026-03-07-apache-iceberg/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-07-apache-iceberg/</guid><description>
_Read the complete Open Source and the Lakehouse series:_

- [Part 1: Apache Software Foundation](/posts/2026-03-07-apache-software-foundation/)
- [P...</description><pubDate>Sat, 07 Mar 2026 12:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;em&gt;Read the complete Open Source and the Lakehouse series:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-software-foundation/&quot;&gt;Part 1: Apache Software Foundation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-parquet/&quot;&gt;Part 2: What is Apache Parquet?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-iceberg/&quot;&gt;Part 3: What is Apache Iceberg?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-polaris/&quot;&gt;Part 4: What is Apache Polaris?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-arrow/&quot;&gt;Part 5: What is Apache Arrow?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-assembling-apache-lakehouse/&quot;&gt;Part 6: Assembling the Apache Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-agentic-analytics/&quot;&gt;Part 7: Agentic Analytics on the Apache Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you drop ten thousand Parquet files into an S3 bucket, you have a data swamp. You do not have a database. To run SQL queries against those files safely, your engine needs to know exactly which files belong to which table, what the columns are, and which files to ignore. Historically, Apache Hive solved this by tracking directories. Apache Iceberg solves this by tracking files.&lt;/p&gt;
&lt;p&gt;That shift from directory-listing to file-level metadata fundamentally changes how organizations scale analytics. Iceberg brings the reliability of a transactional database to cloud object storage.&lt;/p&gt;
&lt;h2&gt;The Directory Listing Bottleneck&lt;/h2&gt;
&lt;p&gt;Legacy data architectures treated cloud storage like a local hard drive. If an engine like Hive wanted to read a table, it asked the cloud provider to list all the files inside a specific directory.&lt;/p&gt;
&lt;p&gt;Listing millions of files in Amazon S3 or Google Cloud Storage takes an incredibly long time. Worse, cloud providers aggressively throttle high-frequency listing requests. When concurrent writers update a heavily partitioned Hive table, metadata synchronization operations cause readers to see inconsistent, partial data. Scaling meant hitting a hard wall.&lt;/p&gt;
&lt;p&gt;Iceberg architects recognized that the file system is the wrong place to store database state. They moved the state into a dedicated metadata tree.&lt;/p&gt;
&lt;h2&gt;The Iceberg Metadata Tree Architecture&lt;/h2&gt;
&lt;p&gt;When an engine queries an Iceberg table, it never asks S3 to list directories. File discovery becomes an instant, &lt;code&gt;O(1)&lt;/code&gt; metadata lookup. The architecture works through a strict hierarchy of pointers.&lt;/p&gt;
&lt;p&gt;The query begins at the &lt;strong&gt;Catalog&lt;/strong&gt;, which holds a single pointer to the current &lt;code&gt;metadata.json&lt;/code&gt; file. This ensures atomic commits; whichever engine successfully updates the catalog pointer wins the transaction. The &lt;code&gt;metadata.json&lt;/code&gt; tracks the table schema and points to a &lt;strong&gt;Manifest List&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-lakehouse/iceberg-metadata-tree.png&quot; alt=&quot;The Iceberg Metadata Tree showing the path from Catalog down to Data Files&quot;&gt;&lt;/p&gt;
&lt;p&gt;The Manifest List acts as a table of contents for a specific point in time (a snapshot). It points to multiple &lt;strong&gt;Manifest Files&lt;/strong&gt;. Finally, these Manifest Files contain the explicit paths to the individual Parquet data files, along with statistics like minimum and maximum values for every column.&lt;/p&gt;
&lt;p&gt;This strict tree structure means the engine knows exactly which Parquet files it needs to read before touching the raw data.&lt;/p&gt;
&lt;h2&gt;Schema and Partition Evolution&lt;/h2&gt;
&lt;p&gt;Data shapes change. In traditional data lakes, renaming a column or changing a partition strategy required a total table rewrite. Iceberg executes these changes in milliseconds as metadata operations.&lt;/p&gt;
&lt;p&gt;Iceberg achieves Schema Evolution by assigning a unique ID to every column. It tracks schema changes against the ID, not the string name. If you delete a column named &lt;code&gt;user_id&lt;/code&gt; and create a new column named &lt;code&gt;user_id&lt;/code&gt;, Iceberg knows they are entirely different fields. You can add, drop, rename, and reorder columns with zero side effects on existing files.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-lakehouse/iceberg-schema-evolution.png&quot; alt=&quot;Diagram showing Schema Evolution mapping unique column IDs to file structures over time&quot;&gt;&lt;/p&gt;
&lt;p&gt;Similarly, Iceberg features &amp;quot;hidden partitioning&amp;quot;. Engineers do not have to create physically derived columns just to partition data (e.g., extracting the year from a timestamp). Iceberg tracks the partition logic entirely in metadata. If you decide to change a table from monthly partitioning to daily partitioning, old data remains partitioned by month, and new data partitions by day. The engine handles the difference transparently.&lt;/p&gt;
&lt;h2&gt;Time Travel and Atomic Snapshots&lt;/h2&gt;
&lt;p&gt;Because Iceberg uses a tree of files where data is never updated in place, every write operation creates a brand new, immutable snapshot of the table.&lt;/p&gt;
&lt;p&gt;When you run an &lt;code&gt;UPDATE&lt;/code&gt; statement, Iceberg writes a new Parquet file containing the updated records, creates a new Manifest pointing to the new data, and generates a new Manifest List. The previous snapshot remains completely intact.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-lakehouse/iceberg-time-travel.png&quot; alt=&quot;Diagram showing Time Travel snapshots pointing an overlapping set of underlying Parquet files&quot;&gt;&lt;/p&gt;
&lt;p&gt;This architecture unlocks Time Travel. Analysts can append &lt;code&gt;FOR SYSTEM_TIME AS OF&lt;/code&gt; to their SQL queries to read previous table states. If a faulty pipeline writes bad data, you do not need to rebuild the table from backups. You simply roll back the catalog pointer to the previous, healthy snapshot. Time travel does not duplicate data; the metadata simply points back to the underlying files that were valid at that exact moment.&lt;/p&gt;
&lt;h2&gt;Scaling the Open Source Lakehouse&lt;/h2&gt;
&lt;p&gt;Apache Iceberg provides the structure necessary to treat raw Parquet files like high-performance relational tables. However, a table format alone is incomplete. You need a centralized catalog mechanism to manage the root pointers, enforce security access, and resolve interoperability between multiple query engines.&lt;/p&gt;
&lt;p&gt;That requirement leads directly to Apache Polaris, the open catalog standard designed to unify the Iceberg ecosystem.&lt;/p&gt;
&lt;p&gt;Dremio executes natively against Iceberg tables, managing the metadata optimization lifecycle automatically. To see Iceberg transactions and time travel in action without building infrastructure, &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;try Dremio Cloud free for 30 days&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>What is Apache Parquet? Columns, Encoding, and Performance</title><link>https://iceberglakehouse.com/posts/2026-03-07-apache-parquet/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-07-apache-parquet/</guid><description>
_Read the complete Open Source and the Lakehouse series:_

- [Part 1: Apache Software Foundation: History, Purpose, and Process](/posts/2026-03-07-ap...</description><pubDate>Sat, 07 Mar 2026 12:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;em&gt;Read the complete Open Source and the Lakehouse series:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-software-foundation/&quot;&gt;Part 1: Apache Software Foundation: History, Purpose, and Process&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-parquet/&quot;&gt;Part 2: What is Apache Parquet?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-iceberg/&quot;&gt;Part 3: What is Apache Iceberg?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-polaris/&quot;&gt;Part 4: What is Apache Polaris?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-arrow/&quot;&gt;Part 5: What is Apache Arrow?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-assembling-apache-lakehouse/&quot;&gt;Part 6: Assembling the Apache Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-agentic-analytics/&quot;&gt;Part 7: Agentic Analytics on the Apache Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you ask a data analyst to calculate the average transaction amount for the month of July using a massive CSV file, the compute engine must read every single line of that file. It reads the customer name, the address, the item SKUs, and the timestamps, just to find the single column it actually needs. At the petabyte scale, this row-based reading pattern guarantees slow analytics and high compute bills.&lt;/p&gt;
&lt;p&gt;In 2013, engineers at Twitter and Cloudera collaborated to solve this fundamental storage bottleneck. Inspired by Google&apos;s Dremel paper on querying nested data, they created Apache Parquet. Since becoming a top-level project at the Apache Software Foundation in 2015, Parquet has emerged as the baseline storage format for the modern data lakehouse.&lt;/p&gt;
&lt;h2&gt;The Columnar Architecture of Parquet&lt;/h2&gt;
&lt;p&gt;Unlike CSV or JSON files that store data row by row, Apache Parquet heavily reorganizes data horizontally to support parallel analytics.&lt;/p&gt;
&lt;p&gt;When a query engine writes a Parquet file, it horizontally slices the table into &amp;quot;Row Groups&amp;quot; (typically between 128 MB and 1 GB in size). Within each row group, the data is physically stored column by column. A &amp;quot;Column Chunk&amp;quot; holds all the values for a single column within that row group. Finally, the column chunk is split into smaller &amp;quot;Pages,&amp;quot; which serve as the base unit for compression.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-lakehouse/row-vs-columnar-storage.png&quot; alt=&quot;Diagram showing Row-Based vs Column-Based physical storage on disk&quot;&gt;&lt;/p&gt;
&lt;p&gt;This architecture immediately solves the CSV problem through &amp;quot;Column Pruning.&amp;quot; If you run a &lt;code&gt;SELECT&lt;/code&gt; statement targeting only the transaction amount, the query engine completely ignores the chunks containing addresses and names. It only reads the specific column chunks requested. This drastically reduces disk I/O, generating faster query responses and lowering costs.&lt;/p&gt;
&lt;h2&gt;Dictionary Encoding and Compression&lt;/h2&gt;
&lt;p&gt;Data analytics often involves reading repetitive categorizations. Consider a status column containing millions of rows that say either &amp;quot;Active&amp;quot;, &amp;quot;Pending&amp;quot;, or &amp;quot;Cancelled&amp;quot;. Storing those full strings over and over wastes massive amounts of space.&lt;/p&gt;
&lt;p&gt;Parquet handles low-cardinality repetitive data using Dictionary Encoding. Instead of writing &amp;quot;Cancelled&amp;quot; millions of times, Parquet creates a small dictionary in the file&apos;s metadata mapping &amp;quot;Active&amp;quot; to &lt;code&gt;0&lt;/code&gt;, &amp;quot;Pending&amp;quot; to &lt;code&gt;1&lt;/code&gt;, and &amp;quot;Cancelled&amp;quot; to &lt;code&gt;2&lt;/code&gt;. The actual data pages simply store a list of these tiny integers.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-lakehouse/parquet-dictionary-encoding.png&quot; alt=&quot;Diagram of Dictionary Encoding mapping text strings to small integer identifiers&quot;&gt;&lt;/p&gt;
&lt;p&gt;Beyond encoding, columnar storage inherently improves compression. Algorithms like Snappy, Zstd, and GZIP search for repeating patterns to compress data. A column of integers looks incredibly repetitive and compresses tightly. A row containing an integer, a string, a date, and a boolean does not. Storing homogeneous data together allows Parquet files to consume a fraction of the space of their dense CSV equivalents.&lt;/p&gt;
&lt;h2&gt;Predicate Pushdown and Row Group Skipping&lt;/h2&gt;
&lt;p&gt;Perhaps Parquet&apos;s greatest distinct advantage is that its files are entirely self-describing. When a system writes Parquet data, it also computes and stores statistical metadata in the file&apos;s footer.&lt;/p&gt;
&lt;p&gt;The footer contains the minimum value, maximum value, and null counts for every column within every row group. When you issue a query with a filter: like &lt;code&gt;WHERE transaction_amount &amp;gt; 1000&lt;/code&gt;, the query engine reads the footer first. This process is called Predicate Pushdown.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-lakehouse/parquet-predicate-pushdown.png&quot; alt=&quot;Diagram of Predicate Pushdown showing the engine skipping a row group based on min/max stats&quot;&gt;&lt;/p&gt;
&lt;p&gt;If the footer reveals that the highest transaction amount in Row Group 1 is 500, the engine simply skips reading Row Group 1 entirely. The engine only pulls data from row groups containing values that might satisfy the query. This optimization turns broad multi-gigabyte table scans into highly targeted micro-reads.&lt;/p&gt;
&lt;h2&gt;Parquet&apos;s Role in the Open Source Lakehouse&lt;/h2&gt;
&lt;p&gt;Apache Parquet provides the physical storage engine for the data lakehouse. It ensures that data remains highly compressed and brutally efficient to read.&lt;/p&gt;
&lt;p&gt;However, pure Parquet files are immutable. You cannot natively issue an &lt;code&gt;UPDATE&lt;/code&gt; or &lt;code&gt;DELETE&lt;/code&gt; statement against a raw Parquet file to fix a typo. To treat these static, high-performance files like a living, mutating database, you need a table format running on top of them. That is the role of Apache Iceberg.&lt;/p&gt;
&lt;p&gt;To experience query execution directly against Parquet data stored in your own object storage, &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;try Dremio Cloud free for 30 days&lt;/a&gt;. Dremio&apos;s vectorized query engine reads Parquet data aggressively, allowing you to ask questions in plain English and receive instant analytical results.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>What is Apache Polaris? Unifying the Iceberg Ecosystem</title><link>https://iceberglakehouse.com/posts/2026-03-07-apache-polaris/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-07-apache-polaris/</guid><description>
_Read the complete Open Source and the Lakehouse series:_

- [Part 1: Apache Software Foundation: History, Purpose, and Process](/posts/2026-03-07-ap...</description><pubDate>Sat, 07 Mar 2026 12:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;em&gt;Read the complete Open Source and the Lakehouse series:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-software-foundation/&quot;&gt;Part 1: Apache Software Foundation: History, Purpose, and Process&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-parquet/&quot;&gt;Part 2: What is Apache Parquet?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-iceberg/&quot;&gt;Part 3: What is Apache Iceberg?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-polaris/&quot;&gt;Part 4: What is Apache Polaris?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-arrow/&quot;&gt;Part 5: What is Apache Arrow?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-assembling-apache-lakehouse/&quot;&gt;Part 6: Assembling the Apache Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-agentic-analytics/&quot;&gt;Part 7: Agentic Analytics on the Apache Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Treating thousands of Parquet files as a unified database table requires a brain. Apache Iceberg provides the metadata structure to do this, but the Iceberg specification alone does not spin up a server, manage security roles, or handle network requests. You need a catalog service to orchestrate those root metadata pointers.&lt;/p&gt;
&lt;p&gt;Until recently, that catalog layer threatened to fragment the entire lakehouse vision. Vendors began building their own proprietary catalogs to track Iceberg tables, trapping users in the exact data silos Iceberg promised to eliminate. Apache Polaris solves that fracture.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Catalog Fragmentation War&lt;/h2&gt;
&lt;p&gt;The core promise of the modern data lakehouse is compute independence. In traditional database architectures, data storage and query processing are bundled within a single proprietary system. You cannot run queries against your data without using the query planner, query optimizer, and execution engine provided by that database vendor. The lakehouse architecture breaks this monopoly by storing data in open, vendor-neutral file formats (such as Apache Parquet) and table formats (such as Apache Iceberg) inside open cloud storage buckets.&lt;/p&gt;
&lt;p&gt;With the files stored openly, you can query your data using any engine: Apache Spark for batch ETL, Apache Flink for real-time streaming, Trino for ad-hoc queries, and Dremio for high-performance interactive business intelligence.&lt;/p&gt;
&lt;p&gt;However, table formats rely on a metadata catalog to track the current state of a table. A catalog acts as a centralized database pointer registry. When a write engine commits a transaction, it writes a new metadata file to object storage and updates the catalog pointer to reference this new file. If multiple engines attempt to write to a table without a shared coordinator catalog, they will write competing metadata files, leading to split-brain states, overwrites, and data corruption.&lt;/p&gt;
&lt;p&gt;As Apache Iceberg gained mass adoption across the enterprise, the catalog layer became a critical strategic battleground. Legacy data platform vendors quickly realized that while they could no longer force customers to store data in proprietary file formats, they could still lock customers into their ecosystems by controlling the catalog.&lt;/p&gt;
&lt;p&gt;Vendors began wrapping Iceberg tables in proprietary catalog managers. Under this setup, if a client engine wanted to query a table, it had to connect to the vendor&apos;s proprietary catalog service. If you wanted to ingest data using Flink and query it using Dremio, you had to build complex sync processes to replicate metadata between the Flink catalog registry and the Dremio catalog registry. If the sync lagged, Trino and Dremio would query stale metadata, producing incorrect query results.&lt;/p&gt;
&lt;p&gt;This metadata synchronization overhead created new data silos. Organizations found themselves managing multiple catalogs, with each compute engine maintaining a separate view of the lakehouse state. The open promise of the lakehouse was compromised. To restore compute-storage independence, the industry required a standardized, open catalog protocol. This standard emerged as the Apache Iceberg REST Catalog specification.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Iceberg REST API Standard&lt;/h2&gt;
&lt;p&gt;The Iceberg community addressed the catalog fragmentation problem by defining a standard REST API specification. Instead of defining a specific backend database implementation, the REST Catalog specification defines the HTTP endpoints, request headers, query parameters, and JSON payloads that clients and servers must use to communicate table metadata.&lt;/p&gt;
&lt;p&gt;This API-first approach changed how compute engines integrate with catalogs. Previously, support for a new catalog required writing custom Java connector classes for every query engine. If you wanted to use a custom database catalog, you had to write and maintain catalog integrations for Spark, Flink, Trino, and Presto.&lt;/p&gt;
&lt;p&gt;Under the REST specification, query engines implement the REST client interface once. Any catalog server that implements the REST HTTP endpoints can serve metadata to any REST-compliant client engine instantly.&lt;/p&gt;
&lt;p&gt;Apache Polaris is a fully featured, open-source backend implementation of this Iceberg REST Catalog specification. It provides a stateless, scalable catalog service that manages table metadata and access control policies while complying with the open API spec.&lt;/p&gt;
&lt;p&gt;Because Polaris adheres strictly to the REST standard, it acts as a universal adapter for the lakehouse. A Python script using PyIceberg can resolve namespace paths, a Spark batch job can write data, and a Dremio query coordinator can perform query planning, all routing requests through a single Polaris catalog endpoint. By serving as a unified metadata registry, Polaris eliminates catalog duplication and ensures that all engines see a consistent, real-time snapshot of table states.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Solution of Open Governance and the ASF&lt;/h2&gt;
&lt;p&gt;In the technology industry, open source is not always synonymous with open governance. A software project can be open source, allowing you to view and download its source code for free, while its roadmap, licensing terms, and release cycles remain controlled by a single commercial vendor. If that vendor decides to change the license of future releases or deprecate integrations that compete with its paid offerings, community users have little recourse.&lt;/p&gt;
&lt;p&gt;To prevent commercial capture of the lakehouse brain, the co-creators of Polaris (Dremio and Snowflake) donated the project to the Apache Software Foundation (ASF) as an incubating project.&lt;/p&gt;
&lt;p&gt;This donation was a critical milestone for the lakehouse ecosystem. The ASF is a non-profit corporation that provides organizational, legal, and financial support for open-source software projects. The foundation operates under a strict model of open governance known as &amp;quot;The Apache Way.&amp;quot; Under this model, project decisions are made by a diverse Project Management Committee (PMC) composed of individual contributors, rather than a single corporate entity. No single vendor can monopolize the project roadmap or restrict access to its integrations.&lt;/p&gt;
&lt;p&gt;Open governance protects enterprise investments in several ways:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Vendor Neutrality&lt;/strong&gt;: The ASF legally owns the Polaris trademark, code repositories, and documentation. No commercial vendor can alter the licensing terms or lock key features behind proprietary tiers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Community-Driven Roadmap&lt;/strong&gt;: Feature priorities are decided through open consensus, ensuring the catalog evolves in a direction that benefits the entire ecosystem rather than a single vendor&apos;s commercial strategy.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Engine-Agnostic Design&lt;/strong&gt;: Because no single query engine vendor controls the project, Polaris maintains equal integration quality for all engines, preventing favoritism toward specific compute platforms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Long-Term Viability&lt;/strong&gt;: If a commercial sponsor shifts its focus, the community and the PMC can continue maintaining and developing the project under the ASF umbrella, preventing project abandonment.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By placing Polaris under ASF governance, the community established a neutral foundation for lakehouse metadata management, guaranteeing that the catalog layer remains open and accessible to all.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Deep Dive into Polaris RBAC Hierarchy and Internals&lt;/h2&gt;
&lt;p&gt;Enterprise data platforms require granular security controls to prevent unauthorized access to sensitive datasets. Apache Polaris provides a robust, hierarchical Role-Based Access Control (RBAC) model designed specifically for metadata governance.&lt;/p&gt;
&lt;p&gt;Unlike legacy access control models that secure data based on physical storage paths, Polaris defines privileges at the logical metadata level (catalogs, namespaces, tables, and views). This separation ensures that security policies remain consistent regardless of the compute engine or cloud storage region used to access the data.&lt;/p&gt;
&lt;p&gt;The Polaris RBAC model consists of five key entities:&lt;/p&gt;
&lt;h3&gt;1. Principals&lt;/h3&gt;
&lt;p&gt;A principal is an identity that requests access to catalog resources. In Polaris, principals can represent human users, query engine connections, ETL pipelines, or automated scripts. Each principal is assigned a set of client credentials (a Client ID and Client Secret) used to authenticate via the OAuth2 token endpoint.&lt;/p&gt;
&lt;h3&gt;2. Principal Roles&lt;/h3&gt;
&lt;p&gt;A principal role is a logical grouping of permissions that can be assigned to one or more principals. For example, you can create a principal role named &lt;code&gt;etl_developer&lt;/code&gt; for data engineers and a principal role named &lt;code&gt;business_analyst&lt;/code&gt; for report creators. A principal can be assigned multiple principal roles.&lt;/p&gt;
&lt;h3&gt;3. Catalog Roles&lt;/h3&gt;
&lt;p&gt;A catalog role is a scope-restricted role defined within a specific catalog instance. Catalog roles represent functional access rights to metadata resources, such as &lt;code&gt;sales_read_only&lt;/code&gt; or &lt;code&gt;finance_administrator&lt;/code&gt;. Catalog roles are mapped to Principal Roles to grant actual access to principals.&lt;/p&gt;
&lt;h3&gt;4. Securable Objects&lt;/h3&gt;
&lt;p&gt;Securable objects are the logical resources managed by Polaris. The objects are organized in a strict hierarchical structure:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Catalog&lt;/strong&gt;: The top-level container (e.g., &lt;code&gt;production_catalog&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Namespace&lt;/strong&gt;: Logical schemas or databases within a catalog (e.g., &lt;code&gt;production_catalog.sales_data&lt;/code&gt;). Namespaces can be hierarchical (e.g., &lt;code&gt;production_catalog.sales_data.invoices&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Table&lt;/strong&gt;: The physical datasets containing the data records.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;View&lt;/strong&gt;: Saved query definitions that present logical tables.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. Privileges&lt;/h3&gt;
&lt;p&gt;Privileges are the specific actions allowed on securable objects. Polaris supports a detailed set of privileges, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CATALOG_CREATE&lt;/code&gt;: Permission to create new catalog instances.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;NAMESPACE_CREATE&lt;/code&gt;: Permission to create namespaces within a catalog.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;NAMESPACE_WRITE&lt;/code&gt;: Permission to alter namespace properties.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;TABLE_READ&lt;/code&gt;: Permission to resolve table schemas, snapshots, and read underlying data files.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;TABLE_WRITE&lt;/code&gt;: Permission to commit new snapshots and write data files.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;TABLE_DROP&lt;/code&gt;: Permission to delete tables from the catalog.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;The RBAC Mapping Flow&lt;/h3&gt;
&lt;p&gt;To grant a query engine access to a table, Polaris administrators construct a mapping chain:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[Principal: spark-executor]
       │ (belongs to)
       ▼
[Principal Role: IngestionEngine]
       │ (mapped to)
       ▼
[Catalog Role: SalesDataWriter]
       │ (granted privilege: TABLE_WRITE on)
       ▼
[Securable Object: catalog.sales.invoices]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This decoupled mapping structure provides significant administrative benefits. If you decide to migrate your data ingestion pipelines from Apache Spark to Apache Flink, you do not need to modify storage bucket policies or table-level permissions. You simply create a new principal for Flink, assign it to the existing &lt;code&gt;IngestionEngine&lt;/code&gt; Principal Role, and the new engine instantly inherits the required write privileges.&lt;/p&gt;
&lt;p&gt;Furthermore, Polaris enforces privilege inheritance down the logical hierarchy. If you grant the &lt;code&gt;SalesDataReader&lt;/code&gt; catalog role the &lt;code&gt;TABLE_READ&lt;/code&gt; privilege at the namespace level (e.g., &lt;code&gt;catalog.sales&lt;/code&gt;), that role automatically inherits the read privilege for all tables and views created within that namespace, simplifying security management for large-scale data lakes.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Credential Vending vs. IAM Policy Sprawl&lt;/h2&gt;
&lt;p&gt;Securing a data lakehouse requires managing access to the physical cloud storage buckets (such as Amazon S3, Google Cloud Storage, or Azure Data Lake Storage) where the Parquet data files reside. Historically, organizations secured these buckets using one of two methods, both of which introduce severe security vulnerabilities.&lt;/p&gt;
&lt;p&gt;The first method is distributing long-lived cloud access keys (such as AWS Access Keys and Secret Keys) to every query engine and compute cluster. In this model, the Spark configuration, Trino properties files, and developer notebooks are hardcoded with storage credentials. This approach creates a massive security risk. If a single developer notebook is compromised, the long-lived storage credentials are leaked, allowing unauthorized actors to bypass the catalog entirely and read or delete raw files directly from the cloud storage bucket.&lt;/p&gt;
&lt;p&gt;The second method is creating complex, path-based IAM policies (such as AWS IAM Policies) for each compute engine. For instance, the marketing Spark cluster is assigned an IAM role that allows access to &lt;code&gt;s3://my-bucket/marketing/*&lt;/code&gt;, while the finance cluster is restricted to &lt;code&gt;s3://my-bucket/finance/*&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This approach leads to what security architects call &amp;quot;IAM Policy Sprawl.&amp;quot; As the number of tables, departments, and compute engines grows, the number of IAM policies multiplies, creating an administrative bottleneck. Cloud providers enforce strict limits on the size and number of IAM policies, forcing administrators to use overly broad wildcard policies (e.g., &lt;code&gt;s3://my-bucket/*&lt;/code&gt;) to keep up with request volume. This violates the security principle of least privilege.&lt;/p&gt;
&lt;p&gt;Furthermore, path-based IAM policies cannot enforce relational table security. An IAM policy can only restrict access to folder paths; it cannot enforce schema validation, detect snapshot modifications, or prevent a user from reading raw Parquet files directly while bypassing table-level access logs.&lt;/p&gt;
&lt;h3&gt;The Polaris Credential Vending Solution&lt;/h3&gt;
&lt;p&gt;Apache Polaris resolves these security challenges using a process called credential vending. Under this model, compute engines do not hold long-lived storage credentials. Instead, the Polaris catalog server acts as a secure credential broker between the query engines and the cloud storage provider.&lt;/p&gt;
&lt;p&gt;The credential vending sequence proceeds as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Authentication&lt;/strong&gt;: The query engine authenticates with Polaris using its OAuth2 client credentials and requests the metadata location for a specific table (e.g., &lt;code&gt;sales.invoices&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authorization&lt;/strong&gt;: Polaris validates the client&apos;s RBAC mapping, verifying that the principal has the &lt;code&gt;TABLE_READ&lt;/code&gt; privilege for the requested table.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Token Request&lt;/strong&gt;: Polaris contacts the cloud provider&apos;s token service (such as AWS STS, Azure Active Directory, or Google Cloud IAM) using its own highly authorized identity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Token Scoping&lt;/strong&gt;: Polaris requests a set of temporary security credentials, attaching a strict session policy that restricts read and write operations to the exact storage folder where the table&apos;s Parquet files reside (e.g., &lt;code&gt;s3://my-bucket/sales/invoices/*&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Token Delivery&lt;/strong&gt;: The cloud token service returns the temporary credentials (which typically expire in 15 minutes) to Polaris.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metadata and Credential Return&lt;/strong&gt;: Polaris packages the table&apos;s metadata location, schema definition, and the temporary storage credentials into a standard JSON response and returns it to the query engine.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Direct Storage Access&lt;/strong&gt;: The query engine reads the Parquet data files directly from the storage bucket using the temporary credentials and discards them when the query execution finishes.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;┌──────────────┐         1. GET Table Metadata          ┌─────────────────┐
│              ├───────────────────────────────────────&amp;gt;│                 │
│              │                                        │                 │
│ Query Engine │         6. Return Metadata + Token     │ Apache Polaris  │
│   (Client)   │&amp;lt;───────────────────────────────────────┤ (REST Catalog)  │
│              │                                        │                 │
└──────┬───────┘                                        └────────┬────────┘
       │                                                         │
       │                                   2. Verify RBAC        │ 3. Request
       │                                   &amp;amp; Table Paths         │   Scoped Token
       │ 7. Direct Read/Write                                    │
       │    (Temporary Scoped Token)                             ▼
       │                                                ┌─────────────────┐
       ▼                                                │  Cloud Provider │
┌──────────────┐                                        │  Token Service  │
│ Cloud Object │                                        │    (AWS STS)    │
│   Storage    │&amp;lt;───────────────────────────────────────┤                 │
│  (S3/ADLS)   │          5. Vend Scoped Token          │                 │
└──────────────┘             (Expiry = 15m)             └─────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This model provides major security improvements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No Permanent Secrets&lt;/strong&gt;: Compute engines never hold long-lived access keys, eliminating the risk of credential leaks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Micro-Segmented Access&lt;/strong&gt;: Access is restricted to the exact folder containing the requested table files. A user running queries in Spark cannot access files in adjacent folders within the same bucket.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Relational Integrity&lt;/strong&gt;: Storage access is granted only after Polaris validates the schema and transaction requirements, preventing users from bypassing the catalog metadata layer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Simplified Administration&lt;/strong&gt;: Cloud security administrators only need to manage a single IAM trust relationship for the Polaris server itself, rather than managing hundreds of individual compute engine policies.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By centralizing access control at the metadata catalog layer, Polaris eliminates IAM policy sprawl and provides a secure, audited boundary for cloud data lakes.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Polaris in the Agentic Lakehouse&lt;/h2&gt;
&lt;p&gt;As organizations adopt artificial intelligence and automated decision-making workflows, query patterns are shifting. In addition to human analysts running dashboards, platforms are increasingly queried by autonomous AI agents.&lt;/p&gt;
&lt;p&gt;An agentic lakehouse is an architecture where AI agents, powered by Large Language Models (LLMs) and advanced query planners, automatically explore metadata, generate SQL queries, execute analysis, and write results back to the lakehouse storage layer.&lt;/p&gt;
&lt;p&gt;While agentic workflows promise major productivity gains, they introduce unique security and operational risks:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Query Hallucinations&lt;/strong&gt;: An AI agent might generate a malformed or destructive SQL query, such as attempting to write garbage data to a production table or executing drop commands.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Access Exfiltration&lt;/strong&gt;: If an AI agent has broad storage access, it can query sensitive namespaces, potentially exposing private customer information or proprietary financials.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lack of Auditing&lt;/strong&gt;: Standard data lake setups struggle to track whether a query was executed by a human analyst or an automated AI agent, complicating compliance audits.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Dremio Semantic Integration and Polaris Routing&lt;/h3&gt;
&lt;p&gt;To secure agentic workflows, organizations integrate Polaris with Dremio&apos;s semantic layer. Dremio acts as the intelligent gateway and query execution planner, while Polaris enforces governance and vends storage tokens.&lt;/p&gt;
&lt;p&gt;The architecture operates as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;┌────────────────┐
│   AI Agent /   │
│   LLM Planner  │
└───────┬────────┘
        │ 1. Natural Language Query
        ▼
┌────────────────┐
│ Dremio Semantic│
│     Layer      │
└───────┬────────┘
        │ 2. Authenticate &amp;amp; Resolve Paths
        ▼
┌────────────────┐
│ Apache Polaris │
│ (REST Catalog) │
└───────┬────────┘
        │ 3. Validate RBAC &amp;amp; Vend Temporary Token
        ▼
┌────────────────┐
│ Cloud Storage  │
│  (S3 / ADLS)   │
└────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When an AI agent needs to answer a business question, it sends a natural language query to Dremio. Natural language interface interactions are inherently unpredictable, making security verification crucial. Dremio&apos;s semantic layer maps the request to logical business tables, resolving field-level aliases and translating the request into optimized, standard SQL queries.&lt;/p&gt;
&lt;p&gt;Before executing the query, Dremio contacts Polaris to resolve the base table metadata and verify access permissions. This verification step is completed prior to initiating any compute or storage operations, preventing wasteful resource consumption on unauthorized queries. Polaris evaluates the RBAC policy mapped to the AI agent&apos;s service principal. If the AI agent is not authorized to access the specific namespace or table, Polaris rejects the request at the metadata level, preventing the query from starting and avoiding data exposure.&lt;/p&gt;
&lt;p&gt;If authorized, Polaris vends a temporary storage token scoped strictly to the Parquet files required for the query. Dremio&apos;s distributed executors fetch the data files using the token, perform the query calculations, and return the aggregated results to the AI agent.&lt;/p&gt;
&lt;p&gt;This integrated approach provides critical guardrails:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Semantic Safety&lt;/strong&gt;: AI agents query logical views in Dremio rather than raw files, preventing direct access to physical storage paths.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deterministic Governance&lt;/strong&gt;: Polaris enforces access policies, ensuring that AI agents cannot execute queries beyond their authorized role boundaries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Traceable Audits&lt;/strong&gt;: Because Polaris logs every REST handshake and OAuth token exchange, compliance teams can audit the exact table paths accessed by specific AI agent principals.
By combining Dremio&apos;s query acceleration and semantic definitions with Polaris metadata security, organizations can deploy automated AI agents with confidence, knowing their lakehouse data remains secure.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;Architectural Comparison: Polaris vs. Nessie vs. Hive Metastore&lt;/h2&gt;
&lt;p&gt;When selecting a metadata registry for an Apache Iceberg lakehouse, organizations typically evaluate three primary open-source options: Apache Polaris, Project Nessie, and the legacy Apache Hive Metastore (HMS). Understanding the architectural design trade-offs of each system is critical for choosing the right catalog for your enterprise.&lt;/p&gt;
&lt;h3&gt;1. Apache Hive Metastore (HMS)&lt;/h3&gt;
&lt;p&gt;The Hive Metastore was designed in the early days of Apache Hadoop to map relational table schemas to directories of files in a distributed file system. HMS uses a Thrift-based RPC protocol and persists its catalog state in a relational database (such as PostgreSQL or MySQL).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Mature, widely supported by legacy query engines, and familiar to data platform teams.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Scale bottlenecks because Thrift calls are synchronous and heavy. It does not natively support the Iceberg REST API specification, requiring custom client-side connectors. It has no capability for credential vending, meaning query engines must hold long-lived credentials to the physical storage bucket.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. Project Nessie&lt;/h3&gt;
&lt;p&gt;Project Nessie is a transaction catalog for Iceberg that brings Git-like version control to data lakes. Nessie tracks catalog state as a commit graph, allowing developers to create branches, merge changes, and roll back table updates.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Native support for branching, merging, and multi-table transactions (e.g., executing ETL in an isolated branch and merging it to the main branch atomically).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Higher operational complexity. Query engines must support the Nessie catalog client to utilize version control features. Nessie does not support native credential vending, leaving storage-level access control to be managed externally via cloud IAM roles or credentials.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Apache Polaris&lt;/h3&gt;
&lt;p&gt;Apache Polaris is built from the ground up as a stateless, highly scalable metadata catalog implementing the standard Iceberg REST Catalog specification.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Direct adherence to the open REST API standard, guaranteeing immediate compatibility with all modern query engines. Native credential vending protects the object storage layer from credential leakage. The fine-grained RBAC model simplifies metadata governance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cons&lt;/strong&gt;: It does not natively support Git-like data versioning (branching and merging) at the catalog level.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;Deep Dive: The OAuth2 Client Credentials Handshake&lt;/h2&gt;
&lt;p&gt;To understand how query engines secure their sessions when interacting with Apache Polaris, we can walk through the OAuth2 token exchange handshake. This protocol ensures that access keys are short-lived and tied to specific principal roles.&lt;/p&gt;
&lt;h3&gt;1. The Token Request&lt;/h3&gt;
&lt;p&gt;When a query engine starts up, it initiates the connection by executing an HTTP POST request to the token endpoint. The engine transmits its client ID and client secret, requesting a bearer token.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -X POST https://polaris.example.com/api/catalog/v1/oauth/tokens \
  -H &amp;quot;Content-Type: application/x-www-form-urlencoded&amp;quot; \
  -d &amp;quot;grant_type=client_credentials&amp;quot; \
  -d &amp;quot;client_id=principal_client_id_123&amp;quot; \
  -d &amp;quot;client_secret=principal_client_secret_abc&amp;quot; \
  -d &amp;quot;scope=PRINCIPAL_ROLE:data_engineer_role&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. The Token Validation and Response&lt;/h3&gt;
&lt;p&gt;The Polaris catalog server intercepts this request, validates the client credentials against its database, and verifies that the requested principal role matches the configurations. If valid, Polaris generates a cryptographically signed JSON Web Token (JWT) representing the session.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;access_token&amp;quot;: &amp;quot;eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJwb2xhcmlzIiwic3ViIjoicHJpbmNpcGFsXzEyMyIsImV4cCI6MTcxNjM4OTkwMCwicm9sZXMiOlsiZGF0YV9lbmdpbmVlcl9yb2xlIl19.signature&amp;quot;,
  &amp;quot;token_type&amp;quot;: &amp;quot;bearer&amp;quot;,
  &amp;quot;expires_in&amp;quot;: 3600
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Subsequent API Requests&lt;/h3&gt;
&lt;p&gt;The query engine extracts the returned &lt;code&gt;access_token&lt;/code&gt; and caches it locally. For all subsequent metadata requests (such as listing tables or loading schema snapshots), the engine includes this token in the Authorization header of the HTTP requests:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -X GET https://polaris.example.com/api/catalog/v1/catalogs/sales_warehouse/namespaces/analytics/tables \
  -H &amp;quot;Authorization: Bearer eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the token expires or is rejected by the server, the client engine automatically restarts the OAuth2 handshake, ensuring continuous catalog connectivity without requiring human administrative intervention.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Ecosystem Best Practices&lt;/h2&gt;
&lt;p&gt;To successfully deploy and manage Apache Polaris at scale, data engineering teams should adhere to the following architectural best practices:&lt;/p&gt;
&lt;h3&gt;1. Catalog Segmentation&lt;/h3&gt;
&lt;p&gt;Avoid registering all company tables in a single catalog instance. Instead, segment catalogs based on organizational boundaries (e.g., &lt;code&gt;finance_catalog&lt;/code&gt;, &lt;code&gt;marketing_catalog&lt;/code&gt;, &lt;code&gt;sales_catalog&lt;/code&gt;) or environments (&lt;code&gt;dev_catalog&lt;/code&gt;, &lt;code&gt;staging_catalog&lt;/code&gt;, &lt;code&gt;prod_catalog&lt;/code&gt;). This separation isolates metadata boundaries and reduces the blast radius of administrative errors.&lt;/p&gt;
&lt;h3&gt;2. Multi-Level Namespace Hierarchies&lt;/h3&gt;
&lt;p&gt;Structure your namespaces logically to take advantage of Polaris RBAC inheritance. A recommended hierarchy is:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;catalog.environment.department.dataset_name&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;prod_catalog.analytics.finance.quarterly_invoices&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;By organizing namespaces this way, you can grant read access to the entire &lt;code&gt;finance&lt;/code&gt; namespace to finance analysts, and they will automatically inherit read rights for new tables added to that namespace without manual intervention.&lt;/p&gt;
&lt;h3&gt;3. Tuning Token Time-To-Live (TTL)&lt;/h3&gt;
&lt;p&gt;Tuning token lifetimes is a balance between security and performance. Shorter TTLs (e.g., 5 to 15 minutes) minimize the window of exposure for vended storage credentials, but they force compute engines to contact Polaris more frequently to refresh tokens, increasing API load. For high-volume streaming ingest workloads, set token TTLs between 30 and 60 minutes to minimize handshake overhead. For ad-hoc analytics and developer notebooks, keep token TTLs short (15 minutes or less) to maximize security.&lt;/p&gt;
&lt;h3&gt;4. Back-End Database Replication and Backup&lt;/h3&gt;
&lt;p&gt;In production deployments, Polaris containers are stateless. The state of your catalogs, namespaces, roles, and table pointers is stored in the backing relational database (configured via EclipseLink JDBC). Treat this database as a tier-one production system. Implement regular backups, enable multi-region read replicas, and configure automated failover to prevent catalog outages from disabling your entire query engine infrastructure.&lt;/p&gt;
&lt;h3&gt;5. Client-Side Caching Configuration&lt;/h3&gt;
&lt;p&gt;Configure query engines (Spark, Trino, Dremio) to cache metadata locally during query planning phases. While the REST API is fast, making repeated HTTP calls to Polaris for every sub-task in a distributed query plan can saturate network interfaces. Client-side metadata caching reduces catalog latency and improves overall query compilation times.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Apache Polaris represents a major advancement in the maturity of the modern data lakehouse. By providing a production-grade, vendor-neutral implementation of the Iceberg REST Catalog specification, Polaris prevents catalog lock-in and guarantees true compute-storage independence.&lt;/p&gt;
&lt;p&gt;Its robust role-based access control, secure credential vending mechanism, and seamless integration with high-performance query engines like Dremio ensure that organizations can govern their datasets without compromising query speeds. As data platforms transition to automated, agent-driven architectures, a centralized, open metadata brain like Polaris becomes an essential pillar for secure, scalable analytics.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s built-in Open Catalog is built natively on Apache Polaris. When you sign up, you get a production-ready, vendor-neutral Polaris catalog deployed instantly. &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; to query your data without creating proprietary metadata silos.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Apache Software Foundation: History, Purpose, and Process</title><link>https://iceberglakehouse.com/posts/2026-03-07-apache-software-foundation/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-07-apache-software-foundation/</guid><description>
_Read the complete Open Source and the Lakehouse series:_

- [Part 1: Apache Software Foundation](/posts/2026-03-07-apache-software-foundation/)
- [P...</description><pubDate>Sat, 07 Mar 2026 12:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;em&gt;Read the complete Open Source and the Lakehouse series:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-software-foundation/&quot;&gt;Part 1: Apache Software Foundation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-parquet/&quot;&gt;Part 2: What is Apache Parquet?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-iceberg/&quot;&gt;Part 3: What is Apache Iceberg?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-polaris/&quot;&gt;Part 4: What is Apache Polaris?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-arrow/&quot;&gt;Part 5: What is Apache Arrow?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-assembling-apache-lakehouse/&quot;&gt;Part 6: Assembling the Apache Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-agentic-analytics/&quot;&gt;Part 7: Agentic Analytics on the Apache Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you build a modern data lakehouse, you inevitably stack Apache Iceberg, Apache Parquet, and Apache Arrow. These projects dictate how you store, query, and govern petabytes of data. But the code itself is only half the story. The legal and operational framework supporting that code dictates whether a project survives for decades or gets hijacked by a single vendor.&lt;/p&gt;
&lt;p&gt;That framework is the Apache Software Foundation. The ASF provides the structural immunity that prevents any one company from controlling the open source stack. Understanding how the ASF operates helps you evaluate the longevity and neutrality of the tools powering your lakehouse.&lt;/p&gt;
&lt;h2&gt;The Origins of the Apache Software Foundation&lt;/h2&gt;
&lt;p&gt;The web runs on software. In 1995, an informal collective of eight developers began collaborating on patches for the NCSA HTTPd web server. They called themselves the &amp;quot;Apache Group.&amp;quot; Their work eventually became the Apache HTTP Server, which powered the early internet expansion.&lt;/p&gt;
&lt;p&gt;As the software gained massive corporate adoption, the group faced a structural problem. An informal collective cannot legally hold copyrights, accept corporate donations, or shield individual volunteer developers from lawsuits.&lt;/p&gt;
&lt;p&gt;To solve this, the group incorporated the Apache Software Foundation in 1999 as a U.S. 501(c)(3) non-profit public charity. The foundation exists to provide software for the public good. It acts as an independent legal shield that takes taking legal and financial ownership so that developers can focus entirely on code. Today, the ASF stewards hundreds of projects spanning big data, artificial intelligence, and cloud infrastructure.&lt;/p&gt;
&lt;h2&gt;The Apache Way: Community Over Code&lt;/h2&gt;
&lt;p&gt;The ASF operates on a unique philosophy known as &amp;quot;The Apache Way.&amp;quot; The core tenet is simple: a healthy community is more important than good code. A toxic but brilliant contributor poses a greater risk to a project&apos;s survival than a mediocre codebase.&lt;/p&gt;
&lt;p&gt;Meritocracy drives the Apache Way. You cannot buy a seat on a project&apos;s decision-making board. Contributors must earn authority by submitting code, writing documentation, and helping others on the mailing lists.&lt;/p&gt;
&lt;p&gt;Crucially, individuals participate in the ASF as individuals. They do not act as representatives of their employers. This strict firewall prevents corporations from buying influence. Projects make decisions openly on public mailing lists through consensus. If an action is not recorded on the mailing list, it did not happen.&lt;/p&gt;
&lt;h2&gt;The Apache Incubator Process&lt;/h2&gt;
&lt;p&gt;You cannot simply hand an existing codebase to the ASF and declare it an Apache project. Every incoming project must pass through the Apache Incubator.&lt;/p&gt;
&lt;p&gt;When a project enters the incubator, it becomes a &amp;quot;podling.&amp;quot; The incubator Project Management Committee assigns experienced Apache members as mentors to guide the podling.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-lakehouse/asf-incubation-process.png&quot; alt=&quot;The Apache Incubation Process flow showing Podling, Mentorship, IP Clearance, and Graduation to TLP&quot;&gt;&lt;/p&gt;
&lt;p&gt;During incubation, the project community must prove they can operate under The Apache Way. They must transition all intellectual property to the ASF, which involves relicensing the code under the permissive Apache License 2.0. They also must demonstrate that their contributor base is diverse and not dominated by a single company.&lt;/p&gt;
&lt;p&gt;Once a podling proves its community is resilient, legally clear, and self-governing, it applies for graduation. The ASF board grants approval, elevating the project to a Top-Level Project (TLP). The project then operates autonomously under its own Project Management Committee.&lt;/p&gt;
&lt;h2&gt;Apache Software Foundation vs. Linux Foundation&lt;/h2&gt;
&lt;p&gt;The ASF and the Linux Foundation frequently appear alongside each other, but they operate under entirely different models. Both are vital to open source software, but they serve different purposes.&lt;/p&gt;
&lt;p&gt;The ASF is a 501(c)(3) public charity focused on grassroots community incubation. The Linux Foundation is a 501(c)(6) trade organization that acts as a consortium for massive industry collaboration.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&quot;text-align:left&quot;&gt;Feature&lt;/th&gt;
&lt;th style=&quot;text-align:left&quot;&gt;Apache Software Foundation (ASF)&lt;/th&gt;
&lt;th style=&quot;text-align:left&quot;&gt;Linux Foundation (LF)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&quot;text-align:left&quot;&gt;&lt;strong&gt;Organizational Model&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;501(c)(3) charity&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;501(c)(6) trade organization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;text-align:left&quot;&gt;&lt;strong&gt;Members&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Individuals&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Corporations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;text-align:left&quot;&gt;&lt;strong&gt;Governance&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Decentralized Project Management Committees&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Centralized Technical Steering Committees&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;text-align:left&quot;&gt;&lt;strong&gt;Financial Influence&lt;/strong&gt;&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Financial donors hold zero influence&lt;/td&gt;
&lt;td style=&quot;text-align:left&quot;&gt;Large corporate sponsors often hold structured governance seats&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-lakehouse/asf-vs-lf-comparison.png&quot; alt=&quot;Comparison Diagram of ASF versus Linux Foundation showing individual vs corporate membership&quot;&gt;&lt;/p&gt;
&lt;p&gt;The Linux Foundation excels at gathering competing corporate giants to fund and stabilize core internet infrastructure like Kubernetes. Companies pay membership fees, and those fees often secure them seats on a governing board to help direct the project.&lt;/p&gt;
&lt;p&gt;The ASF strictly prohibits pay-to-play governance. A company can donate millions of dollars to the ASF, but they receive exactly zero influence over any project&apos;s technical direction. Only individual code contributors earn votes.&lt;/p&gt;
&lt;h2&gt;Why ASF Governance Matters for the Lakehouse&lt;/h2&gt;
&lt;p&gt;When you design a data lakehouse, you commit to a storage and query architecture that will last five to ten years. If a single vendor controls your data format, they can change the licensing model, slow down innovation, or force you into expensive proprietary compute engines.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-lakehouse/apache-lakehouse-umbrella.png&quot; alt=&quot;Three layers of the Apache Lakehouse stacked under the ASF Umbrella showing Parquet, Iceberg, and Arrow&quot;&gt;&lt;/p&gt;
&lt;p&gt;By building your stack on Apache Parquet for storage, Apache Iceberg for table formats, and Apache Arrow for memory processing, you mitigate that risk. Because these are Top-Level Projects at the ASF, no single company can hijack their roadmaps.&lt;/p&gt;
&lt;p&gt;The ASF ensures that the standards remain genuinely open. Competing query engines can all integrate with Iceberg and Arrow under equal conditions. Your data stays in your storage, in an open format, accessible by any engine. No lock-in.&lt;/p&gt;
&lt;p&gt;If your team is ready to run analytics on these open standards without manual tuning, start by querying your Iceberg tables centrally. &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; to deploy agentic analytics directly on your data lakehouse with zero vendor lock-in.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Assembling the Apache Lakehouse: The Modular Architecture</title><link>https://iceberglakehouse.com/posts/2026-03-07-assembling-apache-lakehouse/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-07-assembling-apache-lakehouse/</guid><description>
_Read the complete Open Source and the Lakehouse series:_

- [Part 1: Apache Software Foundation: History, Purpose, and Process](/posts/2026-03-07-ap...</description><pubDate>Sat, 07 Mar 2026 12:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;em&gt;Read the complete Open Source and the Lakehouse series:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-software-foundation/&quot;&gt;Part 1: Apache Software Foundation: History, Purpose, and Process&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-parquet/&quot;&gt;Part 2: What is Apache Parquet?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-iceberg/&quot;&gt;Part 3: What is Apache Iceberg?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-polaris/&quot;&gt;Part 4: What is Apache Polaris?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-apache-arrow/&quot;&gt;Part 5: What is Apache Arrow?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-assembling-apache-lakehouse/&quot;&gt;Part 6: Assembling the Apache Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;/posts/2026-03-07-agentic-analytics/&quot;&gt;Part 7: Agentic Analytics on the Apache Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For decades, the standard data architecture was monolithic. When you bought a data warehouse, you bought a single box where the vendor tightly coupled the storage format, the database rules, the metadata catalog, and the compute engine. If you wanted to query your data with a different tool, you had to physically extract the data from the warehouse and pay to store it somewhere else.&lt;/p&gt;
&lt;p&gt;The modular Apache Lakehouse breaks that monolith apart. By using open standards for every defining layer of the data stack, you can decouple your storage from your compute entirely.&lt;/p&gt;
&lt;h2&gt;The Four Pillars of the Open Stack&lt;/h2&gt;
&lt;p&gt;The true power of the modern data lakehouse emerges when you assemble the four foundational open-source components into a single, cohesive architecture.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;The Storage Layer (Apache Parquet):&lt;/strong&gt; At the base, you have raw object storage (like Amazon S3 or Google Cloud Storage) filled with highly compressed, columnar Parquet files. This minimizes your storage footprint and guarantees rapid I/O for analytical queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Table Format (Apache Iceberg):&lt;/strong&gt; Because Parquet files are immutable, they cannot function natively as a database. Iceberg sits directly above the storage layer, mapping those files into relational tables. It provides the ACID transactions, schema evolution, and time travel necessary to keep data highly structured.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Governance Layer (Apache Polaris):&lt;/strong&gt; To prevent catalog fragmentation, Polaris acts as the central brain. It securely manages access to the Iceberg tables, using credential vending to ensure that different compute engines can hit the same data safely and transparently via a REST API.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Execution Layer (Apache Arrow):&lt;/strong&gt; When a BI dashboard or a query engine needs the data, it processes it in RAM using Apache Arrow. This in-memory columnar format ensures zero-copy reads, eliminating the massive CPU penalties of the legacy serialization tax.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-lakehouse/four-pillar-architecture.png&quot; alt=&quot;Diagram showing the four layers stacked vertically: Parquet, Iceberg, Polaris, and Arrow&quot;&gt;&lt;/p&gt;
&lt;p&gt;This stack ensures complete vendor neutrality. Because every layer relies on an Apache Software Foundation standard, you own your data. You can swap compute engines tomorrow without migrating a single byte.&lt;/p&gt;
&lt;h2&gt;The Trap of the DIY Lakehouse&lt;/h2&gt;
&lt;p&gt;When engineering teams first understand this modular stack, the instinct is to build it manually. They stitch together open-source Spark clusters, deploy standalone Polaris containers, and point everything at their S3 buckets.&lt;/p&gt;
&lt;p&gt;That Do-It-Yourself approach provides absolute control over the infrastructure, but it introduces a massive operational trap.&lt;/p&gt;
&lt;p&gt;Apache Iceberg is incredibly powerful, but it is not self-maintaining. Every time you insert or update rows, Iceberg creates new snapshots, new manifest files, and tiny new Parquet files. If left unchecked, this bloat degrades query performance to a crawl. In a DIY build, your team must manually write, schedule, and monitor heavy Spark jobs to regularly compact small files, rewrite manifests, and vacuum expired snapshots. Your team becomes a database maintenance firm instead of a data analytics firm.&lt;/p&gt;
&lt;h2&gt;The Open Platform Approach&lt;/h2&gt;
&lt;p&gt;The enterprise alternative to a DIY build is a managed, open platform.&lt;/p&gt;
&lt;p&gt;Choosing a managed platform does not violate the &amp;quot;no vendor lock-in&amp;quot; mandate - provided the platform honors the open architecture. Dremio, for example, natively integrates all four of these Apache projects out of the box.&lt;/p&gt;
&lt;p&gt;When you deploy Dremio, you get a fully featured engine running Apache Arrow in its memory layer, querying Apache Iceberg tables stored in Apache Parquet formats, tracked by an internal Apache Polaris catalog.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/2026/apache-lakehouse/diy-vs-platform.png&quot; alt=&quot;Diagram showing an unmanaged DIY cluster versus a unified Platform orchestrating the maintenance&quot;&gt;&lt;/p&gt;
&lt;p&gt;Crucially, Dremio handles the operational burden. Features like Automatic Table Optimization quietly compact files and vacuum expired snapshots in the background, ensuring sub-second query performance without demanding custom maintenance scripts. Because the underlying data remains in open Iceberg REST formats, you are never locked into the execution engine.&lt;/p&gt;
&lt;p&gt;To bypass the engineering headaches of a DIY build and start analyzing data on a production-ready Apache architecture on day one, &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;try Dremio Cloud free for 30 days&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for Claude Desktop: A Complete Guide to MCP, Computer Use, and Local File Access</title><link>https://iceberglakehouse.com/posts/2026-03-context-claude-desktop/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-claude-desktop/</guid><description>
Claude Desktop takes everything available in Claude Web and adds three capabilities that fundamentally change how you manage context: MCP server conn...</description><pubDate>Sat, 07 Mar 2026 12:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Claude Desktop takes everything available in Claude Web and adds three capabilities that fundamentally change how you manage context: MCP server connections that link Claude to external tools and data sources, direct local file access that eliminates the upload-download cycle, and Computer Use that lets Claude interact with your desktop environment. These additions make Claude Desktop the right choice when your work requires live data, local file system access, or integration with tools that Claude Web cannot reach.&lt;/p&gt;
&lt;p&gt;This guide explains how to leverage each of Claude Desktop&apos;s context management features, when to use them, and how they complement the Projects, artifacts, and conversation patterns covered in the Claude Web guide.&lt;/p&gt;
&lt;h2&gt;What Claude Desktop Adds Over Claude Web&lt;/h2&gt;
&lt;p&gt;Claude Desktop shares the same core features as Claude Web: Projects with instructions and knowledge files, artifacts, and the same large context windows (up to 1 million tokens). The key additions are:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Claude Web&lt;/th&gt;
&lt;th&gt;Claude Desktop&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Projects&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Artifacts&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Knowledge files&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP servers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Local file access&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Upload only&lt;/td&gt;
&lt;td&gt;Direct read/write&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Computer Use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (beta)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If your work is purely knowledge-based (writing, research, analysis), Claude Web is sufficient. Switch to Claude Desktop when you need to connect Claude to your local environment or external services.&lt;/p&gt;
&lt;h2&gt;MCP Servers: The Core Differentiator&lt;/h2&gt;
&lt;p&gt;The Model Context Protocol (MCP) is what makes Claude Desktop a genuinely different tool from the web interface. MCP is an open standard that allows Claude to connect to external services, databases, file systems, and tools through standardized server implementations.&lt;/p&gt;
&lt;h3&gt;How MCP Works in Claude Desktop&lt;/h3&gt;
&lt;p&gt;Claude Desktop acts as the MCP host. You configure MCP servers in the application settings, and Claude gains access to the tools those servers expose. When Claude needs information from an external source, it calls the appropriate MCP tool, receives the results, and incorporates them into its response.&lt;/p&gt;
&lt;h3&gt;Practical MCP Use Cases&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Database Access:&lt;/strong&gt;
Connect a database MCP server to let Claude query your development database directly. Instead of copying and pasting query results, Claude can run queries itself:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Explore schema to understand your data model&lt;/li&gt;
&lt;li&gt;Run diagnostic queries when debugging&lt;/li&gt;
&lt;li&gt;Verify data after explaining a migration plan&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;File System Access:&lt;/strong&gt;
Connect a filesystem MCP server to give Claude access to specific directories on your machine. This is especially useful for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Browsing project directories without manually uploading each file&lt;/li&gt;
&lt;li&gt;Reading configuration files, logs, or data files&lt;/li&gt;
&lt;li&gt;Writing output files (reports, generated code, processed data) directly to disk&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Version Control:&lt;/strong&gt;
Connect a Git MCP server to let Claude interact with your repository:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Review recent commits and diffs&lt;/li&gt;
&lt;li&gt;Understand the project&apos;s change history&lt;/li&gt;
&lt;li&gt;Create branches or commits (with your approval)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;API Integration:&lt;/strong&gt;
Connect MCP servers for services your workflow depends on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Jira or Linear for project management context&lt;/li&gt;
&lt;li&gt;Notion or Confluence for internal documentation&lt;/li&gt;
&lt;li&gt;Slack for team communication context&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Setting Up MCP Servers&lt;/h3&gt;
&lt;p&gt;MCP servers are configured in Claude Desktop&apos;s settings as JSON:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;filesystem&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@anthropic/mcp-server-filesystem&amp;quot;, &amp;quot;/path/to/project&amp;quot;]
    },
    &amp;quot;postgres&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@anthropic/mcp-server-postgres&amp;quot;],
      &amp;quot;env&amp;quot;: {
        &amp;quot;DATABASE_URL&amp;quot;: &amp;quot;postgresql://user:pass@localhost:5432/mydb&amp;quot;
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When to Use MCP&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Use MCP when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your task requires data that is not (and should not be) in the conversation or project files&lt;/li&gt;
&lt;li&gt;You need Claude to interact with live systems (databases, APIs, file systems)&lt;/li&gt;
&lt;li&gt;You want Claude to verify its work against real systems&lt;/li&gt;
&lt;li&gt;The data changes frequently and uploading snapshots is impractical&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Do not use MCP when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The task is self-contained (writing, brainstorming, planning)&lt;/li&gt;
&lt;li&gt;You can provide the needed context by pasting or uploading files&lt;/li&gt;
&lt;li&gt;You are working with sensitive production systems (connect to dev/staging only)&lt;/li&gt;
&lt;li&gt;The MCP server adds latency that slows your workflow&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;MCP Security Considerations&lt;/h3&gt;
&lt;p&gt;MCP servers run locally and can access real systems. Best practices:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Only connect to development or staging environments, never production&lt;/li&gt;
&lt;li&gt;Use read-only database credentials when possible&lt;/li&gt;
&lt;li&gt;Limit filesystem access to specific directories using the server&apos;s configuration&lt;/li&gt;
&lt;li&gt;Review Claude&apos;s MCP calls before approving actions that modify data&lt;/li&gt;
&lt;li&gt;Use environment variables for credentials rather than hardcoding them in configuration&lt;/li&gt;
&lt;li&gt;Audit your MCP server configurations periodically to remove servers you no longer use&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Choosing the Right MCP Servers&lt;/h3&gt;
&lt;p&gt;Not every project needs every MCP server. Start with the minimum set and add more as your workflow demands:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solo developers:&lt;/strong&gt; Filesystem + database (if applicable)
&lt;strong&gt;Frontend developers:&lt;/strong&gt; Filesystem + browser automation (Playwright)
&lt;strong&gt;Backend developers:&lt;/strong&gt; Filesystem + database + API testing
&lt;strong&gt;Full-stack teams:&lt;/strong&gt; Filesystem + database + Git + project management&lt;/p&gt;
&lt;p&gt;Adding servers you do not actively use wastes Claude&apos;s attention. Each connected server expands the list of available tools Claude must evaluate for every request.&lt;/p&gt;
&lt;h2&gt;Computer Use: Desktop-Level Interaction&lt;/h2&gt;
&lt;p&gt;Computer Use (currently in beta) allows Claude to interact with your desktop environment by capturing screenshots, controlling the mouse, and providing keyboard input. This enables Claude to use applications that do not have APIs or MCP servers.&lt;/p&gt;
&lt;h3&gt;When Computer Use Helps with Context&lt;/h3&gt;
&lt;p&gt;Computer Use is a context-gathering tool in addition to being an interaction tool. Sometimes the easiest way to give Claude context is to let it look at what you are looking at:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GUI applications:&lt;/strong&gt; Show Claude your IDE, database tools, or monitoring dashboards&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Web applications:&lt;/strong&gt; Let Claude navigate internal tools that require authentication&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Design tools:&lt;/strong&gt; Have Claude reference designs in Figma or Sketch directly&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spreadsheets:&lt;/strong&gt; Let Claude read complex Excel layouts that do not convert cleanly to CSV&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Practical Workflow&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Ask Claude to take a screenshot of the current screen&lt;/li&gt;
&lt;li&gt;Claude analyzes the visual context and incorporates it into the conversation&lt;/li&gt;
&lt;li&gt;You can direct Claude to interact with specific UI elements&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is particularly useful when the relevant context is in a visual format that is difficult to describe in text.&lt;/p&gt;
&lt;h3&gt;Computer Use Limitations&lt;/h3&gt;
&lt;p&gt;Computer Use is slower than MCP-based interactions because it relies on visual processing rather than structured data exchange. Use it as a fallback for tools that lack MCP servers or APIs, not as your primary context mechanism. For anything that can be done through MCP (database queries, file access, API calls), MCP is faster and more reliable.&lt;/p&gt;
&lt;h2&gt;Local File Access: Eliminating the Upload Cycle&lt;/h2&gt;
&lt;p&gt;Claude Desktop can read from and write to your local file system directly (via MCP filesystem server), eliminating the need to manually upload and download files.&lt;/p&gt;
&lt;h3&gt;Advantages Over Web Uploads&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No file size workarounds:&lt;/strong&gt; Access files of any size without upload limits&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Live files:&lt;/strong&gt; Claude reads the current version of a file, not a snapshot uploaded hours ago&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write capability:&lt;/strong&gt; Claude can save outputs directly to your file system&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Directory browsing:&lt;/strong&gt; Claude can explore project structures to understand organization&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Best Practices for Local File Access&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Scope access narrowly.&lt;/strong&gt; Point the filesystem MCP server at the specific project directory, not your home folder.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use it for exploration.&lt;/strong&gt; Let Claude browse your project structure to build understanding, then focus on specific files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Combine with Projects.&lt;/strong&gt; Use Project instructions to set context and local file access to provide the actual content. This gives Claude both the &amp;quot;how&amp;quot; (instructions) and the &amp;quot;what&amp;quot; (files).&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;External Documents: PDFs and Markdown in Claude Desktop&lt;/h2&gt;
&lt;p&gt;Claude Desktop handles external documents the same way as Claude Web: through Project knowledge files and conversation uploads. However, the addition of local file access changes the strategy.&lt;/p&gt;
&lt;h3&gt;The Hybrid Approach&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;For persistent reference material:&lt;/strong&gt; Upload to Project knowledge files (PDFs or Markdown). These are always available in every conversation within the Project.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;For working documents:&lt;/strong&gt; Access via the filesystem MCP server. This way Claude reads the live version of your files without requiring re-uploads when content changes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;For published specifications:&lt;/strong&gt; Upload PDFs to Project knowledge files. These do not change, so the snapshot approach works fine.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;For your own documentation:&lt;/strong&gt; Keep it in Markdown files on disk and access via MCP. This way both you and Claude are always working with the latest version.&lt;/p&gt;
&lt;h2&gt;Building an Effective Claude Desktop Workflow&lt;/h2&gt;
&lt;h3&gt;Step 1: Set Up Your Project&lt;/h3&gt;
&lt;p&gt;Create a Claude Desktop Project with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Project Instructions covering your role, style, constraints, and terminology&lt;/li&gt;
&lt;li&gt;Knowledge files for stable reference material (style guides, specifications, standards)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 2: Configure MCP Servers&lt;/h3&gt;
&lt;p&gt;Add MCP servers for the external systems you work with regularly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Filesystem server pointing at your project directory&lt;/li&gt;
&lt;li&gt;Database server connected to your development database (if applicable)&lt;/li&gt;
&lt;li&gt;Any service-specific MCP servers for tools you use daily&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 3: Use the Right Tool for Each Context Need&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context Need&lt;/th&gt;
&lt;th&gt;Best Approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Project conventions and style&lt;/td&gt;
&lt;td&gt;Project Instructions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stable reference documents&lt;/td&gt;
&lt;td&gt;Project Knowledge Files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Current code and config files&lt;/td&gt;
&lt;td&gt;Filesystem MCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database state and schema&lt;/td&gt;
&lt;td&gt;Database MCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visual UI or application state&lt;/td&gt;
&lt;td&gt;Computer Use&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One-off data or examples&lt;/td&gt;
&lt;td&gt;Paste in conversation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Step 4: Manage Conversation Threads&lt;/h3&gt;
&lt;p&gt;Even with MCP and local file access, conversation management matters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Start new conversations for new topics (Project context persists)&lt;/li&gt;
&lt;li&gt;Use artifacts for important outputs you want to reference later&lt;/li&gt;
&lt;li&gt;Summarize progress when starting fresh threads&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Live Debugging Pattern&lt;/h3&gt;
&lt;p&gt;When debugging an issue:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Let Claude read the relevant source code via filesystem MCP&lt;/li&gt;
&lt;li&gt;Let Claude query the database to check data state&lt;/li&gt;
&lt;li&gt;Let Claude read log files to identify error patterns&lt;/li&gt;
&lt;li&gt;Have a conversation where Claude synthesizes all of this context into a diagnosis&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This approach gives Claude real-time access to a broader context than you could reasonably paste into a conversation.&lt;/p&gt;
&lt;h3&gt;The Document Generation Pipeline&lt;/h3&gt;
&lt;p&gt;For creating documents that reference live data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Claude reads data via MCP (database stats, API responses, configuration)&lt;/li&gt;
&lt;li&gt;Claude generates the document in a conversation&lt;/li&gt;
&lt;li&gt;Claude writes the output directly to a file on disk&lt;/li&gt;
&lt;li&gt;You review and iterate&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This eliminates the copy-paste cycle between Claude and your file system.&lt;/p&gt;
&lt;h3&gt;The Research and Synthesis Pattern&lt;/h3&gt;
&lt;p&gt;For research projects spanning multiple sources:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Upload academic papers and specifications as Project knowledge files&lt;/li&gt;
&lt;li&gt;Connect a web-search MCP server for current information&lt;/li&gt;
&lt;li&gt;Use filesystem MCP to read your existing notes and drafts&lt;/li&gt;
&lt;li&gt;Claude synthesizes across all sources, referencing each by name&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Connecting production databases.&lt;/strong&gt; Always use development or staging credentials. Even read-only production access introduces risk.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Over-scoping filesystem access.&lt;/strong&gt; Do not give Claude access to your entire home directory. Point the filesystem server at the specific project folder.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Using MCP for everything.&lt;/strong&gt; If you just need Claude to reference a style guide, upload it to Project knowledge files. MCP is for live, changing data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Forgetting Project Instructions.&lt;/strong&gt; MCP and local file access do not replace the need for clear instructions. Claude still needs to know your style, constraints, and output format.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not reviewing MCP actions.&lt;/strong&gt; When Claude performs actions through MCP (writing files, running queries), review them. The protocol provides transparency, but you need to exercise your approval authority.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about context management strategies for AI tools and agentic workflows, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for Claude Web: A Complete Guide to Projects, Artifacts, and Intelligent Context</title><link>https://iceberglakehouse.com/posts/2026-03-context-claude-web/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-claude-web/</guid><description>
Claude&apos;s web interface at claude.ai combines one of the largest context windows in the industry with a structured Project system that makes it genuin...</description><pubDate>Sat, 07 Mar 2026 11:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Claude&apos;s web interface at claude.ai combines one of the largest context windows in the industry with a structured Project system that makes it genuinely useful for sustained, complex work. While many AI chat interfaces are limited to one-off conversations, Claude Web is designed for ongoing engagement where the AI accumulates understanding of your work over time. The key to unlocking that potential is managing context deliberately rather than treating each conversation as a blank slate.&lt;/p&gt;
&lt;p&gt;This guide covers every context management strategy available in Claude Web, from basic conversation techniques to advanced Project workflows that make Claude function as a persistent research and development partner.&lt;/p&gt;
&lt;h2&gt;How Claude Web Handles Context&lt;/h2&gt;
&lt;p&gt;Claude Web uses the conversation thread as its primary context unit. Every message you send, every response Claude generates, every file you upload, and every artifact Claude creates stays in the conversation&apos;s context window. Models like Claude Sonnet 4.5 and Opus 4.6 support context windows up to 1 million tokens, which means Claude can hold the equivalent of roughly 750,000 words of conversation, documents, and code in memory at once.&lt;/p&gt;
&lt;p&gt;But a large context window does not eliminate the need for context management. In fact, it makes it more important. With 1 million tokens available, it is easy to fill the window with irrelevant information that dilutes Claude&apos;s attention. The goal is not to maximize how much context you provide, but to maximize how relevant that context is.&lt;/p&gt;
&lt;h3&gt;The Context Priority Hierarchy&lt;/h3&gt;
&lt;p&gt;Claude pays the most attention to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;System instructions&lt;/strong&gt; (Project instructions)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The most recent messages&lt;/strong&gt; in the conversation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Uploaded files&lt;/strong&gt; referenced in the conversation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Earlier conversation history&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This means that if important context appeared 50 messages ago, Claude may not weight it as heavily as something you said in the last 3 messages. Understanding this hierarchy helps you decide when to re-state important constraints versus trusting that Claude still has them in context.&lt;/p&gt;
&lt;h2&gt;Thinking About the Right Level of Context&lt;/h2&gt;
&lt;h3&gt;Quick Questions (Minimal Context)&lt;/h3&gt;
&lt;p&gt;For factual questions, brainstorming, or one-off tasks, just ask. Claude&apos;s training data provides sufficient background for most general-knowledge queries. Adding unnecessary context (&amp;quot;I am a senior engineer with 15 years of experience, and I have a question about Python lists&amp;quot;) wastes tokens and does not improve the response.&lt;/p&gt;
&lt;h3&gt;Focused Work (Moderate Context)&lt;/h3&gt;
&lt;p&gt;For drafting, editing, code review, or analysis, provide the specific material Claude needs to work with. Paste the code you want reviewed, the text you want edited, or the data you want analyzed. State your requirements clearly: what format you want, what constraints apply, what style to follow.&lt;/p&gt;
&lt;h3&gt;Extended Projects (Comprehensive Context)&lt;/h3&gt;
&lt;p&gt;For ongoing work spanning multiple conversations, use Claude&apos;s Projects feature. Upload reference documents, set Project instructions, and let Claude maintain continuity across sessions. This is where context management becomes a genuine productivity multiplier.&lt;/p&gt;
&lt;h2&gt;Projects: Claude Web&apos;s Most Powerful Context Tool&lt;/h2&gt;
&lt;p&gt;Projects create persistent workspaces that carry context across conversations. When you create a Project, you define instructions and upload knowledge files that apply to every conversation within that Project.&lt;/p&gt;
&lt;h3&gt;Setting Up a Project&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Projects&lt;/strong&gt; in the Claude sidebar&lt;/li&gt;
&lt;li&gt;Create a new Project with a descriptive name&lt;/li&gt;
&lt;li&gt;Add &lt;strong&gt;Project Instructions&lt;/strong&gt;: Custom system-level instructions that Claude follows in every conversation within this Project&lt;/li&gt;
&lt;li&gt;Upload &lt;strong&gt;Knowledge Files&lt;/strong&gt;: Documents that Claude can reference across all conversations in the Project&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Project Instructions&lt;/h3&gt;
&lt;p&gt;Project instructions function as a system prompt that persists across every conversation in the Project. This is the most important piece of context you configure, because it shapes every response Claude gives.&lt;/p&gt;
&lt;p&gt;Effective Project Instructions include:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Project: Data Pipeline Documentation

## Your Role

You are a technical writer helping document a real-time data pipeline
built with Apache Kafka, Apache Flink, and Apache Iceberg.

## Audience

The documentation is for data engineers with 2-5 years of experience
who are familiar with batch ETL but new to stream processing.

## Style Requirements

- Use active voice
- Include code examples in Python and SQL
- Explain concepts before showing implementation
- Each section should be self-contained (readers may jump between sections)

## Terminology

- Use &amp;quot;data pipeline&amp;quot; not &amp;quot;ETL pipeline&amp;quot; or &amp;quot;data flow&amp;quot;
- Use &amp;quot;event&amp;quot; not &amp;quot;message&amp;quot; when referring to Kafka records
- Use &amp;quot;table&amp;quot; not &amp;quot;dataset&amp;quot; when referencing Iceberg tables

## Output Format

- Use H2 for section headers, H3 for subsections
- Include a &amp;quot;Key Takeaways&amp;quot; box at the end of each section
- Code blocks should include language identifiers
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Knowledge Files&lt;/h3&gt;
&lt;p&gt;You can upload various file types as project knowledge:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File Type&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PDF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Research papers, specs, published docs&lt;/td&gt;
&lt;td&gt;Claude extracts text; complex layouts may lose formatting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Markdown&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Style guides, outlines, structured notes&lt;/td&gt;
&lt;td&gt;Cleanest parsing, best for AI consumption&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Text&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Code files, logs, configuration&lt;/td&gt;
&lt;td&gt;Direct text ingestion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CSV&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data samples, reference tables&lt;/td&gt;
&lt;td&gt;Claude can analyze and query the data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Images&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Diagrams, screenshots, mockups&lt;/td&gt;
&lt;td&gt;Claude can describe and reference visual content&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;When to Use PDFs vs. Markdown&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Use PDFs when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You have published documents that already exist in PDF format&lt;/li&gt;
&lt;li&gt;The document includes complex tables, figures, or formatting that matters&lt;/li&gt;
&lt;li&gt;You do not want to spend time converting the document&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use Markdown when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You are creating a context document specifically for Claude&lt;/li&gt;
&lt;li&gt;You want maximum parsing accuracy (no PDF extraction artifacts)&lt;/li&gt;
&lt;li&gt;The document will be updated frequently&lt;/li&gt;
&lt;li&gt;You care about precise structure (headings, code blocks, lists)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Markdown is the better choice when you have the option. PDF extraction can introduce artifacts: garbled tables, merged paragraphs, lost code formatting. If accuracy matters, convert your reference documents to Markdown.&lt;/p&gt;
&lt;h3&gt;Managing Knowledge Files Effectively&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name files descriptively.&lt;/strong&gt; &amp;quot;api-reference-v3.md&amp;quot; is better than &amp;quot;document.pdf&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Add a summary at the top of each file.&lt;/strong&gt; Claude can navigate large files more effectively when they start with an overview.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Keep files focused.&lt;/strong&gt; Five 20-page documents work better than one 100-page document, because Claude can identify which file is relevant to a specific question.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Remove outdated files.&lt;/strong&gt; Stale information in your knowledge base leads to stale responses.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Artifacts: Context That Claude Creates&lt;/h2&gt;
&lt;p&gt;Artifacts are a distinct Claude Web feature where Claude creates standalone documents, code files, diagrams, or interactive components during a conversation. Unlike regular responses, artifacts persist as discrete objects that you can reference, edit, and reuse.&lt;/p&gt;
&lt;h3&gt;How Artifacts Enhance Context Management&lt;/h3&gt;
&lt;p&gt;Artifacts serve as shared reference points between you and Claude. When Claude creates a code artifact, for example, both of you can reference it by name in subsequent messages. This is more efficient than scrolling through conversation history to find the relevant code block.&lt;/p&gt;
&lt;p&gt;Common artifact types:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Code files:&lt;/strong&gt; Complete, runnable code that Claude creates and iterates on&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documents:&lt;/strong&gt; Formatted text (reports, drafts, plans) that can be edited in place&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Diagrams:&lt;/strong&gt; Mermaid or SVG diagrams that visualize architectures or workflows&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Interactive components:&lt;/strong&gt; React components that render in the browser&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Using Artifacts for Context Persistence&lt;/h3&gt;
&lt;p&gt;When working on a complex deliverable, ask Claude to create artifacts for each component. This keeps the working documents visible and accessible without being buried in conversation history. You can then reference specific artifacts (&amp;quot;Update the database schema artifact to include the new user_preferences table&amp;quot;) rather than re-describing what you need.&lt;/p&gt;
&lt;h2&gt;MCP Server Support on Claude Web&lt;/h2&gt;
&lt;p&gt;Claude Web supports MCP (Model Context Protocol) through remote MCP servers. This allows the web interface to connect to external tools and data sources without requiring a local desktop application.&lt;/p&gt;
&lt;h3&gt;How MCP Works on Claude Web&lt;/h3&gt;
&lt;p&gt;To connect a remote MCP server:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Settings &amp;gt; Connectors&lt;/strong&gt; in the Claude web interface&lt;/li&gt;
&lt;li&gt;Add a custom connector by providing the remote MCP server&apos;s URL&lt;/li&gt;
&lt;li&gt;The MCP server&apos;s tools become available within your conversations&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Claude Web supports remote MCP servers across all plan tiers (Free, Pro, Max, Team, Enterprise), though free users may have limitations on the number of connections.&lt;/p&gt;
&lt;h3&gt;What MCP Enables on Claude Web&lt;/h3&gt;
&lt;p&gt;With MCP connectors, Claude Web can interact with external services directly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Productivity tools:&lt;/strong&gt; Google Drive, Slack, Asana, monday.com&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Developer tools:&lt;/strong&gt; GitHub, Sentry, Linear&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Creative tools:&lt;/strong&gt; Canva, Figma&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom APIs:&lt;/strong&gt; Any service exposed through a remote MCP server&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;MCP Apps&lt;/h3&gt;
&lt;p&gt;Claude Web also supports MCP Apps, an extension of the protocol that allows MCP servers to provide interactive user interfaces directly within the Claude interface. This means tools connected via MCP can render visual components (dashboards, project boards, design canvases) inside your Claude conversation, reducing the need to switch between applications.&lt;/p&gt;
&lt;h3&gt;Claude Web vs. Claude Desktop MCP&lt;/h3&gt;
&lt;p&gt;Claude Web connects to &lt;strong&gt;remote&lt;/strong&gt; MCP servers (cloud-hosted, accessed via URL). Claude Desktop supports both remote and &lt;strong&gt;local&lt;/strong&gt; MCP servers (processes running on your machine via STDIO). If you need to connect to local databases, local file systems, or services that are not exposed to the internet, use Claude Desktop. For cloud-hosted services and APIs, Claude Web&apos;s remote MCP support is sufficient.&lt;/p&gt;
&lt;h2&gt;Structuring Context for Maximum Impact&lt;/h2&gt;
&lt;h3&gt;The Briefing Pattern&lt;/h3&gt;
&lt;p&gt;At the start of a new conversation within a Project, briefly re-state the current focus:&lt;/p&gt;
&lt;p&gt;&amp;quot;We are working on Chapter 3 of the documentation, covering Flink job deployment. The outline is in the project files. I want to draft the section on checkpoint configuration.&amp;quot;&lt;/p&gt;
&lt;p&gt;This grounds Claude immediately without requiring it to search through the full conversation history or project files.&lt;/p&gt;
&lt;h3&gt;The Explicit Reference Pattern&lt;/h3&gt;
&lt;p&gt;When you want Claude to use specific information from your project files, reference them directly:&lt;/p&gt;
&lt;p&gt;&amp;quot;Based on the API reference document I uploaded, write example code that demonstrates the batch ingestion endpoint. Follow the code style shown in the style guide document.&amp;quot;&lt;/p&gt;
&lt;p&gt;Explicit references help Claude prioritize the right source material rather than relying on its general knowledge.&lt;/p&gt;
&lt;h3&gt;The Iterative Refinement Pattern&lt;/h3&gt;
&lt;p&gt;For complex outputs, work in stages:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Outline first:&lt;/strong&gt; &amp;quot;Create an outline for this section covering X, Y, and Z&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Draft section by section:&lt;/strong&gt; &amp;quot;Write the first section based on the outline&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Review and refine:&lt;/strong&gt; &amp;quot;The technical content is good but the tone is too formal. Make it conversational.&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency check:&lt;/strong&gt; &amp;quot;Review the full draft for consistency in terminology and style&amp;quot;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each stage keeps Claude&apos;s focus narrow, which produces better results than asking for a complete deliverable in one shot.&lt;/p&gt;
&lt;h3&gt;Managing Long Conversations&lt;/h3&gt;
&lt;p&gt;Even with a 1-million-token context window, very long conversations can degrade quality. When a conversation starts feeling unfocused:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Start a new conversation&lt;/strong&gt; within the same Project (your files and instructions carry over)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Summarize progress&lt;/strong&gt; at the start of the new conversation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Create artifacts&lt;/strong&gt; for important outputs so they are easy to reference in the new thread&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Multi-Perspective Analysis&lt;/h3&gt;
&lt;p&gt;Ask Claude to analyze a problem from multiple angles in a single conversation:&lt;/p&gt;
&lt;p&gt;&amp;quot;First, analyze this architecture from a performance perspective. Then, analyze it from a cost perspective. Finally, analyze it from a maintainability perspective. Structure each analysis as a separate section.&amp;quot;&lt;/p&gt;
&lt;p&gt;This leverages Claude&apos;s large context window to produce comprehensive analysis while keeping the output organized.&lt;/p&gt;
&lt;h3&gt;The Living Document Workflow&lt;/h3&gt;
&lt;p&gt;Use a Project with a master document artifact that Claude updates throughout the engagement:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create an initial artifact (e.g., &amp;quot;Project Plan v1&amp;quot;)&lt;/li&gt;
&lt;li&gt;As work progresses, ask Claude to update the artifact&lt;/li&gt;
&lt;li&gt;The artifact becomes a living record of the project&apos;s evolution&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is particularly effective for research, planning, and documentation work.&lt;/p&gt;
&lt;h3&gt;The Expert Panel Pattern&lt;/h3&gt;
&lt;p&gt;Give Claude multiple &amp;quot;hats&amp;quot; to wear within a Project:&lt;/p&gt;
&lt;p&gt;&amp;quot;In this Project, I want you to evaluate ideas from three perspectives: (1) a cautious security engineer, (2) an enthusiastic product manager, and (3) a pragmatic senior developer. When I present an idea, respond with all three perspectives.&amp;quot;&lt;/p&gt;
&lt;p&gt;This turns a single Claude conversation into a simulated review process.&lt;/p&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not using Projects for project work.&lt;/strong&gt; If you have more than 3 conversations about the same topic, you should be using a Project. Without it, you lose continuity between sessions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Uploading too many files without organization.&lt;/strong&gt; Quality beats quantity. Upload the files Claude actually needs, name them well, and include summaries.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring Project Instructions.&lt;/strong&gt; Many users create Projects but skip the instructions. This is like hiring a consultant but never briefing them. The instructions are the single highest-impact piece of context you can provide.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not starting fresh conversations.&lt;/strong&gt; Long conversations accumulate noise. When you shift to a new subtopic, start a new conversation within the Project.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not connecting MCP servers when you need live data.&lt;/strong&gt; Claude Web supports remote MCP servers through Settings &amp;gt; Connectors. If your task requires live connections to cloud services, set up the relevant MCP connectors. For local services not exposed to the internet, use Claude Desktop instead.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about context management strategies for AI tools and agentic workflows, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for OpenAI Codex: A Complete Guide Across Browser, CLI, and App</title><link>https://iceberglakehouse.com/posts/2026-03-context-openai-codex/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-openai-codex/</guid><description>
OpenAI Codex is not a chatbot. It is an autonomous software engineering agent that runs tasks in isolated cloud sandboxes, operates across a browser ...</description><pubDate>Sat, 07 Mar 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;OpenAI Codex is not a chatbot. It is an autonomous software engineering agent that runs tasks in isolated cloud sandboxes, operates across a browser interface, a command-line tool, and a dedicated macOS app, and can work on multiple tasks in parallel. Because of this architecture, context management in Codex works fundamentally differently from ChatGPT or traditional coding assistants. Instead of conversational context windows, you manage context through persistent configuration files, skill definitions, and project-level instructions that shape how the agent approaches your codebase.&lt;/p&gt;
&lt;p&gt;This guide covers every context management mechanism Codex provides, explains when to use each one, and walks through practical strategies for getting the agent to produce reliable, project-aligned results across all three interfaces.&lt;/p&gt;
&lt;h2&gt;Understanding How Codex Handles Context&lt;/h2&gt;
&lt;p&gt;Codex operates with a large context window (approximately 192,000 tokens), which means it can reason about substantial portions of a codebase in a single task. But context in Codex is not just conversation history. The agent assembles its context dynamically from multiple sources:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Your repository:&lt;/strong&gt; Codex clones your repo into a sandboxed environment for each task&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AGENTS.md files:&lt;/strong&gt; Persistent instructions that live in your repository&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Skills:&lt;/strong&gt; Reusable bundles of instructions, templates, and scripts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Task prompt:&lt;/strong&gt; Your natural language description of what to do&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Previous interactions:&lt;/strong&gt; In the desktop app, persistent project memory carries context across sessions&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The key insight is that most of Codex&apos;s context comes from your repository itself, not from conversational back-and-forth. This makes context management a matter of preparing your repo and configuration files rather than crafting perfect prompts.&lt;/p&gt;
&lt;h2&gt;Thinking About the Right Level of Context&lt;/h2&gt;
&lt;h3&gt;Minimal Context (Quick Tasks)&lt;/h3&gt;
&lt;p&gt;For simple, self-contained tasks like &amp;quot;add input validation to this function&amp;quot; or &amp;quot;write unit tests for utils.py,&amp;quot; the task prompt and the codebase itself provide sufficient context. Codex will explore the relevant files, understand the patterns, and produce targeted changes. You do not need to provide extensive background.&lt;/p&gt;
&lt;h3&gt;Moderate Context (Targeted Changes)&lt;/h3&gt;
&lt;p&gt;For tasks that require understanding project conventions, architectural decisions, or specific technical requirements, provide that context in your AGENTS.md file or in the task prompt. For example: &amp;quot;Refactor the authentication module to use JWT instead of session cookies. Our API follows REST conventions and uses Express 5 middleware patterns.&amp;quot;&lt;/p&gt;
&lt;h3&gt;Comprehensive Context (Large Features or Ongoing Work)&lt;/h3&gt;
&lt;p&gt;For multi-step features, large refactors, or ongoing development work, invest in Skills and detailed AGENTS.md files. These provide the agent with your coding standards, architectural patterns, testing requirements, and deployment constraints. The desktop app&apos;s persistent project memory also helps here by retaining context across sessions.&lt;/p&gt;
&lt;h2&gt;AGENTS.md: The Foundation of Codex Context&lt;/h2&gt;
&lt;p&gt;AGENTS.md is the most important context management tool for Codex. It is a Markdown file that lives in your repository and provides persistent instructions to the agent. Codex reads AGENTS.md at the beginning of every task.&lt;/p&gt;
&lt;h3&gt;How It Works&lt;/h3&gt;
&lt;p&gt;Place an &lt;code&gt;AGENTS.md&lt;/code&gt; file at the root of your repository. Codex loads it automatically before starting any task. Think of it as a briefing document that tells the agent everything it needs to know about your project.&lt;/p&gt;
&lt;h3&gt;What to Include&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# AGENTS.md

## Project Overview

This is a Next.js 15 application with a Python FastAPI backend.
The frontend uses TypeScript, Tailwind CSS, and Zustand for state management.
The backend uses SQLAlchemy with PostgreSQL.

## Coding Standards

- Use functional components with hooks (no class components)
- All API endpoints must include input validation using Pydantic
- Write tests for every new function using pytest (backend) and Vitest (frontend)
- Use conventional commit messages: feat:, fix:, refactor:, docs:, test:

## Architecture

- Frontend routes are in src/app/ (App Router)
- API routes are in backend/api/routes/
- Database models are in backend/models/
- Shared types are in shared/types/

## Constraints

- Do not modify the database schema without explicit approval
- Do not add new dependencies without noting them in the PR description
- All environment variables must be documented in .env.example
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Hierarchical AGENTS.md Files&lt;/h3&gt;
&lt;p&gt;For monorepos or large projects, you can place AGENTS.md files at different levels:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Root level:&lt;/strong&gt; Global project instructions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Service directories:&lt;/strong&gt; Service-specific conventions (e.g., &lt;code&gt;backend/AGENTS.md&lt;/code&gt;, &lt;code&gt;frontend/AGENTS.md&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Global:&lt;/strong&gt; &lt;code&gt;~/.codex/AGENTS.md&lt;/code&gt; for personal preferences that apply across all projects&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;More specific files supplement (not replace) more general ones. The agent combines all applicable AGENTS.md files when executing a task.&lt;/p&gt;
&lt;h3&gt;Best Practices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Keep it updated. Stale AGENTS.md instructions lead to stale agent behavior.&lt;/li&gt;
&lt;li&gt;Be specific about constraints. &amp;quot;Follow best practices&amp;quot; is meaningless to an agent. &amp;quot;All database queries must use parameterized statements, never string interpolation&amp;quot; is actionable.&lt;/li&gt;
&lt;li&gt;Include examples of your code style. Show the agent what &amp;quot;good&amp;quot; looks like in your codebase.&lt;/li&gt;
&lt;li&gt;Document your testing strategy. Tell the agent which test framework to use, where tests live, and what coverage expectations you have.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Skills: Reusable Workflow Bundles&lt;/h2&gt;
&lt;p&gt;Skills are a step beyond AGENTS.md. They are reusable bundles that package instructions, code templates, API configurations, and scripts into a single invocable unit. Skills let you codify complex workflows so the agent can execute them reliably.&lt;/p&gt;
&lt;h3&gt;When to Use Skills&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;You have a repeatable workflow (deploying to staging, onboarding a new API endpoint, migrating a database)&lt;/li&gt;
&lt;li&gt;The workflow requires multiple steps that need to happen in a specific order&lt;/li&gt;
&lt;li&gt;You want consistency across team members using Codex&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Creating a Skill&lt;/h3&gt;
&lt;p&gt;Skills are defined as structured folders with a manifest file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# SKILL.md

---

name: create-api-endpoint
description: Creates a new REST API endpoint with validation, tests, and documentation

---

## Steps

1. Create the route file in backend/api/routes/
2. Define the Pydantic request/response models in backend/api/schemas/
3. Implement the business logic in backend/services/
4. Write pytest tests in backend/tests/
5. Add the endpoint to the OpenAPI documentation
6. Update the API changelog

## Templates

Use the existing endpoint at backend/api/routes/users.py as the reference pattern.

## Validation

- Run pytest after creating the endpoint
- Verify the OpenAPI spec is valid
- Check that all response codes are documented
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Skills can be invoked explicitly by name or triggered automatically when the agent detects a task that matches the skill&apos;s description.&lt;/p&gt;
&lt;h2&gt;The Three Interfaces: Context Differences&lt;/h2&gt;
&lt;h3&gt;Browser (ChatGPT Sidebar)&lt;/h3&gt;
&lt;p&gt;The browser interface runs Codex from within the ChatGPT web application. Context management here is straightforward:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Repository:&lt;/strong&gt; Select which repo the agent works on&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Task prompt:&lt;/strong&gt; Describe what you want done&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AGENTS.md:&lt;/strong&gt; Loaded automatically from the repo&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Results:&lt;/strong&gt; The agent produces a diff or pull request for review&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This interface is best for individual tasks that you want to review before merging. Context is session-scoped; each task gets a fresh sandbox.&lt;/p&gt;
&lt;h3&gt;CLI (Command Line)&lt;/h3&gt;
&lt;p&gt;The Codex CLI (&lt;code&gt;codex&lt;/code&gt;) runs in your terminal and operates on your local codebase. It offers more control over context:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Approval modes:&lt;/strong&gt; Choose between Chat (interactive), Agent (approval for writes), and Full Access (autonomous)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MCP servers:&lt;/strong&gt; The CLI supports MCP server integration for connecting external tools&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File references:&lt;/strong&gt; Point the agent at specific files or directories&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Image inputs:&lt;/strong&gt; Pass screenshots or design mockups alongside prompts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Interactive mode:&lt;/strong&gt; Have a conversation with the agent about your codebase&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The CLI is the most flexible interface for context management because you can combine AGENTS.md, MCP servers, and direct file references in a single session.&lt;/p&gt;
&lt;h3&gt;Desktop App (macOS)&lt;/h3&gt;
&lt;p&gt;The desktop app is the most powerful interface for sustained work:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Persistent project memory:&lt;/strong&gt; The app retains project history and context across sessions, so you do not have to re-establish context every time&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-agent orchestration:&lt;/strong&gt; Run multiple agents on different tasks simultaneously, each in its own Git worktree&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Visual task management:&lt;/strong&gt; See all running and completed tasks in a unified interface&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Skills management:&lt;/strong&gt; Create, organize, and invoke Skills from the app&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The desktop app is best for ongoing development work where you are regularly delegating tasks to Codex throughout your day.&lt;/p&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;The Codex CLI supports the Model Context Protocol (MCP), allowing you to connect external tools and data sources to the agent.&lt;/p&gt;
&lt;h3&gt;What MCP Enables&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Database access:&lt;/strong&gt; Let the agent query your development database to understand schema and data patterns&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Browser automation:&lt;/strong&gt; Connect a Playwright MCP server so the agent can test frontend changes by interacting with a real browser&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;API integration:&lt;/strong&gt; Give the agent access to your project management tools, documentation systems, or monitoring dashboards&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom tools:&lt;/strong&gt; Build MCP servers that expose your organization&apos;s internal tools to the agent&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;When to Use MCP&lt;/h3&gt;
&lt;p&gt;MCP is most valuable when the agent needs information that is not in the repository:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Understanding runtime behavior (logs, database state, API responses)&lt;/li&gt;
&lt;li&gt;Verifying changes against a running application&lt;/li&gt;
&lt;li&gt;Accessing external specifications or documentation&lt;/li&gt;
&lt;li&gt;Interacting with CI/CD systems or deployment tools&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;When NOT to Use MCP&lt;/h3&gt;
&lt;p&gt;For tasks that are purely code-level (refactoring, writing tests, fixing type errors), MCP adds unnecessary complexity. The codebase itself provides sufficient context. Use MCP when the agent needs to interact with the world outside the code.&lt;/p&gt;
&lt;h3&gt;Configuration&lt;/h3&gt;
&lt;p&gt;MCP servers are configured through the CLI:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Add a Playwright MCP server for browser testing
codex mcp add playwright

# Add a custom database MCP server
codex mcp add my-db-server --command &amp;quot;node /path/to/db-mcp.js&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;External Documents: When to Use PDFs vs. Markdown&lt;/h2&gt;
&lt;p&gt;Codex primarily operates on code, but there are situations where providing external documents improves results.&lt;/p&gt;
&lt;h3&gt;Use Markdown When:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Writing AGENTS.md or Skills (required format)&lt;/li&gt;
&lt;li&gt;Providing architectural decision records (ADRs)&lt;/li&gt;
&lt;li&gt;Sharing coding standards or style guides&lt;/li&gt;
&lt;li&gt;Documenting API specifications&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Markdown is the native format for Codex context. It parses cleanly, supports code blocks, and is version-controllable in Git.&lt;/p&gt;
&lt;h3&gt;Use PDFs When:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Referencing published specifications (RFC documents, protocol specs)&lt;/li&gt;
&lt;li&gt;Sharing design documents with diagrams that do not translate well to Markdown&lt;/li&gt;
&lt;li&gt;Providing compliance or regulatory requirements that exist in PDF form&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In practice, Markdown is almost always the better choice for Codex. If you have a PDF specification, consider extracting the relevant sections into a Markdown file in your repository.&lt;/p&gt;
&lt;h2&gt;Automations: Scheduled Context Processing&lt;/h2&gt;
&lt;p&gt;Codex supports Automations, which are scheduled tasks that run in the background. These allow you to set up recurring agent work that automatically processes your codebase with predefined context.&lt;/p&gt;
&lt;h3&gt;Use Cases&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Daily code reviews:&lt;/strong&gt; Schedule the agent to review new PRs every morning&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dependency audits:&lt;/strong&gt; Weekly check for outdated or vulnerable dependencies&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documentation updates:&lt;/strong&gt; Automatically update API documentation after code changes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test maintenance:&lt;/strong&gt; Periodically scan for broken or flaky tests&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Automations use the same AGENTS.md and Skills context as manual tasks, ensuring consistency between scheduled and ad-hoc work.&lt;/p&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Context Layering Strategy&lt;/h3&gt;
&lt;p&gt;Combine multiple context sources for complex tasks:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Global AGENTS.md&lt;/strong&gt; (in &lt;code&gt;~/.codex/&lt;/code&gt;): Personal preferences and universal standards&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Project AGENTS.md&lt;/strong&gt; (in repo root): Project architecture and conventions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Directory AGENTS.md&lt;/strong&gt; (in subdirectories): Component-specific patterns&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Skills:&lt;/strong&gt; Repeatable workflows for common tasks&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Task prompt:&lt;/strong&gt; The specific thing you want done now&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MCP servers:&lt;/strong&gt; Live external data for verification&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each layer adds specificity without overriding the layers above it.&lt;/p&gt;
&lt;h3&gt;The Multi-Agent Pattern&lt;/h3&gt;
&lt;p&gt;Use the desktop app to run parallel agents on different aspects of a feature:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Agent 1: Implements the backend API endpoint&lt;/li&gt;
&lt;li&gt;Agent 2: Writes the frontend component&lt;/li&gt;
&lt;li&gt;Agent 3: Creates integration tests&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each agent runs in its own Git worktree, so their changes do not conflict. Review and merge the results when all agents complete.&lt;/p&gt;
&lt;h3&gt;The Exploration-First Pattern&lt;/h3&gt;
&lt;p&gt;Before giving Codex a complex task, use a &amp;quot;planning&amp;quot; prompt:&lt;/p&gt;
&lt;p&gt;&amp;quot;Analyze the authentication module in backend/auth/. Describe the current architecture, identify potential issues, and suggest improvements. Do not make any changes.&amp;quot;&lt;/p&gt;
&lt;p&gt;Review the agent&apos;s analysis, then use it as context for the actual implementation task. This prevents the agent from making changes based on incomplete understanding.&lt;/p&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Skipping AGENTS.md:&lt;/strong&gt; Without AGENTS.md, the agent has no guidance on project conventions and will produce code that technically works but does not match your style.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Overly broad tasks:&lt;/strong&gt; &amp;quot;Improve the application&amp;quot; is too vague. &amp;quot;Add rate limiting to the /api/users endpoint using express-rate-limit with a 100-request-per-minute window&amp;quot; gives the agent clear parameters.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring the review step:&lt;/strong&gt; Codex produces diffs and PRs for a reason. Always review the output, especially for tasks involving security, database changes, or public-facing features.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not using Skills for repeatable work:&lt;/strong&gt; If you find yourself writing the same type of task prompt repeatedly, extract it into a Skill.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Using MCP when you do not need it:&lt;/strong&gt; Adding MCP servers increases complexity and potential failure points. Only connect external tools when the task genuinely requires external data.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about working effectively with AI coding tools, context engineering, and agentic development workflows, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for ChatGPT: A Complete Guide to Getting Better Results</title><link>https://iceberglakehouse.com/posts/2026-03-context-chatgpt/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-chatgpt/</guid><description>
Getting consistently useful results from ChatGPT requires more than writing good prompts. The real differentiator is how you manage context: the back...</description><pubDate>Sat, 07 Mar 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Getting consistently useful results from ChatGPT requires more than writing good prompts. The real differentiator is how you manage context: the background information, instructions, documents, and accumulated knowledge that shapes every response ChatGPT generates. Without deliberate context management, you end up repeating yourself, getting generic answers, and wasting time course-correcting the AI.&lt;/p&gt;
&lt;p&gt;This guide covers every context management tool ChatGPT offers in 2026, from basic custom instructions to advanced Project workflows, and explains when to use each one.&lt;/p&gt;
&lt;h2&gt;What Is Context Management and Why Does It Matter?&lt;/h2&gt;
&lt;p&gt;Context management is the practice of controlling what information an AI model has access to when generating a response. Every time you interact with ChatGPT, the model processes a &amp;quot;context window,&amp;quot; basically the sum of all text it can see at once, including your conversation history, uploaded files, system instructions, and memory. The quality of the response depends directly on how well you curate that window.&lt;/p&gt;
&lt;p&gt;Poor context management looks like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Repeating your role, preferences, and constraints in every new conversation&lt;/li&gt;
&lt;li&gt;Uploading the same reference documents over and over&lt;/li&gt;
&lt;li&gt;Getting responses that ignore your project&apos;s specific terminology or conventions&lt;/li&gt;
&lt;li&gt;Spending more time correcting the AI than doing actual work&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Good context management means ChatGPT already knows your background, has access to relevant documents, follows your preferred style, and builds on previous conversations without you manually re-establishing all of that every time.&lt;/p&gt;
&lt;h2&gt;Thinking About the Right Level of Context&lt;/h2&gt;
&lt;p&gt;Before configuring any tools, think about what level of context a given task actually needs. Not every conversation requires the same depth.&lt;/p&gt;
&lt;h3&gt;Minimal Context (Quick Questions)&lt;/h3&gt;
&lt;p&gt;For simple factual questions, brainstorming, or one-off tasks, you often need zero setup. Just ask the question. Adding unnecessary context actually dilutes the model&apos;s attention and can lead to worse responses. If you are asking &amp;quot;What is the difference between TCP and UDP?&amp;quot; you do not need to upload your network architecture docs.&lt;/p&gt;
&lt;h3&gt;Moderate Context (Focused Work)&lt;/h3&gt;
&lt;p&gt;For tasks like drafting emails, reviewing code snippets, or writing sections of a document, provide the immediately relevant information in the conversation. Paste the specific text you are working with, reference the specific style or tone you want, and state any constraints. This keeps the model focused without overwhelming it.&lt;/p&gt;
&lt;h3&gt;Comprehensive Context (Extended Projects)&lt;/h3&gt;
&lt;p&gt;For ongoing projects, research, or multi-session work, use ChatGPT&apos;s structured context tools (Projects, CustomGPTs, Memory). This is where deliberate context management pays the biggest dividends. You define the context once and it persists across every conversation in that workspace.&lt;/p&gt;
&lt;h3&gt;How to Decide&lt;/h3&gt;
&lt;p&gt;Ask yourself: &amp;quot;If I handed this task to a knowledgeable colleague, what would I need to tell them before they could start?&amp;quot; If the answer is &amp;quot;nothing, just the question,&amp;quot; use minimal context. If you would need to hand them a style guide, a codebase overview, and three reference documents, set up a Project.&lt;/p&gt;
&lt;h2&gt;Custom Instructions: Your Global Defaults&lt;/h2&gt;
&lt;p&gt;Custom Instructions are the most basic and most overlooked context management tool in ChatGPT. They apply to every conversation you have (unless you use a Project or CustomGPT with its own instructions).&lt;/p&gt;
&lt;h3&gt;How They Work&lt;/h3&gt;
&lt;p&gt;Navigate to &lt;strong&gt;Settings &amp;gt; Personalization &amp;gt; Custom Instructions&lt;/strong&gt;. You get two fields:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;About you:&lt;/strong&gt; Tell ChatGPT who you are, what you do, and what background knowledge to assume. For example: &amp;quot;I am a senior data engineer working with Apache Iceberg, Spark, and Python. I build data lakehouse architectures for financial services companies.&amp;quot;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How to respond:&lt;/strong&gt; Define your preferred output format, tone, and constraints. For example: &amp;quot;Be concise. Use code examples in Python unless I specify otherwise. Skip basic explanations of concepts I already know. Never use em dashes.&amp;quot;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Best Practices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Keep instructions specific and actionable. &amp;quot;Be helpful&amp;quot; is useless. &amp;quot;When I ask about SQL, always format queries with uppercase keywords and include comments explaining each join&amp;quot; is useful.&lt;/li&gt;
&lt;li&gt;Update them as your needs change. If you switch projects or roles, update your instructions.&lt;/li&gt;
&lt;li&gt;Use negative constraints. Telling ChatGPT what NOT to do is often more effective than listing everything it should do.&lt;/li&gt;
&lt;li&gt;Do not overload them. Custom Instructions have a character limit. Use them for universal preferences, not project-specific details.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Limitations&lt;/h3&gt;
&lt;p&gt;Custom Instructions are global. They apply everywhere unless overridden by a Project or CustomGPT. If you work across multiple domains (coding, writing, research), your instructions need to be general enough to help everywhere without being so vague they help nowhere. For domain-specific work, use Projects instead.&lt;/p&gt;
&lt;h2&gt;Memory: Persistent Knowledge Across Conversations&lt;/h2&gt;
&lt;p&gt;ChatGPT&apos;s Memory feature allows the model to remember facts, preferences, and context across conversations without you re-stating them.&lt;/p&gt;
&lt;h3&gt;How Memory Works&lt;/h3&gt;
&lt;p&gt;When enabled (Settings &amp;gt; Personalization &amp;gt; Memory), ChatGPT can save information you share during conversations. It stores these as discrete facts: &amp;quot;User prefers Python over JavaScript,&amp;quot; &amp;quot;User&apos;s company uses PostgreSQL 15,&amp;quot; &amp;quot;User is writing a book about data engineering.&amp;quot;&lt;/p&gt;
&lt;p&gt;You can explicitly tell ChatGPT to remember things: &amp;quot;Remember that my team uses the Google style guide for Python.&amp;quot; You can also ask it what it remembers (&amp;quot;What do you know about me?&amp;quot;) and delete specific memories or clear them all.&lt;/p&gt;
&lt;h3&gt;When to Use Memory&lt;/h3&gt;
&lt;p&gt;Memory is best for facts that apply broadly across conversations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your technical stack and preferences&lt;/li&gt;
&lt;li&gt;Your role and expertise level&lt;/li&gt;
&lt;li&gt;Recurring project names or team members&lt;/li&gt;
&lt;li&gt;Style preferences that should persist everywhere&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;When NOT to Use Memory&lt;/h3&gt;
&lt;p&gt;Memory is not a substitute for Projects or file uploads. It stores brief facts, not documents or complex context. Do not try to make ChatGPT &amp;quot;memorize&amp;quot; an entire API specification through Memory. Use file uploads for that.&lt;/p&gt;
&lt;h3&gt;Temporary Chats&lt;/h3&gt;
&lt;p&gt;If you want a conversation without Memory recall (for example, helping someone else with their problem or exploring a sensitive topic), use &lt;strong&gt;Temporary Chat&lt;/strong&gt;. This creates a blank-slate conversation that does not read from or write to Memory.&lt;/p&gt;
&lt;h2&gt;Projects: Dedicated Workspaces for Focused Work&lt;/h2&gt;
&lt;p&gt;Projects are ChatGPT&apos;s most powerful context management feature for sustained work. A Project is a dedicated workspace that groups related conversations, uploaded files, and custom instructions.&lt;/p&gt;
&lt;h3&gt;Setting Up a Project&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;Projects&lt;/strong&gt; in the sidebar&lt;/li&gt;
&lt;li&gt;Create a new Project with a descriptive name&lt;/li&gt;
&lt;li&gt;Add &lt;strong&gt;Project Instructions&lt;/strong&gt;: These override or supplement your global Custom Instructions for every conversation within this Project&lt;/li&gt;
&lt;li&gt;Upload &lt;strong&gt;files&lt;/strong&gt;: Up to 20 files per Project (PDFs, CSVs, images, text files). ChatGPT can reference these across all conversations in the Project.&lt;/li&gt;
&lt;li&gt;Start conversations within the Project&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Project Instructions vs. Custom Instructions&lt;/h3&gt;
&lt;p&gt;Project Instructions are scoped to the Project. They are the right place for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Project-specific terminology and conventions&lt;/li&gt;
&lt;li&gt;The structure or outline of what you are building&lt;/li&gt;
&lt;li&gt;Style guides or formatting requirements specific to this work&lt;/li&gt;
&lt;li&gt;Background context about the domain&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Think of Custom Instructions as your personal defaults and Project Instructions as the briefing document for a specific engagement.&lt;/p&gt;
&lt;h3&gt;File Management in Projects&lt;/h3&gt;
&lt;p&gt;You can upload various file types to a Project&apos;s knowledge base:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File Type&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PDF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reference documentation, research papers, specifications&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CSV/Excel&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data samples, structured reference data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Text/Markdown&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Style guides, code snippets, outlines, notes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Images&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Diagrams, mockups, screenshots for visual context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;When to Use PDFs vs. Markdown&lt;/h3&gt;
&lt;p&gt;This is a practical question that matters more than most people realize.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use PDFs when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The document is a published specification, whitepaper, or research paper&lt;/li&gt;
&lt;li&gt;Layout and formatting matter (tables, figures, page references)&lt;/li&gt;
&lt;li&gt;You have the document in PDF form and do not want to convert it&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use Markdown when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You are creating context documents specifically for ChatGPT&lt;/li&gt;
&lt;li&gt;You want the AI to parse the content with maximum accuracy&lt;/li&gt;
&lt;li&gt;The content is structured text (code standards, API docs, outlines)&lt;/li&gt;
&lt;li&gt;You plan to update the document frequently&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Markdown is generally a better format for AI consumption. The structure is unambiguous, there are no encoding issues from PDF extraction, and the content is more reliably parsed. If you are creating a reference document from scratch to guide ChatGPT, write it in Markdown.&lt;/p&gt;
&lt;h3&gt;Project Sharing&lt;/h3&gt;
&lt;p&gt;Projects can be shared with other ChatGPT users. When you share a Project, collaborators get access to the uploaded files, Project Instructions, and conversation history. This makes Projects useful for team workflows where multiple people need the AI to have the same context.&lt;/p&gt;
&lt;h2&gt;CustomGPTs: Specialized Assistants for Repeatable Tasks&lt;/h2&gt;
&lt;p&gt;CustomGPTs let you create purpose-built AI assistants with specific instructions, knowledge bases, and capabilities. They are the right tool when you have a repeatable workflow that requires specialized context.&lt;/p&gt;
&lt;h3&gt;When to Use a CustomGPT vs. a Project&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;CustomGPT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extended work on a specific project&lt;/td&gt;
&lt;td&gt;Repeatable tasks across different projects&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One body of work&lt;/td&gt;
&lt;td&gt;One type of task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Shareable&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (collaborators)&lt;/td&gt;
&lt;td&gt;Yes (public or private)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Custom actions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (API integrations)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;quot;Q3 Marketing Campaign&amp;quot;&lt;/td&gt;
&lt;td&gt;&amp;quot;Technical Blog Editor&amp;quot;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;A CustomGPT is like hiring a specialist. A Project is like setting up a war room for a specific mission.&lt;/p&gt;
&lt;h3&gt;Building an Effective CustomGPT&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Instructions:&lt;/strong&gt; Write detailed behavioral instructions. Include the role, tone, output format, and constraints. Be as specific as your best Custom Instructions, but scoped to this GPT&apos;s purpose.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Knowledge files:&lt;/strong&gt; Upload reference documents that the GPT should always have access to. These function like Project files but are permanently attached to the GPT.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Actions:&lt;/strong&gt; Connect external APIs so the GPT can fetch real-time data, submit forms, or interact with your tools.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Knowledge Base Best Practices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Name your files descriptively. &amp;quot;company-style-guide-2026.md&amp;quot; is better than &amp;quot;doc1.pdf.&amp;quot;&lt;/li&gt;
&lt;li&gt;Include a table of contents or summary at the top of large documents. This helps ChatGPT navigate the content.&lt;/li&gt;
&lt;li&gt;Keep individual files focused. Ten small, focused files work better than one 200-page PDF.&lt;/li&gt;
&lt;li&gt;Test your GPT after uploading. Ask questions that require it to reference specific sections of your documents to verify it is parsing them correctly.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;ChatGPT added support for the Model Context Protocol (MCP) in September 2025 through a &amp;quot;Developer Mode&amp;quot; feature. This is available to paying users on Plus, Pro, Team, Enterprise, and Education plans.&lt;/p&gt;
&lt;h3&gt;How MCP Works in ChatGPT&lt;/h3&gt;
&lt;p&gt;With Developer Mode enabled, ChatGPT can connect to MCP servers that expose external tools and data sources. This means ChatGPT can interact with services like Jira, Google Calendar, databases, and custom APIs directly from the chat interface. MCP connections are configured through the ChatGPT settings under Developer Mode, where you specify the MCP server endpoints.&lt;/p&gt;
&lt;h3&gt;What MCP Enables&lt;/h3&gt;
&lt;p&gt;MCP in ChatGPT goes beyond read-only data access. It supports both read and write operations, meaning ChatGPT can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fetch data from external systems (database queries, API lookups)&lt;/li&gt;
&lt;li&gt;Update external systems (create tickets, send messages, update records)&lt;/li&gt;
&lt;li&gt;Interact with local files and applications when using the desktop app&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;MCP vs. CustomGPT Actions&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;MCP Servers&lt;/th&gt;
&lt;th&gt;CustomGPT Actions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Protocol&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Standardized (MCP)&lt;/td&gt;
&lt;td&gt;Custom API definitions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Configure via Developer Mode&lt;/td&gt;
&lt;td&gt;Build into a CustomGPT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Portability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Works across MCP-compatible tools&lt;/td&gt;
&lt;td&gt;ChatGPT only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Operations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Read and write&lt;/td&gt;
&lt;td&gt;Read and write&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;MCP servers offer the advantage of portability: the same MCP server you use with ChatGPT works with Claude Desktop, Cursor, and other MCP-compatible tools. CustomGPT Actions are ChatGPT-specific but offer tighter integration within the CustomGPT workflow.&lt;/p&gt;
&lt;h3&gt;Security Considerations&lt;/h3&gt;
&lt;p&gt;OpenAI has cautioned that using Developer Mode with write operations is powerful but carries risk. Always test MCP server connections carefully, especially for servers that can modify external systems. Be aware of potential prompt injection risks when connecting to untrusted data sources.&lt;/p&gt;
&lt;h2&gt;Structuring Context for Maximum Effectiveness&lt;/h2&gt;
&lt;p&gt;Beyond the tools themselves, how you structure the information you give ChatGPT matters significantly.&lt;/p&gt;
&lt;h3&gt;The Inverted Pyramid&lt;/h3&gt;
&lt;p&gt;Put the most important context first. ChatGPT pays more attention to the beginning and end of its context window. Structure your information like a news article:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Lead:&lt;/strong&gt; The task, constraint, and desired output format&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Body:&lt;/strong&gt; Supporting details, reference material, examples&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Background:&lt;/strong&gt; Nice-to-have context that might help but is not critical&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Be Explicit About What You Want&lt;/h3&gt;
&lt;p&gt;Vague requests get vague results. Compare:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Vague:&lt;/strong&gt; &amp;quot;Help me with my database.&amp;quot;
&lt;strong&gt;Specific:&lt;/strong&gt; &amp;quot;Review this PostgreSQL query for performance issues. The table has 50 million rows, is partitioned by date, and has indexes on customer_id and order_date. Suggest index changes or query rewrites that would reduce execution time.&amp;quot;&lt;/p&gt;
&lt;p&gt;The specific version gives ChatGPT enough context to provide actionable advice. The vague version will produce a generic tutorial.&lt;/p&gt;
&lt;h3&gt;Use Reference Examples&lt;/h3&gt;
&lt;p&gt;When you want a specific output format or style, give ChatGPT an example. &amp;quot;Write a commit message in this style: [example]&amp;quot; is far more effective than describing the style in abstract terms. Examples are compressed context. One good example communicates more than a paragraph of description.&lt;/p&gt;
&lt;h3&gt;Manage Conversation Length&lt;/h3&gt;
&lt;p&gt;Long conversations degrade response quality. As the conversation history grows, ChatGPT has less room in its context window for your actual question and the reasoning needed to answer it. For extended work:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Start new conversations for new topics, even within the same Project&lt;/li&gt;
&lt;li&gt;Summarize progress before starting a new conversation (&amp;quot;Here is where we left off: [summary]&amp;quot;)&lt;/li&gt;
&lt;li&gt;Use Projects so you do not lose the files and instructions when you start fresh&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Briefing Document Pattern&lt;/h3&gt;
&lt;p&gt;Create a Markdown file that serves as a comprehensive briefing for ChatGPT. Include:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Project: [Name]

## Overview

[2-3 sentence summary of what this project is]

## Goals

- [Specific goal 1]
- [Specific goal 2]

## Constraints

- [Technical constraints]
- [Style/format constraints]

## Key Terminology

- **Term 1:** Definition specific to this project
- **Term 2:** Definition specific to this project

## Current Status

[Where the project stands right now]

## What I Need Help With

[Specific areas where ChatGPT should focus]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Upload this as a Project file or paste it at the start of key conversations. It gives ChatGPT a structured, scannable overview that dramatically improves response relevance.&lt;/p&gt;
&lt;h3&gt;The Iterative Refinement Loop&lt;/h3&gt;
&lt;p&gt;For complex outputs (long documents, code architectures, research reports):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Start with a high-level outline and get ChatGPT&apos;s feedback&lt;/li&gt;
&lt;li&gt;Refine the outline based on the feedback&lt;/li&gt;
&lt;li&gt;Generate content section by section, reviewing each before moving on&lt;/li&gt;
&lt;li&gt;Use follow-up prompts to refine specific sections&lt;/li&gt;
&lt;li&gt;Do a final consistency pass&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This approach keeps the context focused at each step rather than asking ChatGPT to hold the entire deliverable in mind at once.&lt;/p&gt;
&lt;h3&gt;Multi-GPT Workflows&lt;/h3&gt;
&lt;p&gt;For complex projects, use different CustomGPTs for different aspects of the work:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &amp;quot;Research GPT&amp;quot; with academic papers and data sources&lt;/li&gt;
&lt;li&gt;A &amp;quot;Writing GPT&amp;quot; with your style guide and brand voice instructions&lt;/li&gt;
&lt;li&gt;A &amp;quot;Code Review GPT&amp;quot; with your codebase standards and architecture docs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Feed the output of one into the next. This keeps each GPT focused on what it does best instead of trying to make one GPT do everything.&lt;/p&gt;
&lt;h2&gt;Common Mistakes to Avoid&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Overloading context:&lt;/strong&gt; More is not always better. If you upload 20 files but your question only relates to one, the AI may pull irrelevant information from the other 19. Be selective.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring Custom Instructions:&lt;/strong&gt; Many users never set them up, then wonder why ChatGPT gives generic responses. Spending 10 minutes on Custom Instructions saves hours of correction.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not using Projects for project work:&lt;/strong&gt; Having 50 disconnected conversations about the same project means ChatGPT has no persistent context. Use Projects.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Treating Memory as a database:&lt;/strong&gt; Memory stores brief facts, not documents. If you need ChatGPT to reference a 30-page specification, upload it as a file.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Never clearing context:&lt;/strong&gt; Sometimes the best thing to do is start a fresh conversation. If ChatGPT seems confused or is repeating mistakes, the conversation history may be working against you.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Recommended Workflow for New Users&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;First 10 minutes:&lt;/strong&gt; Set up Custom Instructions with your role, expertise level, and response preferences&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;First project:&lt;/strong&gt; Create a Project, upload 2-3 key reference documents, and write Project Instructions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;First week:&lt;/strong&gt; Enable Memory and let it accumulate useful facts. Review and edit memories periodically.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;First month:&lt;/strong&gt; If you find yourself doing the same type of task repeatedly, build a CustomGPT for it.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about working effectively with AI tools, including detailed context management strategies for coding, research, and professional workflows, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for T3 Chat: A Complete Guide to the Unified Multi-Model AI Interface</title><link>https://iceberglakehouse.com/posts/2026-03-context-t3-chat/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-t3-chat/</guid><description>
T3 Chat is a modern web-based AI chat interface that gives you access to multiple AI models through a single unified platform. Its primary value prop...</description><pubDate>Sat, 07 Mar 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;T3 Chat is a modern web-based AI chat interface that gives you access to multiple AI models through a single unified platform. Its primary value proposition is model flexibility: instead of being locked into one provider, you can switch between Claude, GPT, Gemini, Llama, and other models within the same interface. This makes T3 Chat unique from a context management perspective because the same context strategies must work across fundamentally different model families with different capabilities, context window sizes, and strengths.&lt;/p&gt;
&lt;p&gt;This guide covers how to manage context effectively in T3 Chat to get the most from its multi-model architecture, from conversation organization to system prompts and file handling.&lt;/p&gt;
&lt;h2&gt;How T3 Chat Manages Context&lt;/h2&gt;
&lt;p&gt;T3 Chat builds its context from several sources:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;System prompts&lt;/strong&gt; - persistent instructions that shape every response&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model selection&lt;/strong&gt; - the underlying model determines context window and capabilities&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Conversation history&lt;/strong&gt; - the message thread within the current chat&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File attachments&lt;/strong&gt; - documents and images uploaded to conversations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Personas&lt;/strong&gt; - saved configurations combining system prompts with preferred models&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Folders and organization&lt;/strong&gt; - conversation grouping for project-based workflows&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The context management challenge unique to T3 Chat is that different models interpret your context differently. A system prompt that works well with Claude may need adjustment for GPT or Gemini. Understanding these differences helps you write model-portable context.&lt;/p&gt;
&lt;h2&gt;System Prompts: The Foundation&lt;/h2&gt;
&lt;p&gt;T3 Chat supports custom system prompts that you set per-conversation or through Personas.&lt;/p&gt;
&lt;h3&gt;Writing Effective System Prompts&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;You are a senior software architect with expertise in distributed systems.

## Response Style

- Be technical and precise
- Include code examples when relevant
- Use bullet points for lists of recommendations
- Explain tradeoffs, do not just give the &amp;quot;right&amp;quot; answer

## Constraints

- Assume the reader has 5+ years of programming experience
- Do not explain basic concepts unless asked
- When discussing frameworks, focus on architectural implications, not syntax tutorials

## Output Format

- Use headers to organize long responses
- Include a &amp;quot;Key Takeaway&amp;quot; section at the end of detailed analyses
- Format code blocks with language annotations
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Model-Portable System Prompts&lt;/h3&gt;
&lt;p&gt;Because T3 Chat supports multiple models, write system prompts that work across model families:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Be explicit&lt;/strong&gt; about format expectations. Different models interpret vague formatting instructions differently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Avoid model-specific references.&lt;/strong&gt; Do not write &amp;quot;As Claude, you should...&amp;quot; or &amp;quot;Using your GPT capabilities...&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Focus on behavior and output.&lt;/strong&gt; Describe what you want the model to do, not how you think it should reason internally.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test across models.&lt;/strong&gt; Send the same prompt to Claude, GPT, and Gemini within T3 Chat to verify consistent behavior.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Personas: Reusable Context Configurations&lt;/h2&gt;
&lt;p&gt;Personas combine a system prompt with a preferred model selection into a reusable configuration. Think of them as &amp;quot;modes&amp;quot; you can switch between.&lt;/p&gt;
&lt;h3&gt;Creating Effective Personas&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Persona&lt;/th&gt;
&lt;th&gt;System Prompt Focus&lt;/th&gt;
&lt;th&gt;Model Choice&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code Reviewer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Security, performance, style guide checks&lt;/td&gt;
&lt;td&gt;Claude Sonnet (strong at code analysis)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Technical Writer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Documentation standards, audience awareness&lt;/td&gt;
&lt;td&gt;GPT-4o (strong at prose)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Research Analyst&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Citation requirements, source evaluation&lt;/td&gt;
&lt;td&gt;Gemini Pro (strong at retrieval and synthesis)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Creative Brainstormer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Divergent thinking, idea generation&lt;/td&gt;
&lt;td&gt;Claude Opus or GPT-4o (creative capabilities)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;When to Create Personas&lt;/h3&gt;
&lt;p&gt;Create a Persona when you find yourself:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Repeating the same system prompt across conversations&lt;/li&gt;
&lt;li&gt;Switching to the same model for a specific type of task&lt;/li&gt;
&lt;li&gt;Wanting to standardize how the AI handles a particular workflow&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Personas save time and ensure consistency. Instead of re-configuring the system prompt and model for each new conversation, select the appropriate Persona and start working.&lt;/p&gt;
&lt;h2&gt;Model Selection as Context Management&lt;/h2&gt;
&lt;p&gt;Choosing the right model in T3 Chat is itself a context management decision because different models have different context window sizes and capabilities.&lt;/p&gt;
&lt;h3&gt;Context Window Comparison&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Approximate Context Window&lt;/th&gt;
&lt;th&gt;Strengths&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Sonnet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;200K tokens&lt;/td&gt;
&lt;td&gt;Long context, code analysis, nuanced reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Opus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;200K tokens&lt;/td&gt;
&lt;td&gt;Complex analysis, creative writing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-4o&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;128K tokens&lt;/td&gt;
&lt;td&gt;Broad capabilities, strong at prose and instruction following&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-o3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;200K tokens&lt;/td&gt;
&lt;td&gt;Deep reasoning, complex problem solving&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemini Pro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1M+ tokens&lt;/td&gt;
&lt;td&gt;Massive context, document analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Llama 3.1 (70B)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;128K tokens&lt;/td&gt;
&lt;td&gt;Open source, privacy-friendly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Model Selection Strategy&lt;/h3&gt;
&lt;p&gt;For T3 Chat users, the model selection strategy directly affects context management:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Long documents or many files:&lt;/strong&gt; Choose Gemini Pro (massive context window) or Claude (200K)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quick questions:&lt;/strong&gt; Choose a fast model (GPT-4o-mini, Claude Haiku) for responsiveness&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Privacy-sensitive content:&lt;/strong&gt; Choose Llama through a local endpoint&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Complex analysis:&lt;/strong&gt; Choose Claude Opus or GPT-o3 for deep reasoning&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Being deliberate about model selection means your context is used more effectively by a model suited to the task.&lt;/p&gt;
&lt;h2&gt;Conversation Organization&lt;/h2&gt;
&lt;p&gt;T3 Chat provides tools for organizing your conversations into a structured workspace.&lt;/p&gt;
&lt;h3&gt;Folders&lt;/h3&gt;
&lt;p&gt;Group conversations by project, topic, or workflow. This is not just for tidiness; organized conversations make it easier to find and resume context:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/projects/web-app/&lt;/code&gt; might contain conversations about frontend, backend, and deployment&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/research/market-analysis/&lt;/code&gt; might contain conversations about different market segments&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/writing/blog-series/&lt;/code&gt; might contain conversations for each blog post&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Pinned Conversations&lt;/h3&gt;
&lt;p&gt;Pin important conversations for quick access. Pin your most frequently referenced threads so you can revisit them without searching.&lt;/p&gt;
&lt;h3&gt;Naming Conventions&lt;/h3&gt;
&lt;p&gt;Name conversations descriptively:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;quot;Auth module refactoring plan&amp;quot; is searchable and findable&lt;/li&gt;
&lt;li&gt;&amp;quot;New chat 47&amp;quot; is neither&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Good naming is a form of context management because it makes your accumulated knowledge retrievable.&lt;/p&gt;
&lt;h2&gt;File Attachments&lt;/h2&gt;
&lt;p&gt;T3 Chat supports file uploads for providing document-level context within conversations.&lt;/p&gt;
&lt;h3&gt;Supported File Types&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Documents:&lt;/strong&gt; PDFs, Markdown, plain text&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Images:&lt;/strong&gt; Screenshots, diagrams, mockups&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spreadsheets:&lt;/strong&gt; CSV, Excel files for data analysis&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code files:&lt;/strong&gt; Source code in any language&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Best Practices for File Attachments&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Upload only the files relevant to the current question. Uploading your entire project creates noise.&lt;/li&gt;
&lt;li&gt;For long documents, tell the model which sections to focus on: &amp;quot;This is our API specification. Focus on the authentication endpoints in Section 3.&amp;quot;&lt;/li&gt;
&lt;li&gt;For images, provide a text description of what the model should look for: &amp;quot;This is a screenshot of our dashboard. The chart in the upper-right shows incorrect data.&amp;quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;External Documents: PDFs vs. Markdown&lt;/h2&gt;
&lt;h3&gt;PDFs&lt;/h3&gt;
&lt;p&gt;T3 Chat can process PDFs uploaded as attachments. PDFs work well for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Formal documents (research papers, specifications, contracts)&lt;/li&gt;
&lt;li&gt;Published content with fixed formatting&lt;/li&gt;
&lt;li&gt;Multi-page documents with embedded images and tables&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Markdown&lt;/h3&gt;
&lt;p&gt;For context you author specifically for the AI (system prompts, reference documents, instructions), Markdown is cleaner:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Models parse Markdown more reliably than extracted PDF text&lt;/li&gt;
&lt;li&gt;Markdown is easier to version and update&lt;/li&gt;
&lt;li&gt;The structure (headings, lists, code blocks) is explicit, not inferred&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;The Practical Rule&lt;/h3&gt;
&lt;p&gt;If the document exists as a PDF and you cannot easily convert it, upload the PDF. If you are writing the document for the purpose of giving it to the AI, write it in Markdown.&lt;/p&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;T3 Chat supports MCP (Model Context Protocol) server connections, allowing the platform to integrate with external data sources and tools. This extends T3 Chat&apos;s capabilities beyond conversation and file uploads by enabling connections to services like Google Drive, Slack, GitHub, databases, and custom APIs.&lt;/p&gt;
&lt;h3&gt;How MCP Works in T3 Chat&lt;/h3&gt;
&lt;p&gt;MCP servers provide T3 Chat with access to external resources and tools. When configured, the AI can query external data sources, retrieve real-time information, and perform actions through connected services. This makes T3 Chat more than just a chatbot: it becomes an interface for interacting with your broader tool ecosystem.&lt;/p&gt;
&lt;h3&gt;When MCP Adds Value&lt;/h3&gt;
&lt;p&gt;MCP is most useful in T3 Chat when your conversations need live data access:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Querying a database while discussing architecture decisions&lt;/li&gt;
&lt;li&gt;Accessing project management data during planning conversations&lt;/li&gt;
&lt;li&gt;Retrieving documentation from connected services&lt;/li&gt;
&lt;li&gt;Interacting with APIs through a conversational interface&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For conversations that rely purely on the AI&apos;s training data or uploaded files, MCP is unnecessary. It adds the most value when you need real-time connections to external systems during your conversations.&lt;/p&gt;
&lt;h2&gt;Thinking About Context Levels in T3 Chat&lt;/h2&gt;
&lt;h3&gt;Quick Questions (Minimal Context)&lt;/h3&gt;
&lt;p&gt;For factual or conceptual questions, just ask. No special setup needed:&lt;/p&gt;
&lt;p&gt;&amp;quot;What is the difference between horizontal and vertical scaling in database architecture?&amp;quot;&lt;/p&gt;
&lt;p&gt;The model&apos;s training data is sufficient context, and no files or custom prompts are required.&lt;/p&gt;
&lt;h3&gt;Working Sessions (Moderate Context)&lt;/h3&gt;
&lt;p&gt;For sustained work on a topic, create a conversation with an appropriate Persona and provide reference files:&lt;/p&gt;
&lt;p&gt;&amp;quot;I am building a REST API for a healthcare application. Here is the data model [attach file]. Help me design the endpoints following HIPAA compliance patterns.&amp;quot;&lt;/p&gt;
&lt;h3&gt;Complex Projects (Comprehensive Context)&lt;/h3&gt;
&lt;p&gt;For multi-day projects, create a folder of organized conversations, use Personas for different phases of work, and bridge context between conversations using explicit summaries.&lt;/p&gt;
&lt;h2&gt;Model-Specific Context Tuning&lt;/h2&gt;
&lt;p&gt;Each model family responds slightly differently to the same context. Here are practical tips for tuning:&lt;/p&gt;
&lt;h3&gt;Claude in T3 Chat&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Responds well to role-based system prompts (&amp;quot;You are a...&amp;quot;)&lt;/li&gt;
&lt;li&gt;Handles very long contexts gracefully&lt;/li&gt;
&lt;li&gt;Benefits from explicit format instructions&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;GPT Models in T3 Chat&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Follows formatting instructions precisely&lt;/li&gt;
&lt;li&gt;Works well with example-based prompts (&amp;quot;Here is an example of what I want: ...&amp;quot;)&lt;/li&gt;
&lt;li&gt;Benefits from numbered constraints&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Gemini in T3 Chat&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Excels with document analysis tasks&lt;/li&gt;
&lt;li&gt;Handles massive context windows (1M+ tokens)&lt;/li&gt;
&lt;li&gt;Benefits from clear section headers in system prompts&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;When to Use T3 Chat vs. Other Tools&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Use T3 Chat when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You want to compare responses across different models&lt;/li&gt;
&lt;li&gt;You need flexible model selection without multiple subscriptions&lt;/li&gt;
&lt;li&gt;Your task is conversational (research, analysis, writing, brainstorming)&lt;/li&gt;
&lt;li&gt;You want Personas for reusable configurations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use a coding IDE (Cursor, Windsurf, Zed) when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your task involves editing code files directly&lt;/li&gt;
&lt;li&gt;You need workspace indexing and @codebase search&lt;/li&gt;
&lt;li&gt;You want agent mode to make cross-file changes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use a terminal agent (Claude Code, Gemini CLI) when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You need direct terminal access and command execution&lt;/li&gt;
&lt;li&gt;Your task involves running tests, builds, or deployments&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Model Comparison Pattern&lt;/h3&gt;
&lt;p&gt;Use T3 Chat&apos;s multi-model support to compare responses:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Ask the same question to Claude, GPT, and Gemini&lt;/li&gt;
&lt;li&gt;Compare the responses for accuracy, depth, and style&lt;/li&gt;
&lt;li&gt;Use the best response as a starting point and refine it&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is especially useful for high-stakes content where you want multiple perspectives before finalizing.&lt;/p&gt;
&lt;h3&gt;The Persona Pipeline Pattern&lt;/h3&gt;
&lt;p&gt;Chain Personas for multi-step work:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Research Persona&lt;/strong&gt; (Gemini): Gather information and sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Analysis Persona&lt;/strong&gt; (Claude): Analyze the research and identify key themes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Writing Persona&lt;/strong&gt; (GPT): Draft the final output based on the analysis&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each step uses a model optimized for that type of work, with context transferred manually between conversations.&lt;/p&gt;
&lt;h3&gt;The Context Bridging Pattern&lt;/h3&gt;
&lt;p&gt;When switching between models in the same conversation, bridge the context explicitly:&lt;/p&gt;
&lt;p&gt;&amp;quot;Here is a summary of what we discussed so far: [summary]. I am switching to a different model. Please continue from this point.&amp;quot;&lt;/p&gt;
&lt;p&gt;This helps the new model pick up the thread without losing continuity.&lt;/p&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not using Personas for repeatable work.&lt;/strong&gt; If you are configuring the same system prompt and model combination repeatedly, create a Persona.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring model differences.&lt;/strong&gt; Claude, GPT, and Gemini respond differently to the same prompt. If results are not meeting expectations, try a different model before rewriting the prompt.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Uploading too many files.&lt;/strong&gt; Each file consumes context window space. Be selective and upload only what is relevant to the current question.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not organizing conversations.&lt;/strong&gt; Without folders and descriptive names, your accumulated research and context becomes unfindable as conversations accumulate.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Using the same model for everything.&lt;/strong&gt; T3 Chat&apos;s strength is model flexibility. Use Gemini for massive documents, Claude for code analysis, and GPT for prose generation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Writing model-specific system prompts.&lt;/strong&gt; If your system prompt only works with one model, it is too model-specific. Write instructions that describe behavior and output, not internal reasoning.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about working effectively with AI interfaces and context management strategies, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for VS Code with LLM Plugins: A Complete Guide to Building Your Own AI-Powered IDE</title><link>https://iceberglakehouse.com/posts/2026-03-context-vscode-llm-plugins/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-vscode-llm-plugins/</guid><description>
Visual Studio Code is the most widely used code editor in the world, and its extensibility means you can integrate AI capabilities through a growing ...</description><pubDate>Sat, 07 Mar 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Visual Studio Code is the most widely used code editor in the world, and its extensibility means you can integrate AI capabilities through a growing ecosystem of LLM plugins. Unlike purpose-built AI editors (Cursor, Windsurf, Zed), VS Code gives you the freedom to choose and combine AI extensions, configure them to your preferences, and even switch between providers without changing editors. The tradeoff is that context management is not as seamlessly integrated as in dedicated AI editors. It requires more deliberate configuration.&lt;/p&gt;
&lt;p&gt;This guide covers context management strategies for the most popular VS Code AI extensions: GitHub Copilot, Continue, Cline (formerly Claude Dev), Aider, and others. It explains what context management capabilities each offers and how to configure them for maximum effectiveness.&lt;/p&gt;
&lt;h2&gt;The VS Code AI Extension Landscape&lt;/h2&gt;
&lt;p&gt;VS Code&apos;s AI extension ecosystem falls into several categories:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Extensions&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inline completion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GitHub Copilot, CodeiumChat, Supermaven&lt;/td&gt;
&lt;td&gt;Suggest code as you type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chat panel&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Copilot Chat, Continue, Cody&lt;/td&gt;
&lt;td&gt;Conversational AI in a sidebar&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agentic coding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cline, Aider, Roo Code&lt;/td&gt;
&lt;td&gt;Autonomous agents that read/write files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Specialized&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mintlify, Tabnine&lt;/td&gt;
&lt;td&gt;Documentation, enterprise-focused&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Each category manages context differently. Inline completion plugins use the current file and nearby tabs. Chat panel plugins use conversation history and file references. Agentic plugins have the broadest context, reading the codebase, running commands, and making multi-file changes.&lt;/p&gt;
&lt;h2&gt;GitHub Copilot: Context Management&lt;/h2&gt;
&lt;p&gt;GitHub Copilot is the most widely used AI coding assistant. Its context management has evolved significantly with the introduction of Copilot Chat and Agent Mode.&lt;/p&gt;
&lt;h3&gt;Inline Completions&lt;/h3&gt;
&lt;p&gt;Copilot&apos;s inline suggestions use:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The current file content (especially the lines around your cursor)&lt;/li&gt;
&lt;li&gt;Open tabs in the editor (nearby files provide additional context)&lt;/li&gt;
&lt;li&gt;File names and directory structure (for naming conventions)&lt;/li&gt;
&lt;li&gt;Comment and docstring context (comments above your cursor guide suggestions)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; Keep related files open in tabs. Copilot considers open files as context, so having related source files, type definitions, and tests open improves suggestion quality.&lt;/p&gt;
&lt;h3&gt;Copilot Chat&lt;/h3&gt;
&lt;p&gt;Copilot Chat operates in the sidebar with conversation-based interaction:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;#file&lt;/code&gt; to reference specific files&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;#editor&lt;/code&gt; to reference the active editor content&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;#selection&lt;/code&gt; to reference selected code&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;#codebase&lt;/code&gt; to search the workspace&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;@workspace&lt;/code&gt; to ask questions about the entire project&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Copilot Agent Mode&lt;/h3&gt;
&lt;p&gt;Agent Mode (introduced in 2025) makes Copilot an autonomous agent that can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Plan multi-step changes&lt;/li&gt;
&lt;li&gt;Read and write files across the project&lt;/li&gt;
&lt;li&gt;Run terminal commands&lt;/li&gt;
&lt;li&gt;Make and verify changes iteratively&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Agent Mode uses the broadest context of any Copilot feature: it can explore the codebase, read package.json, check test results, and understand project structure before making changes.&lt;/p&gt;
&lt;h3&gt;Custom Instructions for Copilot&lt;/h3&gt;
&lt;p&gt;Create a &lt;code&gt;.github/copilot-instructions.md&lt;/code&gt; file in your project root:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Copilot Instructions

## Code Style

- Use TypeScript strict mode
- Prefer functional components with hooks
- Use named exports, not default exports
- Follow the Airbnb ESLint configuration

## Testing

- Write tests using Vitest
- Use React Testing Library for component tests
- Mock API calls with MSW

## Architecture

- Components go in src/components/
- API clients go in src/api/
- Shared types go in src/types/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These instructions are loaded by Copilot for every interaction within the project, functioning like .cursor/rules/ in Cursor.&lt;/p&gt;
&lt;h2&gt;Continue: Open-Source AI Extension&lt;/h2&gt;
&lt;p&gt;Continue is an open-source VS Code extension that supports multiple LLM providers and offers extensive context management features.&lt;/p&gt;
&lt;h3&gt;Provider Configuration&lt;/h3&gt;
&lt;p&gt;Continue supports:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;OpenAI, Anthropic, Google models via API keys&lt;/li&gt;
&lt;li&gt;Ollama for local models&lt;/li&gt;
&lt;li&gt;Any OpenAI-compatible endpoint&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Context Providers&lt;/h3&gt;
&lt;p&gt;Continue&apos;s &amp;quot;@-mention&amp;quot; context system includes:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context Provider&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@file&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include a specific file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@code&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include code blocks from the codebase&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@docs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Search indexed documentation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@codebase&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Semantic search across the project&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@terminal&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include recent terminal output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@diff&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include current Git diff&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@repo&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include repository metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@folder&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include folder structure&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;.continuerc.json Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;models&amp;quot;: [
    {
      &amp;quot;title&amp;quot;: &amp;quot;Claude Sonnet&amp;quot;,
      &amp;quot;provider&amp;quot;: &amp;quot;anthropic&amp;quot;,
      &amp;quot;model&amp;quot;: &amp;quot;claude-sonnet-4-20250514&amp;quot;,
      &amp;quot;apiKey&amp;quot;: &amp;quot;your-key&amp;quot;
    },
    {
      &amp;quot;title&amp;quot;: &amp;quot;Local Llama&amp;quot;,
      &amp;quot;provider&amp;quot;: &amp;quot;ollama&amp;quot;,
      &amp;quot;model&amp;quot;: &amp;quot;llama3.1:70b&amp;quot;
    }
  ],
  &amp;quot;customCommands&amp;quot;: [
    {
      &amp;quot;name&amp;quot;: &amp;quot;review&amp;quot;,
      &amp;quot;prompt&amp;quot;: &amp;quot;Review this code for security issues, performance problems, and style violations.&amp;quot;
    }
  ],
  &amp;quot;docs&amp;quot;: [
    {
      &amp;quot;title&amp;quot;: &amp;quot;React Docs&amp;quot;,
      &amp;quot;startUrl&amp;quot;: &amp;quot;https://react.dev/reference&amp;quot;
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Why Continue Stands Out for Context&lt;/h3&gt;
&lt;p&gt;Continue&apos;s open-source nature means you can inspect exactly how context is assembled. Its support for custom context providers extends beyond built-in options, allowing teams to create project-specific context sources.&lt;/p&gt;
&lt;h2&gt;Cline (formerly Claude Dev): Agentic Coding Agent&lt;/h2&gt;
&lt;p&gt;Cline is a VS Code extension that turns Claude into an autonomous coding agent within the editor.&lt;/p&gt;
&lt;h3&gt;Context Capabilities&lt;/h3&gt;
&lt;p&gt;Cline has one of the broadest context scopes among VS Code extensions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reads and writes files across the entire project&lt;/li&gt;
&lt;li&gt;Runs terminal commands&lt;/li&gt;
&lt;li&gt;Browses the web (for documentation lookup)&lt;/li&gt;
&lt;li&gt;Takes screenshots of running applications&lt;/li&gt;
&lt;li&gt;Manages its own task history&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Project Instructions&lt;/h3&gt;
&lt;p&gt;Create a &lt;code&gt;.clinerules&lt;/code&gt; file in your project root:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Project: SaaS Application

## Stack

- Python 3.12 with FastAPI
- PostgreSQL with SQLAlchemy
- Redis for caching
- React frontend with TypeScript

## Build Commands

- Backend: `uvicorn app.main:app --reload`
- Frontend: `npm run dev`
- Tests: `pytest -v`

## Conventions

- All API responses use the ResponseModel pattern
- Database sessions are managed by dependency injection
- Frontend state uses React Query for server state
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Custom MCP Servers&lt;/h3&gt;
&lt;p&gt;Cline supports MCP servers configured through its settings panel, enabling connections to databases, APIs, and other external tools directly within the VS Code environment.&lt;/p&gt;
&lt;h3&gt;Context Window Management&lt;/h3&gt;
&lt;p&gt;Cline tracks context window usage and can summarize previous conversation history when the window fills up. This automatic context management prevents the common problem of long sessions degrading quality.&lt;/p&gt;
&lt;h2&gt;Aider: Git-Aware AI Pair Programmer&lt;/h2&gt;
&lt;p&gt;Aider integrates with VS Code as a terminal-based tool that focuses on Git-aware code modifications.&lt;/p&gt;
&lt;h3&gt;Context Management in Aider&lt;/h3&gt;
&lt;p&gt;Aider uses a unique context model:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Chat files:&lt;/strong&gt; Files actively being discussed and modified&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Watch files:&lt;/strong&gt; Files included as read-only context&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Repository map:&lt;/strong&gt; An overview of the entire repository structure that fits in context&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Commands for Context Control&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;/add src/auth/middleware.ts    # Add to chat context (can be edited)
/read docs/architecture.md     # Add as read-only context
/drop src/auth/middleware.ts   # Remove from context
/map                           # Show the repository map
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;The Repository Map&lt;/h3&gt;
&lt;p&gt;Aider&apos;s repository map is a compressed representation of your entire codebase (file names, function signatures, class definitions) that fits within the context window. This gives the AI a bird&apos;s-eye view of the project without consuming the entire context budget.&lt;/p&gt;
&lt;h2&gt;Thinking About Context Levels Across Extensions&lt;/h2&gt;
&lt;h3&gt;Minimal Context (Quick Completions)&lt;/h3&gt;
&lt;p&gt;For inline code completions, Copilot and Supermaven work well with minimal setup. Keep related files open in tabs and let the extension use the editor context.&lt;/p&gt;
&lt;h3&gt;Moderate Context (Feature Development)&lt;/h3&gt;
&lt;p&gt;Use a chat extension (Copilot Chat, Continue) with explicit file references. The @-mention system lets you include exactly the files relevant to the current task.&lt;/p&gt;
&lt;h3&gt;Comprehensive Context (Major Refactoring)&lt;/h3&gt;
&lt;p&gt;Use an agentic extension (Cline, Copilot Agent Mode) that can explore the codebase, run tests, and make changes across multiple files. Configure project instructions (.clinerules, copilot-instructions.md, .continuerc.json) to ensure the agent follows your conventions.&lt;/p&gt;
&lt;h2&gt;External Documents: PDFs vs. Markdown&lt;/h2&gt;
&lt;h3&gt;Markdown Is Universal&lt;/h3&gt;
&lt;p&gt;All VS Code AI extensions work natively with Markdown. Project instructions, coding standards, and architecture documents should be Markdown files in your repository.&lt;/p&gt;
&lt;h3&gt;PDFs&lt;/h3&gt;
&lt;p&gt;Most VS Code extensions do not parse PDFs directly. If you have reference material in PDF form, extract relevant sections into Markdown files. Some extensions (like Cline with web browsing) can fetch online documentation, reducing the need for local PDF conversion.&lt;/p&gt;
&lt;h3&gt;Documentation Indexing&lt;/h3&gt;
&lt;p&gt;Continue and Copilot Chat support documentation indexing through @docs. Add your framework documentation URLs to the extension configuration so the AI can reference current documentation during conversations.&lt;/p&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;MCP support varies by extension:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Extension&lt;/th&gt;
&lt;th&gt;MCP Support&lt;/th&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Settings panel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Continue&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;config.json&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Copilot&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Through GitHub integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Aider&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Direct terminal commands instead&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For extensions that support MCP, the configuration follows the standard pattern: specify the server command, arguments, and environment variables. MCP tools become available within the extension&apos;s chat or agent interface.&lt;/p&gt;
&lt;h2&gt;settings.json: Centralizing AI Configuration&lt;/h2&gt;
&lt;p&gt;VS Code&apos;s &lt;code&gt;settings.json&lt;/code&gt; is where many AI extensions read their configuration. Here are common settings patterns:&lt;/p&gt;
&lt;h3&gt;Per-Workspace Settings&lt;/h3&gt;
&lt;p&gt;Create a &lt;code&gt;.vscode/settings.json&lt;/code&gt; file in your project to configure AI extensions per-project:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;github.copilot.enable&amp;quot;: {
    &amp;quot;markdown&amp;quot;: true,
    &amp;quot;plaintext&amp;quot;: false
  },
  &amp;quot;continue.enableTabAutocomplete&amp;quot;: false,
  &amp;quot;cline.customInstructions&amp;quot;: &amp;quot;Follow the conventions in INSTRUCTIONS.md&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Per-workspace settings override user-level settings, allowing you to tailor AI behavior to each project&apos;s needs.&lt;/p&gt;
&lt;h3&gt;Workspace Trust and Security&lt;/h3&gt;
&lt;p&gt;VS Code&apos;s Workspace Trust feature is important when using AI extensions. In untrusted workspaces, some extensions may limit their capabilities (for example, restricting file access or command execution). This is a security feature: it prevents untrusted code from being automatically processed by AI tools that have file system access.&lt;/p&gt;
&lt;p&gt;For your own projects, trust the workspace to enable full AI capabilities. For third-party codebases, consider the implications before trusting.&lt;/p&gt;
&lt;h2&gt;When to Use VS Code with Plugins vs. Dedicated AI Editors&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Choose VS Code with plugins when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You already use VS Code and want to add AI incrementally&lt;/li&gt;
&lt;li&gt;You want to mix and match extensions from different providers&lt;/li&gt;
&lt;li&gt;You have existing VS Code extensions and workflows you cannot replicate elsewhere&lt;/li&gt;
&lt;li&gt;You need the specific capabilities of an extension that only exists for VS Code (like Cline)&lt;/li&gt;
&lt;li&gt;Your team uses different AI providers and needs a common editor&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Choose Cursor or Windsurf when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You want the most seamlessly integrated AI experience&lt;/li&gt;
&lt;li&gt;You prefer automatic codebase indexing over manual context management&lt;/li&gt;
&lt;li&gt;You are starting fresh and do not have an existing VS Code extension stack&lt;/li&gt;
&lt;li&gt;You want features like .cursor/rules/ or Cascade flows that are deeply integrated&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Choose a terminal agent (Claude Code, Gemini CLI) when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your workflow is terminal-centric&lt;/li&gt;
&lt;li&gt;You need direct shell command execution as your primary interaction&lt;/li&gt;
&lt;li&gt;You prefer a focused, distraction-free coding experience&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Multi-Extension Stack&lt;/h3&gt;
&lt;p&gt;Use multiple extensions simultaneously for different purposes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Copilot&lt;/strong&gt; for inline completions (fast, low-friction)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Continue&lt;/strong&gt; for chat with @codebase search (exploratory questions)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cline&lt;/strong&gt; for agentic tasks (multi-file changes, complex features)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each extension handles a different level of context and interaction.&lt;/p&gt;
&lt;h3&gt;The Consistent Instructions Pattern&lt;/h3&gt;
&lt;p&gt;Maintain a single &lt;code&gt;INSTRUCTIONS.md&lt;/code&gt; file in your project root and reference it from each extension&apos;s configuration:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;.github/copilot-instructions.md&lt;/code&gt; imports or mirrors INSTRUCTIONS.md&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.continuerc.json&lt;/code&gt; references INSTRUCTIONS.md&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.clinerules&lt;/code&gt; mirrors the same conventions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This ensures consistent behavior regardless of which extension handles the task.&lt;/p&gt;
&lt;h3&gt;The Provider Rotation Pattern&lt;/h3&gt;
&lt;p&gt;Use different providers for different extensions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Copilot: GitHub&apos;s infrastructure (fast, always available)&lt;/li&gt;
&lt;li&gt;Continue: Anthropic API (strong at code analysis)&lt;/li&gt;
&lt;li&gt;Cline: Local Ollama model (privacy for sensitive code)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This gives you the benefits of multiple providers within a single editor.&lt;/p&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Using too many AI extensions simultaneously.&lt;/strong&gt; Running five AI extensions creates conflicts, performance overhead, and conflicting suggestions. Pick a primary stack and disable the rest.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not configuring project instructions.&lt;/strong&gt; Every AI extension supports some form of project-level instructions. Without them, the AI relies on generic conventions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring @codebase search.&lt;/strong&gt; Both Copilot Chat and Continue offer codebase search. Using it produces more relevant responses than manually specifying files.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not keeping related tabs open.&lt;/strong&gt; Inline completion quality improves when related files are open in the editor. Keep type definitions, tests, and related source files in your tab bar.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Choosing the wrong extension for the task.&lt;/strong&gt; Inline completions for quick code, chat for questions, agent mode for complex changes. Match the tool to the task.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Skipping documentation indexing.&lt;/strong&gt; If you are working with a framework, index its documentation so the AI references current, accurate information rather than potentially outdated training data.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about AI-assisted development and context management strategies, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The Model Context Protocol (MCP) Explained: A Complete Guide to How Every Major AI Tool Connects to External Data</title><link>https://iceberglakehouse.com/posts/2026-03-context-mcp-deep-dive/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-mcp-deep-dive/</guid><description>
The Model Context Protocol (MCP) has become the universal standard for connecting AI models to external tools, data sources, and services. Originally...</description><pubDate>Sat, 07 Mar 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The Model Context Protocol (MCP) has become the universal standard for connecting AI models to external tools, data sources, and services. Originally open-sourced by Anthropic in November 2024 and now managed by the Linux Foundation, MCP solves one of the biggest frustrations in working with AI: getting models to interact with the systems where your actual work lives. Instead of copying and pasting data into chat windows or uploading files manually, MCP lets AI tools query databases, read documentation, interact with APIs, manage files, and perform actions across your entire tool ecosystem through a standardized protocol.&lt;/p&gt;
&lt;p&gt;This guide explains how MCP works at a technical level, what problems it solves, and exactly how each of the 17 major AI tools supports it (or does not). If you use any AI coding assistant, research tool, or chat interface, understanding MCP will help you build a more connected and productive AI workflow.&lt;/p&gt;
&lt;h2&gt;What Is MCP and Why Does It Matter?&lt;/h2&gt;
&lt;p&gt;MCP is a client-server protocol that standardizes how AI applications connect to external resources. Before MCP, every AI tool that wanted to connect to an external service needed a custom integration. If you wanted Claude to query your PostgreSQL database, you needed one integration. If you wanted ChatGPT to do the same thing, you needed a completely different integration. MCP eliminates this duplication by creating a single protocol that works across tools.&lt;/p&gt;
&lt;h3&gt;The Architecture&lt;/h3&gt;
&lt;p&gt;MCP uses a three-part architecture:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; The AI application you are using (ChatGPT, Claude Desktop, Cursor, etc.)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Client:&lt;/strong&gt; The component within the host that manages MCP connections&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Server:&lt;/strong&gt; An external process that exposes tools, resources, and prompts to the AI&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The server is where the power lives. An MCP server can expose any capability: reading files, querying databases, calling APIs, running commands, searching the web, or interacting with services like GitHub, Jira, or Slack. The AI model discovers these capabilities through the protocol and can invoke them during conversations.&lt;/p&gt;
&lt;h3&gt;How Communication Works&lt;/h3&gt;
&lt;p&gt;MCP supports multiple transport mechanisms:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;STDIO (Standard Input/Output):&lt;/strong&gt; The host launches the MCP server as a local process and communicates through stdin/stdout. This is the most common approach for local servers. The AI tool starts the server process, sends requests through standard input, and reads responses from standard output.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Streamable HTTP:&lt;/strong&gt; The server runs as a web service accessible via HTTP. This is used for remote servers and cloud-hosted services. It supports bearer token and OAuth authentication for secure connections.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Server-Sent Events (SSE):&lt;/strong&gt; An older HTTP-based transport that some tools still support. Being superseded by Streamable HTTP in newer implementations.&lt;/p&gt;
&lt;h3&gt;What MCP Servers Expose&lt;/h3&gt;
&lt;p&gt;An MCP server can expose three types of capabilities:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Tools:&lt;/strong&gt; Functions the AI can call (query a database, create a file, search an API)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Resources:&lt;/strong&gt; Data the AI can read (file contents, database schemas, documentation)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prompts:&lt;/strong&gt; Pre-built prompt templates the AI can use&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;When an AI tool connects to an MCP server, it discovers the available tools and their parameter schemas. The AI model then decides when to invoke these tools based on the user&apos;s request. For example, if you ask &amp;quot;What tables are in my database?&amp;quot;, the AI sees that a database MCP server has a &amp;quot;list_tables&amp;quot; tool and calls it automatically.&lt;/p&gt;
&lt;h2&gt;The MCP Server Ecosystem&lt;/h2&gt;
&lt;p&gt;The MCP ecosystem has grown rapidly. Common server categories include:&lt;/p&gt;
&lt;h3&gt;Data and Database Servers&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Server&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PostgreSQL MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Query PostgreSQL databases, inspect schemas&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MySQL MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Query MySQL databases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SQLite MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Read and write SQLite databases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MongoDB MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Query MongoDB collections&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Snowflake MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Query Snowflake data warehouse&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Development Tool Servers&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Server&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GitHub MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manage repos, issues, PRs, code search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GitLab MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GitLab API integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Jira MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Create and manage tickets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sentry MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Error tracking and debugging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Playwright MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Browser automation and testing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Productivity and Communication Servers&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Server&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google Drive MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Read and manage Google Drive files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Slack MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Read and send Slack messages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gmail MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Read and send emails&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Notion MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Query and update Notion pages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Calendar MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manage calendar events&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Specialized Servers&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Server&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Filesystem MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Read and write local files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Brave Search MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Web search capabilities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fetch MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HTTP requests to URLs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory/Knowledge MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Persistent knowledge storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Docker MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Container management&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The key insight is that any MCP server you set up works across every MCP-compatible tool. A PostgreSQL MCP server configured once can be used by Claude Desktop, ChatGPT, Cursor, Gemini CLI, and any other tool that supports the protocol.&lt;/p&gt;
&lt;h2&gt;Writing Your Own MCP Server&lt;/h2&gt;
&lt;p&gt;MCP servers can be written in any language that supports stdin/stdout or HTTP. The most common implementations use TypeScript/JavaScript or Python.&lt;/p&gt;
&lt;h3&gt;Basic Server Structure (TypeScript)&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import { McpServer } from &amp;quot;@modelcontextprotocol/sdk/server/mcp.js&amp;quot;;
import { StdioServerTransport } from &amp;quot;@modelcontextprotocol/sdk/server/stdio.js&amp;quot;;

const server = new McpServer({
  name: &amp;quot;my-custom-server&amp;quot;,
  version: &amp;quot;1.0.0&amp;quot;,
});

// Define a tool
server.tool(
  &amp;quot;get_weather&amp;quot;,
  &amp;quot;Get the current weather for a location&amp;quot;,
  { location: { type: &amp;quot;string&amp;quot;, description: &amp;quot;City name&amp;quot; } },
  async ({ location }) =&amp;gt; {
    const weather = await fetchWeatherAPI(location);
    return {
      content: [{ type: &amp;quot;text&amp;quot;, text: JSON.stringify(weather) }],
    };
  }
);

// Start the server
const transport = new StdioServerTransport();
await server.connect(transport);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Basic Server Structure (Python)&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from mcp.server import Server
from mcp.types import Tool, TextContent
import mcp.server.stdio

server = Server(&amp;quot;my-custom-server&amp;quot;)

@server.list_tools()
async def list_tools():
    return [
        Tool(
            name=&amp;quot;get_weather&amp;quot;,
            description=&amp;quot;Get current weather for a location&amp;quot;,
            inputSchema={
                &amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,
                &amp;quot;properties&amp;quot;: {
                    &amp;quot;location&amp;quot;: {&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;, &amp;quot;description&amp;quot;: &amp;quot;City name&amp;quot;}
                },
                &amp;quot;required&amp;quot;: [&amp;quot;location&amp;quot;]
            }
        )
    ]

@server.call_tool()
async def call_tool(name, arguments):
    if name == &amp;quot;get_weather&amp;quot;:
        weather = await fetch_weather_api(arguments[&amp;quot;location&amp;quot;])
        return [TextContent(type=&amp;quot;text&amp;quot;, text=str(weather))]

async def main():
    async with mcp.server.stdio.stdio_server() as (read, write):
        await server.run(read, write, server.create_initialization_options())
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The protocol handles discovery, parameter validation, and response formatting. Your server just needs to define the tools it exposes and implement the logic for each one.&lt;/p&gt;
&lt;h2&gt;MCP Support Across Every Major AI Tool&lt;/h2&gt;
&lt;p&gt;Here is how MCP support works across all 17 AI tools covered in this series, grouped by support level.&lt;/p&gt;
&lt;h3&gt;Full Native MCP Support&lt;/h3&gt;
&lt;p&gt;These tools have deep, first-class MCP integration built into their core architecture.&lt;/p&gt;
&lt;hr&gt;
&lt;h4&gt;Claude Desktop&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Support level:&lt;/strong&gt; Full native support (MCP was created by Anthropic for this use case)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Configuration:&lt;/strong&gt; Edit &lt;code&gt;claude_desktop_config.json&lt;/code&gt; located at:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;macOS: &lt;code&gt;~/Library/Application Support/Claude/claude_desktop_config.json&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Windows: &lt;code&gt;%APPDATA%\Claude\claude_desktop_config.json&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Linux: &lt;code&gt;~/.config/Claude/claude_desktop_config.json&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;postgres&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@anthropic/mcp-server-postgres&amp;quot;],
      &amp;quot;env&amp;quot;: {
        &amp;quot;DATABASE_URL&amp;quot;: &amp;quot;postgresql://user@localhost:5432/mydb&amp;quot;
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; Claude Desktop acts as the MCP host. When you start the app, it launches all configured MCP servers as child processes. Claude discovers the available tools and can invoke them during conversations. You also access the settings through the app&apos;s menu: Claude &amp;gt; Settings &amp;gt; Developer &amp;gt; Edit Config. Desktop Extensions offer a simplified setup path as well.&lt;/p&gt;
&lt;hr&gt;
&lt;h4&gt;Claude Code&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Support level:&lt;/strong&gt; Full native support via the &lt;code&gt;claude mcp&lt;/code&gt; CLI command&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Configuration:&lt;/strong&gt; Managed through the command line:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;claude mcp add postgres -- npx -y @anthropic/mcp-server-postgres
claude mcp list
claude mcp remove postgres
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; Claude Code is an MCP client that can connect to both STDIO and Streamable HTTP servers. MCP tools become available as part of the agent&apos;s tool set alongside its built-in file and terminal tools. The agent autonomously decides when to invoke MCP tools based on the task.&lt;/p&gt;
&lt;hr&gt;
&lt;h4&gt;Claude CoWork&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Support level:&lt;/strong&gt; Full support through Claude Desktop&apos;s MCP configuration&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; CoWork runs within the Claude Desktop application, so it inherits all MCP server connections configured in &lt;code&gt;claude_desktop_config.json&lt;/code&gt;. CoWork can use MCP servers to access Google Drive, Gmail, databases, and other external services while performing multi-step tasks on your behalf.&lt;/p&gt;
&lt;hr&gt;
&lt;h4&gt;OpenAI Codex CLI&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Support level:&lt;/strong&gt; Full native support via &lt;code&gt;config.toml&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Configuration:&lt;/strong&gt; MCP servers are configured globally (&lt;code&gt;~/.codex/config.toml&lt;/code&gt;) or per-project (&lt;code&gt;.codex/config.toml&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[mcp_servers.postgres]
command = &amp;quot;npx&amp;quot;
args = [&amp;quot;-y&amp;quot;, &amp;quot;@anthropic/mcp-server-postgres&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Management commands:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;codex mcp add postgres
codex mcp list
codex mcp remove postgres
codex mcp login  # for authenticated servers
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; Codex CLI supports both STDIO and Streamable HTTP servers. It can also function as an MCP server itself, letting other MCP clients use Codex as a coding tool. Supports bearer token and OAuth authentication for remote servers.&lt;/p&gt;
&lt;hr&gt;
&lt;h4&gt;Gemini CLI&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Support level:&lt;/strong&gt; Full native support via &lt;code&gt;settings.json&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Configuration:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;postgres&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@anthropic/mcp-server-postgres&amp;quot;]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Management: Use the &lt;code&gt;/mcp&lt;/code&gt; command within Gemini CLI for sub-commands including authentication, listing servers and tools, and enabling/disabling servers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; Gemini CLI supports both local and remote MCP servers. Tools exposed by MCP servers become available to the Gemini agent, extending its capabilities beyond built-in file system and terminal tools. This is the primary mechanism for extending Gemini CLI&apos;s functionality.&lt;/p&gt;
&lt;hr&gt;
&lt;h4&gt;Cursor&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Support level:&lt;/strong&gt; Full native support via settings UI&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Configuration:&lt;/strong&gt; Add MCP servers through Cursor Settings &amp;gt; MCP &amp;gt; Add New MCP Server. Supports both &lt;code&gt;stdio&lt;/code&gt; and &lt;code&gt;sse&lt;/code&gt;/HTTP transport types. You provide the server name, type, command/URL, and optional environment variables.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;postgres&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@anthropic/mcp-server-postgres&amp;quot;]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; MCP tools become available in Cursor&apos;s Agent Mode. The AI assistant automatically invokes MCP tools when needed, or you can direct it to use specific tools by name. A green status indicator shows when the server is running. Requires Cursor version 0.4.5.9 or later.&lt;/p&gt;
&lt;hr&gt;
&lt;h4&gt;Windsurf&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Support level:&lt;/strong&gt; Full native support via settings and MCP Marketplace&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Configuration:&lt;/strong&gt; Configure through Windsurf Settings &amp;gt; Cascade &amp;gt; MCP Servers, or manually edit &lt;code&gt;mcp_config.json&lt;/code&gt;. Supports &lt;code&gt;stdio&lt;/code&gt;, Streamable HTTP, and SSE transports with OAuth support.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; Windsurf acts as the MCP host with Cascade as the MCP client. Up to 100 active tools can be connected at once. Windsurf provides an MCP Marketplace for discovering and installing servers. Users can auto-approve specific tools or manually review each tool call.&lt;/p&gt;
&lt;hr&gt;
&lt;h4&gt;Google Antigravity&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Support level:&lt;/strong&gt; Full native support&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Configuration:&lt;/strong&gt; MCP servers are configured through the tool&apos;s settings following the standard MCP pattern.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; MCP tool descriptions are included in the context assembly for every interaction. When the agent enters agentic mode, it can invoke MCP tools alongside its built-in tools (file system, terminal, browser). Best used when tasks require information from outside the codebase.&lt;/p&gt;
&lt;hr&gt;
&lt;h4&gt;OpenCode&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Support level:&lt;/strong&gt; Full native support via &lt;code&gt;opencode.jsonc&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Configuration:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;postgres&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@anthropic/mcp-server-postgres&amp;quot;]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Management: Use the &lt;code&gt;opencode mcp&lt;/code&gt; command. Supports both local and remote servers with OAuth authentication for remote connections.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; MCP tools become automatically available to the LLM powering OpenCode. Local settings can override remote defaults. Important security note: local MCP servers can execute commands without confirmation, so be cautious with untrusted project configurations.&lt;/p&gt;
&lt;hr&gt;
&lt;h4&gt;Zed&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Support level:&lt;/strong&gt; Full support via extensions and settings&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Configuration:&lt;/strong&gt; Configure MCP servers through Zed extensions or directly in settings:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;context_servers&amp;quot;: {
    &amp;quot;postgres&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@anthropic/mcp-server-postgres&amp;quot;]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; MCP is central to Zed&apos;s AI agent capabilities. Servers can be installed as pre-built extensions or configured as custom servers. Zed also developed the Agent Client Protocol (ACP) in collaboration with JetBrains for broader agent interoperability. MCP tools are available within the assistant panel for agentic interactions.&lt;/p&gt;
&lt;h3&gt;MCP Support Through Application Features&lt;/h3&gt;
&lt;p&gt;These tools support MCP but through specific application features rather than general-purpose configuration.&lt;/p&gt;
&lt;hr&gt;
&lt;h4&gt;Claude Web&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Support level:&lt;/strong&gt; Remote MCP server support via Connectors&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Configuration:&lt;/strong&gt; Navigate to &lt;strong&gt;Settings &amp;gt; Connectors&lt;/strong&gt; in the Claude.ai web interface. Add a custom connector by providing the remote MCP server&apos;s URL. Available across Free, Pro, Max, Team, and Enterprise plans (free users may have connection limits).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; Claude Web connects to remote MCP servers (cloud-hosted services accessed via URL). This enables integrations with Google Drive, Slack, GitHub, Asana, Canva, Figma, and any custom API exposed through a remote MCP server. Claude Web also supports MCP Apps, which allow MCP servers to render interactive UIs (dashboards, project boards) directly within the chat interface. Note: Claude Web only supports remote servers. For local MCP servers (STDIO), use Claude Desktop.&lt;/p&gt;
&lt;hr&gt;
&lt;h4&gt;ChatGPT&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Support level:&lt;/strong&gt; MCP support via Developer Mode (September 2025)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Configuration:&lt;/strong&gt; Enable Developer Mode in ChatGPT settings (requires Plus, Pro, Team, Enterprise, or Education plan). Configure MCP server endpoints through the Developer Mode settings panel.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; ChatGPT connects to MCP servers through Developer Mode, supporting both read and write operations. This means ChatGPT can fetch data from and update external systems (Jira tickets, calendars, databases) directly from the chat interface. The desktop app supports additional local MCP connections. OpenAI cautions that write operations carry risk and should be tested carefully.&lt;/p&gt;
&lt;hr&gt;
&lt;h4&gt;Perplexity&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Support level:&lt;/strong&gt; Local MCP support on macOS application&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Configuration:&lt;/strong&gt; Configure local MCP servers through the macOS app settings. Remote MCP servers are planned for paid subscribers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; Local MCP servers give Perplexity access to your file system, local databases, and applications, complementing its primary web search capabilities. This lets you combine Perplexity&apos;s research strengths with local data access.&lt;/p&gt;
&lt;hr&gt;
&lt;h4&gt;T3 Chat&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Support level:&lt;/strong&gt; MCP support for external data sources&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Configuration:&lt;/strong&gt; Configure through T3 Chat&apos;s settings interface.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; MCP servers provide T3 Chat with access to external resources including Google Drive, Slack, GitHub, databases, and custom APIs. This extends T3 Chat beyond a chat interface into an integrated workspace that can query and interact with external services.&lt;/p&gt;
&lt;hr&gt;
&lt;h4&gt;OpenWork&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Support level:&lt;/strong&gt; MCP support through plugin architecture&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Configuration:&lt;/strong&gt; Configure MCP servers through OpenWork&apos;s settings panel.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; MCP connections become available as tools that OpenWork&apos;s Skills can utilize. Each server connection extends the capabilities available to the AI agent during file and task operations.&lt;/p&gt;
&lt;hr&gt;
&lt;h4&gt;VS Code with LLM Plugins&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Support level:&lt;/strong&gt; Varies by extension&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Extension&lt;/th&gt;
&lt;th&gt;MCP Support&lt;/th&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full support&lt;/td&gt;
&lt;td&gt;Settings panel configuration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Continue&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full support&lt;/td&gt;
&lt;td&gt;config.json&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GitHub Copilot&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Through GitHub integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Aider&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;Uses direct terminal commands instead&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; Each extension manages its own MCP connections. Cline and Continue offer the most complete MCP support, with tools becoming available within their respective chat and agent interfaces.&lt;/p&gt;
&lt;h3&gt;No MCP Support&lt;/h3&gt;
&lt;p&gt;These tools do not support MCP connections directly.&lt;/p&gt;
&lt;hr&gt;
&lt;h4&gt;Gemini Web and NotebookLM&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Why not:&lt;/strong&gt; Web-based interfaces cannot manage local server processes. Google Workspace integrations (Gmail, Drive, Docs) provide similar functionality for Google services. MCP support is available in the Gemini CLI.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Alternative:&lt;/strong&gt; Use Gemini CLI for MCP-connected workflows. Use Gemini Web and NotebookLM for their strengths in web-based research and document analysis.&lt;/p&gt;
&lt;h2&gt;Choosing the Right MCP Configuration&lt;/h2&gt;
&lt;h3&gt;For Individual Developers&lt;/h3&gt;
&lt;p&gt;Start with 2 to 3 MCP servers that match your daily workflow:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Filesystem MCP&lt;/strong&gt; for local file access (if your tool does not have built-in file access)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database MCP&lt;/strong&gt; for your primary database (PostgreSQL, MySQL, or SQLite)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GitHub MCP&lt;/strong&gt; for repository management&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;For Teams&lt;/h3&gt;
&lt;p&gt;Standardize on a common set of MCP servers and share configurations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Store MCP configurations in version control&lt;/li&gt;
&lt;li&gt;Use project-level configs (&lt;code&gt;.codex/config.toml&lt;/code&gt;, &lt;code&gt;opencode.jsonc&lt;/code&gt;) so the entire team connects to the same servers&lt;/li&gt;
&lt;li&gt;Document which MCP servers are required for each project&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;For Multi-Tool Workflows&lt;/h3&gt;
&lt;p&gt;The biggest advantage of MCP is portability. Set up a server once and use it everywhere:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Configure your database MCP server and use it from Claude Desktop, Cursor, and Gemini CLI&lt;/li&gt;
&lt;li&gt;Use the same GitHub MCP server across all your coding tools&lt;/li&gt;
&lt;li&gt;Create custom MCP servers for internal APIs and share them across the team&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Security Considerations&lt;/h2&gt;
&lt;p&gt;MCP servers can be powerful but carry security implications:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Local execution:&lt;/strong&gt; STDIO MCP servers run as local processes with your user permissions. A malicious server could access your file system, environment variables, or network.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Write operations:&lt;/strong&gt; MCP servers that support writes (database updates, file modifications, API calls) can make changes that are difficult to undo. Always review tool calls before approving, especially for unfamiliar servers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Untrusted configurations:&lt;/strong&gt; Be cautious with project-level MCP configurations in repositories you do not control. A malicious &lt;code&gt;opencode.json&lt;/code&gt; or &lt;code&gt;.codex/config.toml&lt;/code&gt; could define servers that execute harmful commands.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Authentication:&lt;/strong&gt; For remote MCP servers, use OAuth or bearer token authentication. Never embed credentials directly in configuration files that are committed to version control.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Approval flows:&lt;/strong&gt; Most AI tools prompt for approval before invoking MCP tools. Keep this enabled, especially for write operations. Some tools (like Windsurf) let you auto-approve specific tools while requiring manual review for others.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Common MCP Patterns&lt;/h2&gt;
&lt;h3&gt;The Database Query Pattern&lt;/h3&gt;
&lt;p&gt;Connect a database MCP server and let the AI query your data directly:&lt;/p&gt;
&lt;p&gt;&amp;quot;What were our top 10 customers by revenue last quarter?&amp;quot;&lt;/p&gt;
&lt;p&gt;The AI invokes the database MCP tool, runs the appropriate SQL query, and presents the results. No manual query writing required.&lt;/p&gt;
&lt;h3&gt;The Cross-System Pattern&lt;/h3&gt;
&lt;p&gt;Connect multiple MCP servers to work across systems:&lt;/p&gt;
&lt;p&gt;&amp;quot;Create a GitHub issue for the bug we found in yesterday&apos;s Sentry errors, and add it to our Jira sprint board.&amp;quot;&lt;/p&gt;
&lt;p&gt;The AI uses Sentry MCP to find the error, GitHub MCP to create the issue, and Jira MCP to add it to the sprint.&lt;/p&gt;
&lt;h3&gt;The Local Development Pattern&lt;/h3&gt;
&lt;p&gt;Connect filesystem, database, and browser MCP servers for a complete development workflow:&lt;/p&gt;
&lt;p&gt;&amp;quot;Run the test suite, check for failures, look at the database state after the failed test, and fix the issue.&amp;quot;&lt;/p&gt;
&lt;p&gt;The AI uses terminal access for tests, database MCP for state inspection, and file access for the fix.&lt;/p&gt;
&lt;h2&gt;The Future of MCP&lt;/h2&gt;
&lt;p&gt;MCP is rapidly becoming the standard interface between AI models and external systems. With adoption by OpenAI, Anthropic, Google, Microsoft, and the Linux Foundation, the protocol is likely to expand into:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;More remote server hosting:&lt;/strong&gt; Cloud-hosted MCP servers that require no local setup&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Richer authentication:&lt;/strong&gt; Enterprise SSO and role-based access for MCP connections&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Standardized approval workflows:&lt;/strong&gt; Consistent permission models across tools&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Marketplace ecosystems:&lt;/strong&gt; Cursor, Windsurf, and others are already building MCP marketplaces&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Understanding MCP now positions you to take advantage of these developments as the ecosystem matures. The AI tools that support MCP today will become more capable as the server ecosystem grows, and the MCP servers you configure today will work with the AI tools of tomorrow.&lt;/p&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about AI-assisted development, context management, and agentic workflows, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with Zed: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-zed/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-zed/</guid><description>
Zed is an open-source, GPU-accelerated code editor written in Rust. It is designed for speed and collaboration, with a built-in AI assistant that sup...</description><pubDate>Thu, 05 Mar 2026 21:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Zed is an open-source, GPU-accelerated code editor written in Rust. It is designed for speed and collaboration, with a built-in AI assistant that supports multiple LLM providers and an agent mode for autonomous multi-step development. Dremio is a unified lakehouse platform that provides business context through its semantic layer, universal data access through query federation, and interactive speed through Reflections and Apache Arrow.&lt;/p&gt;
&lt;p&gt;Connecting them gives Zed&apos;s AI agent the context it needs to write accurate Dremio SQL, generate data pipelines, and build applications against your lakehouse. Zed&apos;s performance advantage is significant for data work: its GPU-accelerated rendering handles large result sets and complex code without the lag common in Electron-based editors.&lt;/p&gt;
&lt;p&gt;Zed supports MCP through its settings, uses &lt;code&gt;AGENTS.md&lt;/code&gt; as its primary context file, and provides agent profiles for scoping tool access to specific workflows.&lt;/p&gt;
&lt;p&gt;This post covers four approaches, ordered from quickest setup to most customizable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/13/zed-dremio-architecture.png&quot; alt=&quot;Zed code editor AI assistant connecting to Dremio Agentic Lakehouse via MCP&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up Zed&lt;/h2&gt;
&lt;p&gt;If you do not already have Zed installed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Download Zed&lt;/strong&gt; from &lt;a href=&quot;https://zed.dev/&quot;&gt;zed.dev&lt;/a&gt; (available for macOS and Linux).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install it&lt;/strong&gt; by running the installer or using Homebrew: &lt;code&gt;brew install zed&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configure your AI model&lt;/strong&gt; in &lt;strong&gt;Settings &amp;gt; AI&lt;/strong&gt;. Zed supports its own hosted models, Anthropic, OpenAI, Google, and Ollama for local models.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open a project&lt;/strong&gt; by launching Zed and opening your project directory.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Zed is free and open-source under the GPL license. Its native Rust architecture makes it significantly faster than Electron-based editors, with sub-millisecond input latency and GPU-accelerated rendering.&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;Every Dremio Cloud project ships with a built-in MCP server. Zed supports MCP through its JSON settings file, where MCP servers are configured as context servers.&lt;/p&gt;
&lt;p&gt;For Claude-based tools, Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude plugin&lt;/a&gt; with guided setup. For Zed, you configure the MCP connection through &lt;code&gt;settings.json&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Find Your Project&apos;s MCP Endpoint&lt;/h3&gt;
&lt;p&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and navigate to &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;. Copy the MCP server URL.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt; and enter a name (e.g., &amp;quot;Zed MCP&amp;quot;).&lt;/li&gt;
&lt;li&gt;Add the appropriate redirect URIs.&lt;/li&gt;
&lt;li&gt;Save and copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure Zed&apos;s MCP Connection&lt;/h3&gt;
&lt;p&gt;Open Zed&apos;s settings (&lt;code&gt;Cmd+,&lt;/code&gt;) and add the MCP server configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;context_servers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;command&amp;quot;: {
        &amp;quot;path&amp;quot;: &amp;quot;npx&amp;quot;,
        &amp;quot;args&amp;quot;: [
          &amp;quot;-y&amp;quot;,
          &amp;quot;@dremio/mcp-client&amp;quot;,
          &amp;quot;--url&amp;quot;,
          &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;
        ]
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For project-level configuration, create a &lt;code&gt;.zed/settings.json&lt;/code&gt; file in your project root with the same structure.&lt;/p&gt;
&lt;p&gt;Zed&apos;s AI agent now has access to Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; returns available tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column definitions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls catalog descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetTableOrViewLineage&lt;/strong&gt; shows data lineage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns results.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test by opening the agent panel and asking: &amp;quot;What tables are available in Dremio?&amp;quot;&lt;/p&gt;
&lt;h3&gt;Self-Hosted Alternative&lt;/h3&gt;
&lt;p&gt;For Dremio Software deployments, use the dremio-mcp server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;context_servers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;command&amp;quot;: {
        &amp;quot;path&amp;quot;: &amp;quot;uv&amp;quot;,
        &amp;quot;args&amp;quot;: [
          &amp;quot;run&amp;quot;,
          &amp;quot;--directory&amp;quot;,
          &amp;quot;/path/to/dremio-mcp&amp;quot;,
          &amp;quot;dremio-mcp-server&amp;quot;,
          &amp;quot;run&amp;quot;
        ]
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Approach 2: Use AGENTS.md for Dremio Context&lt;/h2&gt;
&lt;p&gt;Zed uses &lt;code&gt;AGENTS.md&lt;/code&gt; as its primary context file. Place it in your project root and reference it in agent conversations with &lt;code&gt;@agents.md&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Writing a Dremio Context File&lt;/h3&gt;
&lt;p&gt;Create &lt;code&gt;AGENTS.md&lt;/code&gt; in your project root:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Dremio Project Context

## SQL Conventions

- Use CREATE FOLDER IF NOT EXISTS (not CREATE NAMESPACE or CREATE SCHEMA)
- Tables in the Open Catalog use folder.subfolder.table_name
- External federated sources use source_name.schema.table_name
- Cast DATE to TIMESTAMP for consistent joins
- Use TIMESTAMPDIFF for duration calculations

## Credentials

- Never hardcode Personal Access Tokens. Use environment variable: DREMIO_PAT
- Cloud endpoint: environment variable DREMIO_URI

## Terminology

- Call it &amp;quot;Agentic Lakehouse&amp;quot;, not &amp;quot;data warehouse&amp;quot;
- &amp;quot;Reflections&amp;quot; are pre-computed optimizations, not &amp;quot;materialized views&amp;quot;

## Reference

- SQL syntax: ./docs/dremio-sql-reference.md
- Python SDK: ./docs/dremioframe-patterns.md
- Table schemas: ./docs/table-schemas.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When starting a new agent session, type &lt;code&gt;@agents.md&lt;/code&gt; to load the context. Zed will include the file contents in the agent&apos;s working context.&lt;/p&gt;
&lt;h3&gt;Agent Profiles&lt;/h3&gt;
&lt;p&gt;Zed supports agent profiles for controlling which tools are available. Create a &amp;quot;Dremio Data&amp;quot; profile that enables MCP tools and file editing while restricting terminal access:&lt;/p&gt;
&lt;p&gt;In &lt;strong&gt;Settings &amp;gt; AI &amp;gt; Profiles&lt;/strong&gt;, create a profile with specific tool permissions. This is useful for separating data exploration (read-only MCP queries) from development work (full tool access).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/13/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official plugin&lt;/a&gt; for Claude Code users and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; is an official Dremio product. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository provides knowledge files:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cd dremio-agent-skill
./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Copy the knowledge directory into your project. Reference it in your &lt;code&gt;AGENTS.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;For Dremio conventions, read the knowledge files in ./dremio-skill/knowledge/.
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;dremio-agent-md (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository provides documentation sitemaps:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference it in &lt;code&gt;AGENTS.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;For Dremio SQL validation, read DREMIO_AGENT.md in ./dremio-agent-md/.
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Approach 4: Build Your Own AGENTS.md Context&lt;/h2&gt;
&lt;p&gt;Create a comprehensive context file tailored to your team:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Team Dremio Data Context

## Environment

- Lakehouse: Dremio Cloud
- Catalog: Apache Polaris-based Open Catalog
- Architecture: Medallion (bronze → silver → gold)

## Table Schemas (updated weekly)

For exact column definitions, read ./docs/table-schemas.md

## SQL Standards

- Bronze: raw._, Silver: cleaned._, Gold: analytics.\*
- Always use TIMESTAMP, never DATE
- Validate functions against ./docs/dremio-sql-reference.md

## Common Queries

For frequently used patterns, read ./docs/common-queries.md

## Python SDK

- Use dremioframe for all Dremio connections
- Patterns: read ./docs/dremioframe-patterns.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Zed&apos;s fast file loading means referencing external docs adds negligible latency. Keep the &lt;code&gt;AGENTS.md&lt;/code&gt; concise and point to detailed reference files.&lt;/p&gt;
&lt;h2&gt;Using Dremio with Zed: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, Zed&apos;s AI agent can execute complete data projects with the speed advantage of a native editor.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;Open the agent panel and ask:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What were our top 10 products by revenue last quarter? Show growth rates and regional breakdown.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Zed&apos;s agent uses MCP to discover tables, writes SQL, and returns results. The GPU-accelerated rendering handles large result tables without lag.&lt;/p&gt;
&lt;p&gt;Follow up:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;For products with negative growth, show the correlation between customer complaints and revenue decline over the last 6 months.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The agent maintains context and generates multi-table analytical queries.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Ask the agent to create a dashboard:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query Dremio gold-layer views for revenue metrics and build an HTML dashboard with Plotly.js. Include monthly trends, regional heatmap, and top customer charts. Add a dark theme, date filters, and export buttons.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The agent generates the complete dashboard across multiple files. Zed&apos;s multi-buffer editing lets you see all generated files side-by-side without performance degradation.&lt;/p&gt;
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Build interactive tools:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a Streamlit app connected to Dremio via dremioframe. Include schema browsing, data preview, SQL editor, and CSV download. Generate all files.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The agent generates the full application. Zed&apos;s speed makes iterating on the generated code feel instantaneous.&lt;/p&gt;
&lt;h3&gt;Generate Data Pipeline Scripts&lt;/h3&gt;
&lt;p&gt;Automate data engineering:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Write a Medallion pipeline using dremioframe. Bronze ingestion, silver cleaning with deduplication and validation, gold aggregations with business metrics. Include logging and dry-run mode.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The agent writes the pipeline following your &lt;code&gt;AGENTS.md&lt;/code&gt; conventions.&lt;/p&gt;
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Create backend services:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a FastAPI app serving Dremio gold-layer data. Add endpoints for analytics, customer segments, and product performance. Include Pydantic models and OpenAPI docs.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The agent generates the complete API server.&lt;/p&gt;
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog exploration&lt;/td&gt;
&lt;td&gt;Data analysis, SQL generation, real-time access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AGENTS.md&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, reference file pointers&lt;/td&gt;
&lt;td&gt;Teams that want speed + context control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skills&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Comprehensive Dremio knowledge (CLI, SDK, SQL, API)&lt;/td&gt;
&lt;td&gt;Quick start with broad coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Context&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Tailored schemas, profiles, and team conventions&lt;/td&gt;
&lt;td&gt;Mature teams with specific workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Start with the MCP server for live data access. Add &lt;code&gt;AGENTS.md&lt;/code&gt; with conventions and reference file pointers. Use agent profiles to scope tool access for different workflows.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for a free Dremio Cloud trial&lt;/a&gt; (30 days, $400 in compute credits).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint in &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add it to Zed&apos;s &lt;code&gt;settings.json&lt;/code&gt; under &lt;code&gt;context_servers&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Create &lt;code&gt;AGENTS.md&lt;/code&gt; with your Dremio conventions.&lt;/li&gt;
&lt;li&gt;Open the agent panel and ask it to explore your Dremio catalog.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse gives Zed&apos;s agent accurate data context, and Zed&apos;s native performance makes data exploration and code generation feel effortless.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, check out the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or enroll in the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with Windsurf: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-windsurf/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-windsurf/</guid><description>
Windsurf is an AI-native code editor built as a fork of VS Code. Its standout feature is Cascade, an agentic AI system that plans and executes multi-...</description><pubDate>Thu, 05 Mar 2026 20:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Windsurf is an AI-native code editor built as a fork of VS Code. Its standout feature is Cascade, an agentic AI system that plans and executes multi-step coding tasks autonomously. Cascade understands your entire codebase, can chain together multiple file edits, terminal commands, and tool calls in a single flow. Dremio is a unified lakehouse platform that provides business context through its semantic layer, universal data access through query federation, and interactive speed through Reflections and Apache Arrow.&lt;/p&gt;
&lt;p&gt;Connecting them gives Cascade the context it needs to write accurate Dremio SQL, generate data pipelines, and build applications against your lakehouse. Without this connection, Cascade treats Dremio like a generic database. With it, the agent knows your schemas, business logic encoded in views, and the correct Dremio SQL dialect.&lt;/p&gt;
&lt;p&gt;Windsurf&apos;s Cascade is especially well-suited for data projects because it can chain together discovery, querying, code generation, and testing in a single autonomous flow. Ask it to explore your Dremio catalog, identify relevant tables, write a pipeline, and generate tests : all in one prompt.&lt;/p&gt;
&lt;p&gt;This post covers four approaches, ordered from quickest setup to most customizable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/12/windsurf-dremio-architecture.png&quot; alt=&quot;Windsurf AI editor with Cascade agent connecting to Dremio Agentic Lakehouse via MCP&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up Windsurf&lt;/h2&gt;
&lt;p&gt;If you do not already have Windsurf installed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Download Windsurf&lt;/strong&gt; from &lt;a href=&quot;https://windsurf.com/&quot;&gt;windsurf.com&lt;/a&gt; (available for macOS, Linux, and Windows).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install it&lt;/strong&gt; by running the installer. Windsurf is a VS Code fork, so all VS Code extensions and themes work.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sign in&lt;/strong&gt; with a Windsurf account. The free tier includes limited Cascade credits; Pro provides expanded access.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open a project&lt;/strong&gt; by selecting File &amp;gt; Open Folder and pointing to your project directory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open Cascade&lt;/strong&gt; by pressing &lt;code&gt;Cmd+L&lt;/code&gt; (macOS) or &lt;code&gt;Ctrl+L&lt;/code&gt; to access the agentic AI chat panel.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you are migrating from VS Code or Cursor, your existing extensions and settings transfer automatically.&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;The Model Context Protocol (MCP) is an open standard that lets AI tools call external services. Every Dremio Cloud project ships with a built-in MCP server. Windsurf supports MCP natively through its Cascade settings.&lt;/p&gt;
&lt;p&gt;For Claude-based tools, Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude plugin&lt;/a&gt; with guided setup. For Windsurf, you configure the MCP connection through the Cascade settings or &lt;code&gt;mcp_config.json&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Find Your Project&apos;s MCP Endpoint&lt;/h3&gt;
&lt;p&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and open your project. Navigate to &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;. The MCP server URL is listed on the project overview page. Copy it.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt; and enter a name (e.g., &amp;quot;Windsurf MCP&amp;quot;).&lt;/li&gt;
&lt;li&gt;Add the appropriate redirect URIs.&lt;/li&gt;
&lt;li&gt;Save and copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure Windsurf&apos;s MCP Connection&lt;/h3&gt;
&lt;p&gt;You have two options:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Option A: Via Settings UI&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Open Windsurf Settings and navigate to &lt;strong&gt;Cascade &amp;gt; MCP&lt;/strong&gt;. Click &lt;strong&gt;Add custom server&lt;/strong&gt; and paste your Dremio MCP configuration.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Option B: Via mcp_config.json&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Create or edit &lt;code&gt;~/.codeium/windsurf/mcp_config.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;url&amp;quot;: &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Restart Windsurf. Cascade now has access to Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; returns available tables with descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column names, types, and metadata.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls wiki descriptions from the catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetTableOrViewLineage&lt;/strong&gt; shows upstream dependencies.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns results as JSON.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test the connection by opening Cascade and asking: &amp;quot;What tables are available in Dremio?&amp;quot; Cascade will call &lt;code&gt;GetUsefulSystemTableNames&lt;/code&gt; and return your catalog contents.&lt;/p&gt;
&lt;h3&gt;Self-Hosted Alternative&lt;/h3&gt;
&lt;p&gt;For Dremio Software deployments, use the open-source &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;dremio-mcp&lt;/a&gt; server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/dremio/dremio-mcp
cd dremio-mcp
uv run dremio-mcp-server config create dremioai \
  --uri https://your-dremio-instance.com \
  --pat YOUR_PERSONAL_ACCESS_TOKEN
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In &lt;code&gt;mcp_config.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
      &amp;quot;args&amp;quot;: [
        &amp;quot;run&amp;quot;,
        &amp;quot;--directory&amp;quot;,
        &amp;quot;/path/to/dremio-mcp&amp;quot;,
        &amp;quot;dremio-mcp-server&amp;quot;,
        &amp;quot;run&amp;quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Approach 2: Use Windsurf Rules for Dremio Context&lt;/h2&gt;
&lt;p&gt;Windsurf supports &lt;code&gt;.windsurfrules&lt;/code&gt; files in your project root for persistent AI instructions. These work similarly to &lt;code&gt;.cursorrules&lt;/code&gt; and are loaded into every Cascade interaction.&lt;/p&gt;
&lt;h3&gt;Project-Wide Rules&lt;/h3&gt;
&lt;p&gt;Create a &lt;code&gt;.windsurfrules&lt;/code&gt; file in your project root:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Dremio SQL Conventions

- Use CREATE FOLDER IF NOT EXISTS (not CREATE NAMESPACE or CREATE SCHEMA)
- Tables in the Open Catalog use folder.subfolder.table_name without a catalog prefix
- External federated sources use source_name.schema.table_name
- Cast DATE to TIMESTAMP for consistent joins
- Use TIMESTAMPDIFF for duration calculations

# Credentials

- Never hardcode Personal Access Tokens. Use environment variable: DREMIO_PAT
- Dremio Cloud endpoint: environment variable DREMIO_URI

# Terminology

- Call it &amp;quot;Agentic Lakehouse&amp;quot;, not &amp;quot;data warehouse&amp;quot;
- &amp;quot;Reflections&amp;quot; are pre-computed optimizations, not &amp;quot;materialized views&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Windsurf also reads &lt;code&gt;.cursorrules&lt;/code&gt; as a fallback if no &lt;code&gt;.windsurfrules&lt;/code&gt; file is present, so if your team uses Cursor alongside Windsurf, shared rules files work across both editors.&lt;/p&gt;
&lt;h3&gt;Cascade Memory and Context&lt;/h3&gt;
&lt;p&gt;Cascade has a persistent memory system. As you work with Dremio tables, Cascade remembers the schemas, query patterns, and conventions it has encountered. This means subsequent requests in the same project get more accurate over time without needing to re-read context files.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/12/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official plugin&lt;/a&gt; for Claude Code users and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; is an official Dremio product. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository contains a comprehensive skill directory with knowledge files and a &lt;code&gt;.cursorrules&lt;/code&gt; file that Windsurf reads as a fallback.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cd dremio-agent-skill
./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Choose &lt;strong&gt;Local Project Install (Copy)&lt;/strong&gt; to copy the &lt;code&gt;.cursorrules&lt;/code&gt; file and knowledge directory into your project. Windsurf will pick up the rules file automatically.&lt;/p&gt;
&lt;h3&gt;dremio-agent-md (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository provides a master protocol file and browsable documentation sitemaps:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference it in your &lt;code&gt;.windsurfrules&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;For Dremio SQL validation, read DREMIO_AGENT.md in the dremio-agent-md directory.
Use the sitemaps in dremio_sitemaps/ to verify syntax before generating SQL.
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Approach 4: Build Your Own Windsurf Rules&lt;/h2&gt;
&lt;p&gt;Create a custom &lt;code&gt;.windsurfrules&lt;/code&gt; with your team&apos;s specific Dremio environment:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Team Dremio Context

## Table Schemas (updated weekly)

- For table schemas, read ./docs/table-schemas.md
- For SQL conventions, read ./docs/dremio-conventions.md
- For common queries, read ./docs/common-queries.md

## Naming Standards

- Bronze: raw._, Silver: cleaned._, Gold: analytics.\*
- Always use TIMESTAMP, never DATE
- Validate function names against docs/dremio-conventions.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Export your actual schemas from Dremio and keep them updated. Cascade&apos;s memory system means it learns your patterns over time, but explicit rules ensure consistency from the first interaction.&lt;/p&gt;
&lt;h2&gt;Using Dremio with Windsurf: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, Cascade can execute complex multi-step data projects autonomously. Here are detailed examples.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;Open Cascade and ask questions in plain English:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What were our top 10 products by revenue last quarter? Break it down by region and show the growth rate compared to the previous quarter.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cascade uses MCP to discover your tables, writes the SQL, runs it against Dremio, and returns formatted results. Its multi-step nature means it can chain multiple queries together autonomously.&lt;/p&gt;
&lt;p&gt;Follow up immediately:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;For products with declining growth, pull the customer reviews and support tickets. Is there a pattern between product issues and revenue decline?&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cascade maintains full context and chains together cross-table queries without prompting. This turns the editor into a data analysis workstation.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Use Cascade for multi-step project generation:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query our gold-layer sales views in Dremio and build a local HTML dashboard with Chart.js. Include monthly revenue trends, top products by region, and customer metrics. Add date range filters, a dark theme, and export buttons. Create separate HTML, CSS, and JavaScript files.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cascade will autonomously:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Call MCP to discover gold-layer views and schemas&lt;/li&gt;
&lt;li&gt;Write and execute SQL queries for each metric&lt;/li&gt;
&lt;li&gt;Create &lt;code&gt;index.html&lt;/code&gt; with the dashboard layout&lt;/li&gt;
&lt;li&gt;Create &lt;code&gt;styles.css&lt;/code&gt; with dark theme and responsive design&lt;/li&gt;
&lt;li&gt;Create &lt;code&gt;dashboard.js&lt;/code&gt; with Chart.js configurations&lt;/li&gt;
&lt;li&gt;Wire everything together and save to your project&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Open &lt;code&gt;index.html&lt;/code&gt; in a browser for a complete interactive dashboard. Cascade&apos;s agentic flow handles the entire process without manual intervention.&lt;/p&gt;
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Build interactive tools in one prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a Streamlit app connected to Dremio via dremioframe. Include a schema browser with table previews, a SQL editor with syntax highlighting, CSV download, and charting for numeric columns. Generate all the files including requirements.txt and a README.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cascade generates the full application stack and wires the components together. Run &lt;code&gt;streamlit run app.py&lt;/code&gt; for a local data explorer.&lt;/p&gt;
&lt;h3&gt;Generate Data Pipeline Scripts&lt;/h3&gt;
&lt;p&gt;Automate data engineering:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Write a Medallion Architecture pipeline using dremioframe. Bronze: ingest raw events from S3. Silver: deduplicate, validate required fields, cast timestamps. Gold: aggregate daily metrics and build customer lifetime value calculations. Include structured logging, retry logic, and dry-run mode.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cascade writes the pipeline code, creates test files, and can execute a dry run to verify the logic against your live Dremio instance.&lt;/p&gt;
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Create backend services:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a FastAPI app that queries Dremio gold-layer views via dremioframe. Add endpoints for customer segments, revenue analytics, and cohort retention. Include Pydantic models, caching, and OpenAPI docs.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cascade generates the complete API project. Run &lt;code&gt;uvicorn main:app --reload&lt;/code&gt; for a local API connected to your lakehouse.&lt;/p&gt;
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog exploration&lt;/td&gt;
&lt;td&gt;Data analysis, SQL generation, real-time access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Windsurf Rules&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, persistent AI instructions&lt;/td&gt;
&lt;td&gt;Teams with specific SQL standards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skills&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Comprehensive Dremio knowledge (CLI, SDK, SQL, API)&lt;/td&gt;
&lt;td&gt;Quick start with broad coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Rules&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Tailored schemas, patterns, and team conventions&lt;/td&gt;
&lt;td&gt;Mature teams with project-specific needs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Start with the MCP server for immediate value. Add a &lt;code&gt;.windsurfrules&lt;/code&gt; file for Dremio conventions. Let Cascade&apos;s memory build on your patterns over time.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for a free Dremio Cloud trial&lt;/a&gt; (30 days, $400 in compute credits).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint in &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add it in Windsurf&apos;s &lt;strong&gt;Cascade &amp;gt; MCP&lt;/strong&gt; settings or &lt;code&gt;mcp_config.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Clone &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; and run &lt;code&gt;./install.sh&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Open Cascade and ask it to explore your Dremio catalog.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse gives Cascade accurate data context, and Cascade&apos;s multi-step autonomous flows turn that context into complete data projects.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, check out the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or enroll in the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with OpenWork: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-openwork/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-openwork/</guid><description>
OpenWork is an open-source desktop AI agent built on the OpenCode engine. It runs entirely on your machine with your own API keys, giving you full co...</description><pubDate>Thu, 05 Mar 2026 19:00:00 GMT</pubDate><content:encoded>&lt;p&gt;OpenWork is an open-source desktop AI agent built on the OpenCode engine. It runs entirely on your machine with your own API keys, giving you full control over your data and your AI costs. Dremio is a unified lakehouse platform built on open standards like Apache Iceberg, Apache Arrow, and Apache Polaris.&lt;/p&gt;
&lt;p&gt;Both tools share a local-first philosophy. Dremio stores data in open formats with no vendor lock-in. OpenWork runs on your hardware with no cloud dependency for the agent itself. Connecting them creates an open-source analytics stack where your coding agent queries your lakehouse without sending data through third-party services.&lt;/p&gt;
&lt;p&gt;OpenWork inherits OpenCode&apos;s &lt;code&gt;AGENTS.md&lt;/code&gt; support, &lt;code&gt;opencode.json&lt;/code&gt; configuration, and MCP integration. If you have already written Dremio context files for OpenCode or OpenAI Codex, they work in OpenWork without modification. The desktop application adds a graphical interface, integrated file browser, and agent chat panel on top of the terminal experience.&lt;/p&gt;
&lt;p&gt;The local-first model has specific advantages for data work. Your Dremio queries and results stay on your machine. Your API keys are stored locally. The agent code runs in your environment. For teams that handle sensitive data or operate under compliance constraints, this architecture keeps the AI agent within your security perimeter.&lt;/p&gt;
&lt;p&gt;This post covers four approaches, from a five-minute MCP connection to a fully custom Dremio configuration.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/11/openwork-dremio-architecture.png&quot; alt=&quot;OpenWork desktop AI assistant connecting to Dremio Agentic Lakehouse&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up OpenWork&lt;/h2&gt;
&lt;p&gt;If you do not already have OpenWork installed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Download OpenWork&lt;/strong&gt; from &lt;a href=&quot;https://openwork.software&quot;&gt;openwork.software&lt;/a&gt; (available for macOS, Linux, and Windows).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install it&lt;/strong&gt; by following the platform-specific instructions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configure your AI model&lt;/strong&gt; by adding your API key (OpenAI, Anthropic, or another supported provider) in the application settings.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open a project&lt;/strong&gt; by selecting your project directory in the OpenWork file browser.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;OpenWork is built on the OpenCode engine but provides a desktop GUI with an integrated file browser, agent chat panel, and visual output display. It runs entirely on your machine with your own API keys, giving you full control over costs and data privacy.&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;Every Dremio Cloud project includes a built-in MCP server. OpenWork supports MCP through its inherited &lt;code&gt;opencode.json&lt;/code&gt; configuration.&lt;/p&gt;
&lt;p&gt;For Claude-based tools, Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude plugin&lt;/a&gt; with guided setup. For OpenWork, you configure the MCP connection through &lt;code&gt;opencode.json&lt;/code&gt;:&lt;/p&gt;
&lt;h3&gt;Find Your MCP Endpoint and Set Up OAuth&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and find your MCP URL under &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Create a new application with an appropriate redirect URI.&lt;/li&gt;
&lt;li&gt;Copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure OpenWork&apos;s MCP Connection&lt;/h3&gt;
&lt;p&gt;Add the Dremio server to your &lt;code&gt;opencode.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcp&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;url&amp;quot;: &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;,
      &amp;quot;auth&amp;quot;: {
        &amp;quot;type&amp;quot;: &amp;quot;oauth&amp;quot;,
        &amp;quot;clientId&amp;quot;: &amp;quot;YOUR_CLIENT_ID&amp;quot;
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Place this at your project root or globally at &lt;code&gt;~/.config/opencode/opencode.json&lt;/code&gt;. After configuration, OpenWork can call Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; returns available tables with descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column details and metadata.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls wiki descriptions from the catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns JSON results.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Self-Hosted MCP&lt;/h3&gt;
&lt;p&gt;For Dremio Software deployments, use the open-source &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;dremio-mcp&lt;/a&gt; server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/dremio/dremio-mcp
cd dremio-mcp
uv run dremio-mcp-server config create dremioai \
  --uri https://your-dremio-instance.com \
  --pat YOUR_PERSONAL_ACCESS_TOKEN
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Configure OpenWork to run the local server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcp&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
      &amp;quot;args&amp;quot;: [
        &amp;quot;run&amp;quot;,
        &amp;quot;--directory&amp;quot;,
        &amp;quot;/path/to/dremio-mcp&amp;quot;,
        &amp;quot;dremio-mcp-server&amp;quot;,
        &amp;quot;run&amp;quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The server supports three modes: &lt;code&gt;FOR_DATA_PATTERNS&lt;/code&gt; (data exploration, default), &lt;code&gt;FOR_SELF&lt;/code&gt; (system introspection for diagnosing performance), and &lt;code&gt;FOR_PROMETHEUS&lt;/code&gt; (metrics correlation). The local-first nature of OpenWork pairs well with the self-hosted MCP option, as both components run entirely on your infrastructure.&lt;/p&gt;
&lt;h2&gt;Approach 2: Use AGENTS.md for Dremio Context&lt;/h2&gt;
&lt;p&gt;OpenWork inherits &lt;code&gt;AGENTS.md&lt;/code&gt; support from OpenCode. The same file works in OpenWork, OpenCode, and OpenAI Codex.&lt;/p&gt;
&lt;h3&gt;Writing a Dremio-Focused AGENTS.md&lt;/h3&gt;
&lt;p&gt;Create &lt;code&gt;AGENTS.md&lt;/code&gt; in your project root:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Agent Configuration

## Dremio Lakehouse

This project uses Dremio Cloud as its lakehouse.

### SQL Conventions

- Use `CREATE FOLDER IF NOT EXISTS` (not CREATE NAMESPACE)
- Open Catalog tables: `folder.subfolder.table_name` (no catalog prefix)
- External sources: `source_name.schema.table_name`
- Cast DATE to TIMESTAMP for join consistency
- Use TIMESTAMPDIFF for duration calculations

### Credentials

- PAT: env var `DREMIO_PAT`
- Endpoint: env var `DREMIO_URI`
- Never hardcode credentials

### References

- SQL reference: https://docs.dremio.com/current/reference/sql/
- REST API: https://docs.dremio.com/current/reference/api/
- Local SQL docs: ./docs/dremio-sql-reference.md

### Terminology

- &amp;quot;Agentic Lakehouse&amp;quot; not &amp;quot;data warehouse&amp;quot;
- &amp;quot;Reflections&amp;quot; not &amp;quot;materialized views&amp;quot;
- &amp;quot;Open Catalog&amp;quot; built on Apache Polaris
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;OpenWork auto-scans this file at project start. Global defaults go in &lt;code&gt;~/.config/opencode/AGENTS.md&lt;/code&gt; and project-level files override them.&lt;/p&gt;
&lt;h3&gt;Cross-Tool Portability&lt;/h3&gt;
&lt;p&gt;The AGENTS.md you write for OpenWork works identically in OpenCode and OpenAI Codex. If your team uses multiple tools, you maintain one Dremio configuration file instead of separate context files for each tool.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/11/four-integration-approaches.png&quot; alt=&quot;Four integration approaches for connecting AI tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official plugin&lt;/a&gt; for Claude Code users and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; is an official Dremio product. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository provides a complete skill directory with &lt;code&gt;SKILL.md&lt;/code&gt;, knowledge files (CLI, Python SDK, SQL, REST API), and &lt;code&gt;AGENTS.md&lt;/code&gt; in the &lt;code&gt;rules/&lt;/code&gt; directory.&lt;/p&gt;
&lt;p&gt;For OpenWork, copy the AGENTS.md:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cp dremio-agent-skill/dremio-skill/rules/AGENTS.md ./AGENTS.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or run the full installer for broader integration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cd dremio-agent-skill &amp;amp;&amp;amp; ./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;dremio-agent-md (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository provides &lt;code&gt;DREMIO_AGENT.md&lt;/code&gt; and documentation sitemaps. Clone it alongside your project:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tell OpenWork: &amp;quot;Read DREMIO_AGENT.md in the dremio-agent-md directory and use the sitemaps to validate SQL.&amp;quot; OpenWork&apos;s desktop interface makes it easy to have the agent-md folder open in the file browser while working on your project.&lt;/p&gt;
&lt;h2&gt;Approach 4: Build a Custom Dremio Configuration&lt;/h2&gt;
&lt;h3&gt;Custom AGENTS.md with Knowledge Files&lt;/h3&gt;
&lt;p&gt;Create a project structure with reference docs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;project-root/
  AGENTS.md
  docs/
    dremio-sql-reference.md
    team-schemas.md
    dremioframe-patterns.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference the docs in your AGENTS.md so OpenWork reads them on demand. Populate with your actual table schemas exported from Dremio, team-specific SQL patterns, and dremioframe code snippets.&lt;/p&gt;
&lt;h3&gt;Custom Agents&lt;/h3&gt;
&lt;p&gt;OpenWork inherits OpenCode&apos;s custom agent system. Create dedicated Dremio agents in &lt;code&gt;.opencode/agents/&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# .opencode/agents/dremio-analyst.md

---

description: Dremio data analyst agent
mode: subagent

---

You are a data analyst working with Dremio Cloud.

1. Use the MCP connection to explore tables
2. Follow Dremio SQL conventions (CREATE FOLDER IF NOT EXISTS, etc.)
3. Validate function names against the SQL reference
4. Never hardcode credentials
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This subagent uses a separate model and context window dedicated to Dremio tasks, producing higher-quality SQL than a general-purpose agent.&lt;/p&gt;
&lt;h2&gt;Using Dremio with OpenWork: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, OpenWork can generate complete data applications. Here are detailed examples.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;Type a question in the OpenWork chat panel and get answers from your lakehouse:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What is the average order value by customer segment for Q4? Which segment grew the fastest compared to Q3?&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenWork queries Dremio through MCP, computes the comparison, and returns a formatted answer with the SQL it ran. This turns your desktop agent into a local, private data analyst that works with production data.&lt;/p&gt;
&lt;p&gt;Follow up with deeper analysis:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;For the fastest-growing segment, show the top 10 customers by order frequency. Are they new customers or returning? Pull their first order date and total lifetime value.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenWork maintains context from the previous question and writes progressively more complex queries. Because everything runs locally, your data never leaves your machine.&lt;/p&gt;
&lt;p&gt;This pattern is especially powerful for teams with data sovereignty requirements. The AI model processes your prompt, but the data stays on your infrastructure.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Ask OpenWork to create a self-contained dashboard:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query our gold-layer views in Dremio for monthly revenue, active users, and churn rate over the last 12 months. Build an HTML dashboard with Plotly.js charts. Include filters for region and product line. Add a dark theme and export-to-PNG buttons.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenWork will:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use MCP to discover gold-layer views and their schemas&lt;/li&gt;
&lt;li&gt;Write and execute SQL queries for each metric&lt;/li&gt;
&lt;li&gt;Generate an HTML file with Plotly.js interactive charts&lt;/li&gt;
&lt;li&gt;Add dropdown filters for region and product line&lt;/li&gt;
&lt;li&gt;Include export functionality and responsive layout&lt;/li&gt;
&lt;li&gt;Save everything to your project folder&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Open it in a browser for a fully interactive dashboard running from a local file. No server required. The Plotly.js charts support zoom, pan, and hover tooltips.&lt;/p&gt;
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Build a more sophisticated tool:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a Streamlit app that connects to Dremio using dremioframe. Add a sidebar for selecting schemas and tables, a schema viewer, a data preview with pagination, a custom SQL query editor with results displayed as a table, and CSV download buttons.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenWork writes the full Python application:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;app.py&lt;/code&gt; with Streamlit layout, dremioframe connection, and query execution&lt;/li&gt;
&lt;li&gt;&lt;code&gt;requirements.txt&lt;/code&gt; with pinned dependencies&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.env.example&lt;/code&gt; with required environment variables&lt;/li&gt;
&lt;li&gt;&lt;code&gt;README.md&lt;/code&gt; with setup instructions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Run &lt;code&gt;streamlit run app.py&lt;/code&gt; and you have a local data exploration tool connected to your lakehouse. Since both OpenWork and the app run on your machine, your data never leaves your infrastructure.&lt;/p&gt;
&lt;h3&gt;Generate Data Pipeline Scripts&lt;/h3&gt;
&lt;p&gt;Automate your ETL workflows:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Write a Python script using dremioframe that reads raw CSV data from S3, creates a bronze table in Dremio, builds silver views with data quality rules (null checks, type validation, deduplication), and creates a gold view with business logic aggregations. Include error handling, logging, and a dry-run mode.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenWork uses the Dremio skill knowledge to write pipeline code that follows your team&apos;s Medallion Architecture conventions. The script includes structured logging, retry logic, and a summary report at the end.&lt;/p&gt;
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Serve lakehouse data to other applications:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a FastAPI application that connects to Dremio using dremioframe. Create endpoints for device metrics, alert summaries, and historical trends. Include request validation, response caching with a 5-minute TTL, and auto-generated API docs.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenWork generates the complete server with proper error handling and connection management. Run &lt;code&gt;uvicorn main:app --reload&lt;/code&gt; for a local API connected to your lakehouse.&lt;/p&gt;
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog&lt;/td&gt;
&lt;td&gt;NL data exploration, building apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AGENTS.md&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, cross-tool portable&lt;/td&gt;
&lt;td&gt;Multi-tool teams&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skills&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Broad Dremio knowledge&lt;/td&gt;
&lt;td&gt;Quick start&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Config&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Tailored agents, schemas, patterns&lt;/td&gt;
&lt;td&gt;Advanced multi-agent workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;OpenWork&apos;s advantage is the local-first model. Your agent, your API keys, your data connections all run on your machine. Combined with Dremio&apos;s open lakehouse formats, you get a fully controlled analytics stack.&lt;/p&gt;
&lt;p&gt;Start with the MCP server for immediate access to your data. Layer in AGENTS.md for conventions and custom agents for specialized Dremio workflows. If your team already uses OpenCode or Codex, your existing AGENTS.md and MCP configuration work in OpenWork immediately.&lt;/p&gt;
&lt;p&gt;The local-first model means you can evaluate OpenWork with Dremio without any organizational approval process. Install it on your machine, connect it to your Dremio Cloud project, and start querying. If it works for you, share the &lt;code&gt;AGENTS.md&lt;/code&gt; and &lt;code&gt;opencode.json&lt;/code&gt; files with your team so they can replicate the same setup on their machines.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for Dremio Cloud free for 30 days&lt;/a&gt; ($400 in compute credits).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint in &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add it to your &lt;code&gt;opencode.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Clone &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; and copy the AGENTS.md.&lt;/li&gt;
&lt;li&gt;Ask OpenWork to explore your catalog and build a local dashboard from your data.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse provides what OpenWork needs: the semantic layer for business context, query federation for universal data access, and Reflections for interactive speed. Both platforms embrace open standards and local-first operation, making them a natural fit for teams that prioritize data sovereignty and transparency.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, see the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or enroll in the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with OpenCode: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-opencode/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-opencode/</guid><description>
OpenCode is an open-source, terminal-based AI coding agent released under the MIT license. It provides a TUI with split panes, uses the Language Serv...</description><pubDate>Thu, 05 Mar 2026 18:00:00 GMT</pubDate><content:encoded>&lt;p&gt;OpenCode is an open-source, terminal-based AI coding agent released under the MIT license. It provides a TUI with split panes, uses the Language Server Protocol (LSP) for deep codebase understanding, and maintains persistent project context through file-based memory. Dremio is a unified lakehouse platform built on open standards like Apache Iceberg, Apache Arrow, and Apache Polaris.&lt;/p&gt;
&lt;p&gt;The open-source philosophy aligns. Dremio stores data in open formats with no vendor lock-in. OpenCode gives you full control over your AI coding agent with no proprietary restrictions. Connecting them means your open-source agent can query an open lakehouse, validate SQL against real schemas, and generate scripts using your team&apos;s actual conventions.&lt;/p&gt;
&lt;p&gt;OpenCode uses the same &lt;code&gt;AGENTS.md&lt;/code&gt; standard as OpenAI Codex, so the Dremio context files you write work across both tools. It also supports custom agents with dedicated prompts and model configurations, which opens up a Dremio-specific agent pattern that other tools do not offer. You can create a dedicated data analyst subagent that uses a reasoning model for SQL generation while your primary agent uses a faster model for application code.&lt;/p&gt;
&lt;p&gt;OpenCode&apos;s LSP integration gives it another advantage. The agent analyzes imports, dependencies, and file structure at the language level. When you combine this with Dremio&apos;s MCP server, the agent understands both your code structure and your data structure simultaneously.&lt;/p&gt;
&lt;p&gt;This post covers four approaches, ordered from quickest to most customizable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/10/opencode-dremio-architecture.png&quot; alt=&quot;OpenCode TUI connecting to Dremio Agentic Lakehouse via MCP&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up OpenCode&lt;/h2&gt;
&lt;p&gt;If you do not already have OpenCode installed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Install Go&lt;/strong&gt; (version 1.23 or later) from &lt;a href=&quot;https://go.dev/dl/&quot;&gt;go.dev&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install OpenCode&lt;/strong&gt;:&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;go install github.com/opencode-ai/opencode@latest
&lt;/code&gt;&lt;/pre&gt;
Or use Homebrew: &lt;code&gt;brew install opencode&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configure your AI model&lt;/strong&gt; by setting the &lt;code&gt;OPENAI_API_KEY&lt;/code&gt;, &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;, or other model provider key in your environment.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Launch OpenCode&lt;/strong&gt; by running &lt;code&gt;opencode&lt;/code&gt; in your terminal from any project directory.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;OpenCode provides a TUI with split panes, LSP-powered code understanding, and a multi-agent architecture that lets you define specialized subagents for different tasks. It is open-source under the MIT license.&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;Every Dremio Cloud project includes a built-in MCP server. OpenCode supports MCP natively through its &lt;code&gt;opencode.json&lt;/code&gt; configuration.&lt;/p&gt;
&lt;p&gt;For Claude-based tools, Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude plugin&lt;/a&gt; with guided setup. For OpenCode, you configure the MCP connection through &lt;code&gt;opencode.json&lt;/code&gt;:&lt;/p&gt;
&lt;h3&gt;Find Your Project&apos;s MCP Endpoint&lt;/h3&gt;
&lt;p&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and open your project. Go to &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt; and copy the MCP server URL.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt; and name it (e.g., &amp;quot;OpenCode MCP&amp;quot;).&lt;/li&gt;
&lt;li&gt;Add the redirect URI for your setup.&lt;/li&gt;
&lt;li&gt;Copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure OpenCode&apos;s MCP Connection&lt;/h3&gt;
&lt;p&gt;Add the Dremio MCP server to your &lt;code&gt;opencode.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcp&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;url&amp;quot;: &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;,
      &amp;quot;auth&amp;quot;: {
        &amp;quot;type&amp;quot;: &amp;quot;oauth&amp;quot;,
        &amp;quot;clientId&amp;quot;: &amp;quot;YOUR_CLIENT_ID&amp;quot;
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For global configuration, place it in &lt;code&gt;~/.config/opencode/opencode.json&lt;/code&gt;. For project-specific config, place it at the project root.&lt;/p&gt;
&lt;p&gt;After configuring, OpenCode can call Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; lists tables with descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column names, types, and metadata.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls wiki descriptions and labels from the catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetTableOrViewLineage&lt;/strong&gt; shows data lineage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns JSON results.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Self-Hosted Alternative&lt;/h3&gt;
&lt;p&gt;For Dremio Software, use the open-source &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;dremio-mcp&lt;/a&gt; server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/dremio/dremio-mcp
cd dremio-mcp
uv run dremio-mcp-server config create dremioai \
  --uri https://your-dremio-instance.com \
  --pat YOUR_PERSONAL_ACCESS_TOKEN
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then configure OpenCode to run the local server in &lt;code&gt;opencode.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcp&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
      &amp;quot;args&amp;quot;: [
        &amp;quot;run&amp;quot;,
        &amp;quot;--directory&amp;quot;,
        &amp;quot;/path/to/dremio-mcp&amp;quot;,
        &amp;quot;dremio-mcp-server&amp;quot;,
        &amp;quot;run&amp;quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The server supports &lt;code&gt;FOR_DATA_PATTERNS&lt;/code&gt; (query and explore), &lt;code&gt;FOR_SELF&lt;/code&gt; (system introspection), and &lt;code&gt;FOR_PROMETHEUS&lt;/code&gt; (metrics). Most coding workflows use &lt;code&gt;FOR_DATA_PATTERNS&lt;/code&gt;, the default mode. It gives the agent full access to explore your catalog, read schemas, pull wiki descriptions, and run SQL queries.&lt;/p&gt;
&lt;p&gt;If your team also handles Dremio administration, &lt;code&gt;FOR_SELF&lt;/code&gt; mode lets the agent analyze job history, resource utilization, and query performance. This is useful for platform engineering tasks where you need the agent to diagnose slow queries or suggest Reflection configurations. &lt;code&gt;FOR_PROMETHEUS&lt;/code&gt; connects to your monitoring stack for correlating Dremio metrics with broader system observability.&lt;/p&gt;
&lt;p&gt;For Dremio Cloud users, the hosted MCP server is the simpler option. No local installation, OAuth-based auth, and your existing access controls apply automatically. The self-hosted server gives more control and works with on-premise Dremio Software deployments.&lt;/p&gt;
&lt;h2&gt;Approach 2: Use AGENTS.md for Dremio Context&lt;/h2&gt;
&lt;p&gt;OpenCode shares the &lt;code&gt;AGENTS.md&lt;/code&gt; standard with OpenAI Codex. It auto-scans for this file at project start and uses it to guide agent behavior.&lt;/p&gt;
&lt;h3&gt;AGENTS.md Placement&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Project root:&lt;/strong&gt; &lt;code&gt;AGENTS.md&lt;/code&gt; applies to the current project.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Global:&lt;/strong&gt; &lt;code&gt;~/.config/opencode/AGENTS.md&lt;/code&gt; applies across all projects.&lt;/li&gt;
&lt;li&gt;Project-level files override global defaults.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Writing a Dremio-Focused AGENTS.md&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Agent Configuration

## Dremio Lakehouse

This project uses Dremio Cloud as its lakehouse.

### SQL Conventions

- Use `CREATE FOLDER IF NOT EXISTS` for namespace creation
- Open Catalog tables: `folder.subfolder.table_name` (no catalog prefix)
- External sources: `source_name.schema.table_name`
- Cast DATE to TIMESTAMP for join consistency
- Use TIMESTAMPDIFF for duration calculations

### Credentials

- PAT: env var `DREMIO_PAT`
- Endpoint: env var `DREMIO_URI`
- Never hardcode credentials

### References

- SQL syntax: https://docs.dremio.com/current/reference/sql/
- REST API: https://docs.dremio.com/current/reference/api/
- Local SQL reference: ./docs/dremio-sql-reference.md

### Terminology

- &amp;quot;Agentic Lakehouse&amp;quot; not &amp;quot;data warehouse&amp;quot;
- &amp;quot;Reflections&amp;quot; not &amp;quot;materialized views&amp;quot;
- &amp;quot;Open Catalog&amp;quot; built on Apache Polaris
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run &lt;code&gt;/init&lt;/code&gt; inside OpenCode to generate a starter &lt;code&gt;AGENTS.md&lt;/code&gt; from a project scan, then add the Dremio sections above.&lt;/p&gt;
&lt;h3&gt;Custom Agents for Dremio-Specific Workflows&lt;/h3&gt;
&lt;p&gt;OpenCode supports defining custom agents in &lt;code&gt;.opencode/agents/&lt;/code&gt;. This is a capability that most other tools lack. You can create a dedicated Dremio agent with its own system prompt, model choice, and tool permissions.&lt;/p&gt;
&lt;p&gt;Create &lt;code&gt;.opencode/agents/dremio-analyst.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
description: Dremio data analyst agent
mode: subagent
---

You are a data analyst working with Dremio Cloud. Your job is to:

1. Explore available tables using the MCP connection
2. Write SQL queries that follow Dremio conventions
3. Use TIMESTAMPDIFF, not DATEDIFF
4. Use CREATE FOLDER IF NOT EXISTS, not CREATE SCHEMA
5. Always validate function names against the SQL reference before using them
6. Never hardcode credentials; use environment variables
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This agent runs as a subagent that the primary agent can invoke for Dremio-specific tasks. You can configure it with a different model (for example, a reasoning model optimized for SQL generation) and restrict its tool access to only the Dremio MCP server.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/10/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official plugin&lt;/a&gt; for Claude Code users and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; is an official Dremio product. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository contains a complete agent skill with &lt;code&gt;SKILL.md&lt;/code&gt;, knowledge files, and an &lt;code&gt;AGENTS.md&lt;/code&gt; in the &lt;code&gt;rules/&lt;/code&gt; directory.&lt;/p&gt;
&lt;p&gt;For OpenCode, copy the AGENTS.md from the skill to your project:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cp dremio-agent-skill/dremio-skill/rules/AGENTS.md ./AGENTS.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or run the full installer for broader integration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cd dremio-agent-skill
./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The skill includes knowledge files covering Dremio CLI, Python SDK (dremioframe), SQL syntax, and REST API endpoints.&lt;/p&gt;
&lt;h3&gt;dremio-agent-md (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository provides a &lt;code&gt;DREMIO_AGENT.md&lt;/code&gt; protocol file and hierarchical documentation sitemaps.&lt;/p&gt;
&lt;p&gt;Clone it and tell OpenCode to read the protocol:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Instruct OpenCode: &amp;quot;Read DREMIO_AGENT.md in the dremio-agent-md directory. Use the sitemaps to verify SQL syntax before generating code.&amp;quot;&lt;/p&gt;
&lt;p&gt;This is especially powerful with OpenCode&apos;s LSP-based context engine. The agent can cross-reference the Dremio sitemaps with your actual project imports and file structure, ensuring that the SQL it generates fits both the Dremio dialect and your project conventions.&lt;/p&gt;
&lt;h2&gt;Approach 4: Build a Custom Dremio Agent&lt;/h2&gt;
&lt;p&gt;OpenCode&apos;s custom agent system is its differentiator for Dremio integration. While other tools limit you to context files, OpenCode lets you define a purpose-built Dremio agent.&lt;/p&gt;
&lt;h3&gt;Multi-Agent Architecture&lt;/h3&gt;
&lt;p&gt;Create a primary coding agent plus a Dremio-focused subagent:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;.opencode/agents/
  dremio-analyst.md       # Subagent for SQL and data queries
  dremio-pipeline.md      # Subagent for ETL/pipeline scripts
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each agent gets its own system prompt, model configuration, and tool permissions. The primary agent delegates Dremio tasks to the appropriate subagent, which has the full Dremio context loaded while keeping the primary agent&apos;s context window focused on application code.&lt;/p&gt;
&lt;p&gt;This separation matters for large projects. A data pipeline subagent can be configured with a reasoning-capable model (like a Chain of Thought model) that excels at complex SQL generation, while your primary coding agent uses a faster model for application logic. The Dremio subagent&apos;s tool permissions can be restricted to only the Dremio MCP server, preventing it from accidentally modifying application files.&lt;/p&gt;
&lt;h3&gt;Knowledge Files&lt;/h3&gt;
&lt;p&gt;Pair your custom agents with reference documentation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;docs/
  dremio-sql-reference.md
  team-schemas.md
  dremioframe-patterns.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference these in both your &lt;code&gt;AGENTS.md&lt;/code&gt; and your custom agent prompts. OpenCode&apos;s file-based memory system ensures the agent retains context from these references across interactions. Export your actual table schemas from Dremio&apos;s catalog and save them as markdown. Include dremioframe code snippets for common operations like querying, creating views, and managing branches. Add REST API call patterns for your CI/CD pipelines.&lt;/p&gt;
&lt;h2&gt;Using Dremio with OpenCode: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, OpenCode&apos;s multi-agent architecture enables sophisticated data workflows. Here are detailed examples.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;Type a question in OpenCode&apos;s TUI and get answers from your lakehouse:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Which product categories have the highest return rates? Cross-reference with customer satisfaction scores and identify correlations.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenCode routes this to the Dremio subagent, which discovers the relevant tables via MCP, writes a multi-table join with aggregations, runs it against Dremio, and returns analysis with the underlying SQL. The primary agent stays focused on your code context while the Dremio subagent handles the data work.&lt;/p&gt;
&lt;p&gt;Dig deeper with follow-up analysis:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;For the categories with highest returns, pull the top reasons from the returns table. Group by product SKU and show which specific items are driving the category-level numbers.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The Dremio subagent already knows the schema context from the previous query and generates the follow-up efficiently.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Ask OpenCode to create a visualization:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query Dremio&apos;s gold-layer views for inventory levels, reorder rates, and supplier lead times. Build a local HTML dashboard with ECharts showing stock trends, a forecasting chart, and supplier performance scorecards. Include a responsive layout and dark theme.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenCode&apos;s multi-agent system handles this: the Dremio subagent writes and executes the SQL queries, while the primary agent generates the HTML/CSS/JavaScript:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Dremio subagent discovers inventory views and pulls data&lt;/li&gt;
&lt;li&gt;Primary agent generates the HTML structure with ECharts&lt;/li&gt;
&lt;li&gt;Data is embedded as JSON in the generated file&lt;/li&gt;
&lt;li&gt;Interactive filters for warehouse, category, and date range&lt;/li&gt;
&lt;li&gt;Responsive layout that works on desktop and tablet&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Open it in a browser for an interactive supply chain dashboard running from a local file.&lt;/p&gt;
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Build a full application leveraging the multi-agent architecture:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a Python Dash app that uses dremioframe to connect to Dremio. Include a catalog browser, table schema viewer with column statistics, and a multi-tab interface for SQL queries, data profiling, and anomaly detection. Add a connection settings page.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenCode delegates the work: the Dremio subagent writes the dremioframe connection code and SQL queries, while the primary agent builds the Dash UI components and layout:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Multi-tab interface with catalog browser, schema viewer, SQL editor, and profiler&lt;/li&gt;
&lt;li&gt;Column statistics calculated from Dremio metadata&lt;/li&gt;
&lt;li&gt;Anomaly detection using basic IQR analysis on numeric columns&lt;/li&gt;
&lt;li&gt;Connection settings stored in &lt;code&gt;.env&lt;/code&gt; with a settings page for updates&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Run &lt;code&gt;python app.py&lt;/code&gt; and your team has a local data platform connected to the lakehouse.&lt;/p&gt;
&lt;h3&gt;Generate Pipeline Scripts with Agent Collaboration&lt;/h3&gt;
&lt;p&gt;Automate data engineering with Dremio-aware code:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Using the Dremio skill, create a Python ETL pipeline that processes IoT sensor data. Create bronze tables for raw readings, silver views that apply calibration offsets and flag anomalies (readings outside 3 standard deviations), and gold views that aggregate by device and time window. Include retry logic and structured logging.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The Dremio subagent writes the pipeline code using correct Dremio SQL conventions and bronze-silver-gold patterns, while the primary agent handles file management, error handling, and test generation. The result is production-quality code with proper separation of concerns.&lt;/p&gt;
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Serve lakehouse data to downstream applications:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a FastAPI service with endpoints for IoT device metrics, alert summaries, and historical trends. Connect to Dremio using dremioframe. Add WebSocket support for real-time data streaming and Pydantic response models.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenCode generates the full API server with the Dremio subagent handling query logic and the primary agent building the FastAPI framework.&lt;/p&gt;
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog&lt;/td&gt;
&lt;td&gt;Data analysis, real-time access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AGENTS.md&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, portable config&lt;/td&gt;
&lt;td&gt;Cross-tool consistency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skills&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Broad Dremio knowledge&lt;/td&gt;
&lt;td&gt;Quick start&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Agent&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Dedicated Dremio subagent with own model/prompt&lt;/td&gt;
&lt;td&gt;Advanced multi-agent workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;OpenCode&apos;s custom agent system makes the fourth approach more powerful than in other tools. A dedicated Dremio subagent with its own reasoning model and restricted tool access produces higher-quality SQL than a general-purpose agent trying to handle both application code and data queries in the same context.&lt;/p&gt;
&lt;p&gt;Combine the MCP server for live data access with a custom Dremio agent for SQL generation, and an &lt;code&gt;AGENTS.md&lt;/code&gt; for project-wide conventions. This three-layer stack gives you the strongest Dremio integration available in any open-source coding tool.&lt;/p&gt;
&lt;p&gt;If you are coming from Claude Code or Codex and want an open-source alternative, start with the AGENTS.md approach since your existing file works directly in OpenCode. Add the MCP connection for live data, then explore custom agents to see if the multi-agent architecture improves your workflow.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for Dremio Cloud free for 30 days&lt;/a&gt; ($400 in compute credits).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint in &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add it to your &lt;code&gt;opencode.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Clone &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; and copy the AGENTS.md.&lt;/li&gt;
&lt;li&gt;Start OpenCode and ask it to explore your Dremio catalog.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse provides what OpenCode&apos;s agents need for accurate analytics: the semantic layer delivers business context, query federation delivers universal data access, and Reflections deliver interactive speed. Both platforms are built on open standards, and connecting them gives you an open-source analytics stack from agent to lakehouse.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, see the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or take the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with OpenAI Codex CLI: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-openai-codex/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-openai-codex/</guid><description>
OpenAI Codex CLI is a terminal-based coding agent built in Rust. It reads your codebase, writes files, executes commands, and supports MCP for connec...</description><pubDate>Thu, 05 Mar 2026 17:00:00 GMT</pubDate><content:encoded>&lt;p&gt;OpenAI Codex CLI is a terminal-based coding agent built in Rust. It reads your codebase, writes files, executes commands, and supports MCP for connecting to external data services. Dremio is a unified lakehouse platform that provides the business context, universal data access, and query speed that coding agents need to produce accurate, working analytics code.&lt;/p&gt;
&lt;p&gt;Codex uses &lt;code&gt;AGENTS.md&lt;/code&gt; as its primary context file. This is an open standard designed to work across multiple AI tools, so the Dremio configuration you write for Codex also works with other AGENTS.md-compatible tools. That portability matters if your team uses different agents.&lt;/p&gt;
&lt;p&gt;Without a Dremio connection, Codex treats your lakehouse like any generic database. It may guess at table names, hallucinate SQL functions, or ignore your team&apos;s naming conventions. With a proper connection, Codex knows your schema, your business logic encoded in virtual views, and the right Dremio SQL dialect.&lt;/p&gt;
&lt;p&gt;Codex&apos;s support for the AGENTS.md open standard is worth highlighting. Unlike tool-specific context files, AGENTS.md works across multiple AI agents. Write it once for Codex and your team members using OpenCode, OpenWork, or any other AGENTS.md-compatible tool get the same context without maintaining separate files.&lt;/p&gt;
&lt;p&gt;This post covers four approaches, ordered from quickest setup to most customizable. Start with the one that matches your current needs, and layer in the others as your Dremio usage grows.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/09/codex-dremio-mcp-architecture.png&quot; alt=&quot;OpenAI Codex CLI connecting to Dremio Agentic Lakehouse via MCP&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up OpenAI Codex CLI&lt;/h2&gt;
&lt;p&gt;If you do not already have Codex CLI installed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Install Node.js&lt;/strong&gt; (version 22 or later) from &lt;a href=&quot;https://nodejs.org/&quot;&gt;nodejs.org&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install Codex&lt;/strong&gt; globally via npm:&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;npm install -g @openai/codex
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Launch Codex&lt;/strong&gt; by running &lt;code&gt;codex&lt;/code&gt; in your terminal from any project directory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authenticate&lt;/strong&gt; with your OpenAI API key. Codex uses the &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; environment variable.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Codex runs in your terminal and reads your project files for context. It supports three autonomy modes: &lt;code&gt;suggest&lt;/code&gt; (proposes changes), &lt;code&gt;auto-edit&lt;/code&gt; (applies file edits), and &lt;code&gt;full-auto&lt;/code&gt; (runs commands without confirmation).&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;The Model Context Protocol (MCP) is an open standard that lets AI tools call external services. Every Dremio Cloud project ships with a built-in MCP server. Codex supports MCP natively, making this the fastest way to give the agent direct access to your data.&lt;/p&gt;
&lt;p&gt;For Claude-based tools, Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude plugin&lt;/a&gt; with guided setup including &lt;code&gt;/dremio-setup&lt;/code&gt; for step-by-step configuration. For Codex, you configure the MCP connection through your project settings:&lt;/p&gt;
&lt;h3&gt;Find Your Project&apos;s MCP Endpoint&lt;/h3&gt;
&lt;p&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and open your project. Navigate to &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;. Copy the MCP server URL from the project overview page.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;p&gt;Dremio&apos;s hosted MCP server uses OAuth for authentication. Your existing access controls apply to every query Codex runs.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt; and name it (e.g., &amp;quot;Codex MCP&amp;quot;).&lt;/li&gt;
&lt;li&gt;Add the redirect URI specific to your Codex client setup.&lt;/li&gt;
&lt;li&gt;Save and copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure Codex&apos;s MCP Connection&lt;/h3&gt;
&lt;p&gt;Codex reads MCP configuration from its settings. Add the Dremio server to your MCP configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;url&amp;quot;: &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;,
      &amp;quot;auth&amp;quot;: {
        &amp;quot;type&amp;quot;: &amp;quot;oauth&amp;quot;,
        &amp;quot;clientId&amp;quot;: &amp;quot;YOUR_CLIENT_ID&amp;quot;
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After configuring, Codex can call Dremio&apos;s MCP tools directly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; lists available tables with descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column names, types, and metadata.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls wiki descriptions and labels from the catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetTableOrViewLineage&lt;/strong&gt; shows upstream data dependencies.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns results as JSON.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test it by asking Codex: &amp;quot;What tables are available in Dremio?&amp;quot; The agent will call the appropriate MCP resource and return your catalog contents.&lt;/p&gt;
&lt;h3&gt;Self-Hosted Alternative&lt;/h3&gt;
&lt;p&gt;For Dremio Software deployments, use the open-source &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;dremio-mcp&lt;/a&gt; server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/dremio/dremio-mcp
cd dremio-mcp
uv run dremio-mcp-server config create dremioai \
  --uri https://your-dremio-instance.com \
  --pat YOUR_PERSONAL_ACCESS_TOKEN
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then configure Codex to run the local server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
      &amp;quot;args&amp;quot;: [
        &amp;quot;run&amp;quot;,
        &amp;quot;--directory&amp;quot;,
        &amp;quot;/path/to/dremio-mcp&amp;quot;,
        &amp;quot;dremio-mcp-server&amp;quot;,
        &amp;quot;run&amp;quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The self-hosted server supports three modes: &lt;code&gt;FOR_DATA_PATTERNS&lt;/code&gt; for data exploration (default), &lt;code&gt;FOR_SELF&lt;/code&gt; for system introspection, and &lt;code&gt;FOR_PROMETHEUS&lt;/code&gt; for metrics correlation.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;FOR_DATA_PATTERNS&lt;/code&gt; mode is what you want for most coding workflows. It enables the agent to explore your catalog, read table schemas, pull wiki descriptions, and run SQL queries. The &lt;code&gt;FOR_SELF&lt;/code&gt; mode is useful for DevOps and platform engineering tasks where you need the agent to analyze Dremio&apos;s own performance metrics, job history, and resource utilization. The &lt;code&gt;FOR_PROMETHEUS&lt;/code&gt; mode connects to your monitoring stack for correlating Dremio-specific metrics with broader system observability.&lt;/p&gt;
&lt;p&gt;For Dremio Cloud users, the hosted MCP server is the simpler choice. It requires no local installation, handles authentication through OAuth, and inherits your existing access controls. The self-hosted option gives you more control and works with Dremio Software deployments that are not in the cloud.&lt;/p&gt;
&lt;h2&gt;Approach 2: Use AGENTS.md for Dremio Context&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;AGENTS.md&lt;/code&gt; is an open standard for providing AI coding agents with project context. Codex auto-scans for this file at the start of every task. It defines your project structure, coding conventions, and tool-specific instructions.&lt;/p&gt;
&lt;h3&gt;How AGENTS.md Works in Codex&lt;/h3&gt;
&lt;p&gt;Codex supports layered guidance:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Global defaults:&lt;/strong&gt; &lt;code&gt;~/.codex/AGENTS.md&lt;/code&gt; applies to every project.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Project-level:&lt;/strong&gt; &lt;code&gt;AGENTS.md&lt;/code&gt; at the repo root overrides global defaults.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Nested overrides:&lt;/strong&gt; &lt;code&gt;AGENTS.override.md&lt;/code&gt; in subdirectories provides directory-specific rules that take precedence over broader ones.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This layering is useful for monorepos where different subdirectories interact with Dremio differently.&lt;/p&gt;
&lt;h3&gt;Writing a Dremio-Focused AGENTS.md&lt;/h3&gt;
&lt;p&gt;Create &lt;code&gt;AGENTS.md&lt;/code&gt; in your project root:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Project Agent Configuration

## Dremio Lakehouse

This project uses Dremio Cloud as its lakehouse platform.

### SQL Conventions

- Use `CREATE FOLDER IF NOT EXISTS` for namespace creation
- Tables in the Open Catalog: `folder.subfolder.table_name` (no catalog prefix)
- External sources: `source_name.schema.table_name`
- Cast DATE columns to TIMESTAMP for consistent joins
- Use TIMESTAMPDIFF for duration calculations

### Credentials

- Personal Access Token: use env var `DREMIO_PAT`
- Cloud endpoint: use env var `DREMIO_URI`
- Never hardcode credentials in scripts

### Documentation References

- Dremio SQL reference: https://docs.dremio.com/current/reference/sql/
- REST API: https://docs.dremio.com/current/reference/api/
- For detailed SQL validation, read ./docs/dremio-sql-reference.md

### Terminology

- Use &amp;quot;Agentic Lakehouse&amp;quot; not &amp;quot;data warehouse&amp;quot;
- &amp;quot;Reflections&amp;quot; not &amp;quot;materialized views&amp;quot;
- &amp;quot;Open Catalog&amp;quot; is built on Apache Polaris
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also run &lt;code&gt;codex init&lt;/code&gt; to let Codex scan your project and scaffold an initial &lt;code&gt;AGENTS.md&lt;/code&gt;. Then edit it to add the Dremio-specific sections shown above.&lt;/p&gt;
&lt;h3&gt;Nested Overrides for Multi-Schema Projects&lt;/h3&gt;
&lt;p&gt;If different directories in your project target different Dremio namespaces, use &lt;code&gt;AGENTS.override.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# data-pipeline/AGENTS.override.md

## Dremio Namespace Override

All tables in this directory use the `etl_pipeline` top-level namespace.
Bronze views: etl*pipeline.bronze.*
Silver views: etl*pipeline.silver.*
Gold views: etl_pipeline.gold.\*
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This override applies only when Codex is working on files within the &lt;code&gt;data-pipeline/&lt;/code&gt; directory.&lt;/p&gt;
&lt;h3&gt;Portability Across Tools&lt;/h3&gt;
&lt;p&gt;One key advantage of AGENTS.md over tool-specific formats: the same file works with OpenCode, OpenWork, and any future tool that adopts the standard. Write it once for Codex and your team members using other AGENTS.md-compatible tools get the same Dremio context without extra setup.&lt;/p&gt;
&lt;p&gt;This portability is especially valuable for teams that are still evaluating which AI coding tool to standardize on. Rather than committing to CLAUDE.md (Claude-only) or SKILL.md (Antigravity-optimized), AGENTS.md gives you a tool-agnostic foundation that carries your Dremio conventions forward regardless of which agent your team picks.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/09/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;p&gt;Two community-supported open-source repositories provide ready-made Dremio context for coding agents.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official plugin&lt;/a&gt; for Claude Code users and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; is an official Dremio product. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill: Full Agent Skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository contains a comprehensive skill directory that teaches AI assistants how to interact with Dremio. It includes knowledge files for the CLI, Python SDK (dremioframe), SQL syntax, and REST API.&lt;/p&gt;
&lt;p&gt;For Codex, the skill&apos;s &lt;code&gt;rules/&lt;/code&gt; directory includes an &lt;code&gt;AGENTS.md&lt;/code&gt; file you can copy to your project root:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cp dremio-agent-skill/dremio-skill/rules/AGENTS.md ./AGENTS.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This gives Codex the Dremio conventions and references without running the full skill installer. For broader integration, run the installer:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cd dremio-agent-skill
./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Choose &lt;strong&gt;Local Project Install (Copy)&lt;/strong&gt; to copy the skill directory into your project, or &lt;strong&gt;Global Install (Symlink)&lt;/strong&gt; for system-wide access.&lt;/p&gt;
&lt;h3&gt;dremio-agent-md: Documentation Protocol (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository provides a &lt;code&gt;DREMIO_AGENT.md&lt;/code&gt; master protocol file and a browsable sitemap of the Dremio documentation.&lt;/p&gt;
&lt;p&gt;Clone it alongside your project:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then tell Codex to read the protocol file: &amp;quot;Read DREMIO_AGENT.md in the dremio-agent-md directory. Use the sitemaps in dremio_sitemaps/ to verify Dremio syntax before generating SQL.&amp;quot;&lt;/p&gt;
&lt;p&gt;This is especially useful for SQL validation. The agent navigates the sitemaps to find correct function signatures and reserved words instead of relying on training data.&lt;/p&gt;
&lt;h2&gt;Approach 4: Build Your Own Dremio Agent Configuration&lt;/h2&gt;
&lt;p&gt;If the pre-built options do not match your workflow, create a custom configuration.&lt;/p&gt;
&lt;h3&gt;Custom AGENTS.md with Knowledge Files&lt;/h3&gt;
&lt;p&gt;Create a directory structure that pairs your &lt;code&gt;AGENTS.md&lt;/code&gt; with reference documents:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;project-root/
  AGENTS.md
  docs/
    dremio-sql-reference.md
    team-schemas.md
    dremioframe-patterns.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In your &lt;code&gt;AGENTS.md&lt;/code&gt;, reference these files so Codex reads them when needed:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;## Reference Documentation

- For SQL syntax rules, read docs/dremio-sql-reference.md
- For team table schemas, read docs/team-schemas.md
- For Python SDK patterns, read docs/dremioframe-patterns.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Populate the knowledge files with your actual table schemas exported from Dremio, team-specific SQL patterns, and dremioframe code snippets for common operations.&lt;/p&gt;
&lt;h3&gt;Directory-Level Overrides&lt;/h3&gt;
&lt;p&gt;For monorepos, use &lt;code&gt;AGENTS.override.md&lt;/code&gt; in each subdirectory to provide namespace-specific context. The parent &lt;code&gt;AGENTS.md&lt;/code&gt; sets the Dremio conventions; the overrides specify which schemas and tables are relevant to each sub-project.&lt;/p&gt;
&lt;h2&gt;Using Dremio with Codex: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, Codex becomes a data engineering assistant in your terminal. Here are detailed examples.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;Type a question directly in Codex and get answers from production data:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What were our top 5 underperforming regions last quarter? Compare to the same quarter last year and suggest which metrics to investigate.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Codex discovers your tables via MCP, writes a multi-step SQL analysis, runs it against Dremio, and returns a structured answer. You get insights from production data without opening the Dremio UI.&lt;/p&gt;
&lt;p&gt;Follow up with deeper investigation:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;For the worst-performing region, break down the decline by product category. Is it a demand issue or a fulfillment issue? Show return rates and delivery times alongside revenue.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Codex maintains session context and uses the AGENTS.md conventions to write correct Dremio SQL. The layered guidance system means your global Dremio rules apply automatically.&lt;/p&gt;
&lt;p&gt;This pattern is especially powerful for engineers who live in the terminal. You can explore data, validate hypotheses, and generate insights without switching to a browser-based BI tool.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Ask Codex to create a complete visualization:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query Dremio&apos;s gold-layer financial views for revenue, expenses, and margins by department. Build a local HTML dashboard with D3.js charts showing trends, a summary table, and conditional formatting for over/under budget departments. Add a dark theme and filter controls.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Codex will:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use MCP to discover gold-layer financial views and their schemas&lt;/li&gt;
&lt;li&gt;Write and execute SQL queries for each metric&lt;/li&gt;
&lt;li&gt;Generate an HTML file with D3.js interactive visualizations&lt;/li&gt;
&lt;li&gt;Add conditional formatting (green/red) for budget variance&lt;/li&gt;
&lt;li&gt;Include filter dropdowns for department and date range&lt;/li&gt;
&lt;li&gt;Save the complete file to your project&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Open it in a browser for an interactive financial dashboard. No server required. Re-run the prompt weekly with fresh data from Dremio.&lt;/p&gt;
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Build an interactive tool for your team:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a Flask app with a REST API that proxies queries to Dremio through dremioframe. Add a React frontend with a table browser, column statistics view, and a SQL sandbox where I can run ad-hoc queries. Include authentication with API keys.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Codex scaffolds the full-stack app with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Flask backend with dremioframe connection pooling&lt;/li&gt;
&lt;li&gt;React frontend with schema browser and SQL editor&lt;/li&gt;
&lt;li&gt;API key middleware for access control&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docker-compose.yml&lt;/code&gt; for easy deployment&lt;/li&gt;
&lt;li&gt;Proper project structure with &lt;code&gt;requirements.txt&lt;/code&gt; and &lt;code&gt;package.json&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This pattern lets you create internal data tools quickly without a formal development cycle.&lt;/p&gt;
&lt;h3&gt;Generate Data Pipeline Code&lt;/h3&gt;
&lt;p&gt;Automate your ETL workflows:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Write a Python pipeline using dremioframe that incrementally processes new customer records. Create bronze views for raw data with TIMESTAMP casts, silver views with deduplication and email validation, and gold views with customer segmentation logic using CASE WHEN expressions. Add logging, error handling, and a summary report at the end.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Codex follows the Dremio conventions from your AGENTS.md and produces production-ready pipeline code. The AGENTS.md cross-tool portability means the same conventions apply whether you run this from Codex, OpenCode, or OpenWork.&lt;/p&gt;
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Serve lakehouse data to other applications:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a FastAPI service that connects to Dremio and serves customer analytics. Add endpoints for cohort analysis, retention metrics, and revenue forecasting. Include request validation, response caching, and health checks.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Codex generates a complete API server ready for &lt;code&gt;uvicorn main:app --reload&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog exploration&lt;/td&gt;
&lt;td&gt;Data analysis, SQL generation, real-time access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AGENTS.md&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, doc references, portable config&lt;/td&gt;
&lt;td&gt;Teams needing cross-tool consistency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skills&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Comprehensive Dremio knowledge (CLI, SDK, SQL, API)&lt;/td&gt;
&lt;td&gt;Quick start with broad coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Config&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Tailored to your schemas, patterns, and monorepo layout&lt;/td&gt;
&lt;td&gt;Mature teams with project-specific needs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;These approaches stack. Start with the MCP server for live data access, add an &lt;code&gt;AGENTS.md&lt;/code&gt; for Dremio conventions, and supplement with knowledge files as your team identifies recurring patterns. The layered guidance system in Codex (global, project, nested overrides) makes it easy to manage Dremio context at every level of your project hierarchy.&lt;/p&gt;
&lt;p&gt;If your team uses multiple AI coding tools, invest in the AGENTS.md approach first. It gives you a single Dremio configuration that works across tools, and you can layer in MCP for live data access from whichever agent you are using at the time.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for Dremio Cloud free for 30 days&lt;/a&gt; ($400 in compute credits included).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint in &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add it to Codex&apos;s MCP configuration.&lt;/li&gt;
&lt;li&gt;Clone &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; and copy the &lt;code&gt;AGENTS.md&lt;/code&gt; to your project root.&lt;/li&gt;
&lt;li&gt;Start Codex and ask it to explore your Dremio catalog.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse delivers the three things Codex needs to write accurate analytics code: the semantic layer provides business context, query federation provides universal data access, and Reflections provide interactive speed. The MCP server bridges them, and &lt;code&gt;AGENTS.md&lt;/code&gt; teaches the agent your team&apos;s conventions.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, see the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or take the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with JetBrains AI Assistant: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-jetbrains-ai/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-jetbrains-ai/</guid><description>
JetBrains AI Assistant is built into IntelliJ IDEA, PyCharm, DataGrip, and every JetBrains IDE. It provides AI chat, inline code generation, multi-fi...</description><pubDate>Thu, 05 Mar 2026 16:00:00 GMT</pubDate><content:encoded>&lt;p&gt;JetBrains AI Assistant is built into IntelliJ IDEA, PyCharm, DataGrip, and every JetBrains IDE. It provides AI chat, inline code generation, multi-file refactoring, and agentic background workers that can autonomously execute multi-step tasks. Dremio is a unified lakehouse platform that provides business context through its semantic layer, universal data access through query federation, and interactive speed through Reflections and Apache Arrow.&lt;/p&gt;
&lt;p&gt;Connecting them gives the AI Assistant the context it needs to write accurate Dremio SQL, generate data pipelines, and build applications against your lakehouse. JetBrains IDEs are especially strong for data engineering: DataGrip provides native database tooling, IntelliJ supports full-stack development, and PyCharm is the standard for Python data work. Adding Dremio context to the AI Assistant turns these IDEs into data-aware development environments.&lt;/p&gt;
&lt;p&gt;A unique feature of the JetBrains ecosystem is its dual MCP role: the AI Assistant acts as an MCP client (connecting to external servers like Dremio), and the IDE itself can also act as an MCP server (exposing IDE tools to other AI clients).&lt;/p&gt;
&lt;p&gt;This post covers four approaches, ordered from quickest setup to most customizable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/08/jetbrains-dremio-architecture.png&quot; alt=&quot;JetBrains IntelliJ AI Assistant connecting to Dremio Agentic Lakehouse via MCP&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up JetBrains AI Assistant&lt;/h2&gt;
&lt;p&gt;If you do not already have JetBrains AI Assistant:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Install a JetBrains IDE&lt;/strong&gt; : IntelliJ IDEA, PyCharm, DataGrip, or any other JetBrains IDE from &lt;a href=&quot;https://www.jetbrains.com/&quot;&gt;jetbrains.com&lt;/a&gt;. Community editions are free; Ultimate editions require a subscription.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Activate AI Assistant&lt;/strong&gt; : AI Assistant is included with JetBrains IDE subscriptions (2025.1+). Go to &lt;strong&gt;Settings &amp;gt; Plugins&lt;/strong&gt; and ensure &amp;quot;AI Assistant&amp;quot; is enabled.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sign in&lt;/strong&gt; with your JetBrains account to activate the AI quota.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open the AI Chat&lt;/strong&gt; by clicking the AI Assistant icon in the right sidebar or pressing &lt;code&gt;Alt+Enter&lt;/code&gt; on a code selection.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;JetBrains AI Assistant supports multiple LLM providers. You can use JetBrains-hosted models, connect your own API keys for Anthropic or OpenAI, or run local models via OpenAI-compatible servers for privacy-sensitive environments.&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;Every Dremio Cloud project ships with a built-in MCP server. JetBrains AI Assistant supports MCP as a client starting with version 2025.1.&lt;/p&gt;
&lt;p&gt;For Claude-based tools, Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude plugin&lt;/a&gt; with guided setup. For JetBrains, you configure the MCP connection through the IDE settings.&lt;/p&gt;
&lt;h3&gt;Find Your Project&apos;s MCP Endpoint&lt;/h3&gt;
&lt;p&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and navigate to &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;. Copy the MCP server URL.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt; and enter a name (e.g., &amp;quot;JetBrains MCP&amp;quot;).&lt;/li&gt;
&lt;li&gt;Add the appropriate redirect URIs.&lt;/li&gt;
&lt;li&gt;Save and copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure JetBrains MCP Connection&lt;/h3&gt;
&lt;p&gt;Go to &lt;strong&gt;Settings &amp;gt; Tools &amp;gt; AI Assistant &amp;gt; Model Context Protocol (MCP)&lt;/strong&gt;. Click &lt;strong&gt;Add&lt;/strong&gt; and select the transport type:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Streamable HTTP&lt;/strong&gt;: For Dremio Cloud&apos;s hosted MCP server. Enter the MCP URL directly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;STDIO&lt;/strong&gt;: For the self-hosted dremio-mcp server. Enter the command and arguments.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For HTTP configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Name: Dremio
Type: Streamable HTTP
URL: https://YOUR_PROJECT_MCP_URL
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After adding the server, the AI Assistant has access to Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; returns available tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column definitions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls catalog descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetTableOrViewLineage&lt;/strong&gt; shows data lineage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns results.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test by asking the AI chat: &amp;quot;What tables are available in Dremio?&amp;quot;&lt;/p&gt;
&lt;h3&gt;Self-Hosted Alternative&lt;/h3&gt;
&lt;p&gt;For Dremio Software deployments, configure the dremio-mcp server as STDIO transport:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Name: Dremio
Type: STDIO
Command: uv
Arguments: run --directory /path/to/dremio-mcp dremio-mcp-server run
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Approach 2: Use Project Rules for Dremio Context&lt;/h2&gt;
&lt;p&gt;JetBrains AI Assistant supports project-specific rules through markdown files in &lt;code&gt;.aiassistant/rules/&lt;/code&gt;. These files provide persistent AI instructions scoped to your project.&lt;/p&gt;
&lt;h3&gt;Create Project Rules&lt;/h3&gt;
&lt;p&gt;Create &lt;code&gt;.aiassistant/rules/dremio.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Dremio SQL Conventions

This project uses Dremio Cloud as its lakehouse platform.

## SQL Rules

- Use CREATE FOLDER IF NOT EXISTS (not CREATE NAMESPACE or CREATE SCHEMA)
- Tables in the Open Catalog use folder.subfolder.table_name
- External federated sources use source_name.schema.table_name
- Cast DATE to TIMESTAMP for consistent joins
- Use TIMESTAMPDIFF for duration calculations

## Credentials

- Never hardcode Personal Access Tokens. Use environment variable: DREMIO_PAT
- Cloud endpoint: environment variable DREMIO_URI

## Terminology

- Call it &amp;quot;Agentic Lakehouse&amp;quot;, not &amp;quot;data warehouse&amp;quot;
- &amp;quot;Reflections&amp;quot; are pre-computed optimizations, not &amp;quot;materialized views&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also set rules via the IDE: &lt;strong&gt;Settings &amp;gt; Tools &amp;gt; AI Assistant &amp;gt; Project Rules&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;Custom Prompts&lt;/h3&gt;
&lt;p&gt;Create reusable prompts in the Prompt Library (&lt;strong&gt;Settings &amp;gt; Tools &amp;gt; AI Assistant &amp;gt; Prompt Library&lt;/strong&gt;). For example, create a &amp;quot;Dremio SQL Review&amp;quot; prompt that validates SQL against Dremio conventions before execution. These prompts are available from the AI Actions menu and can be invoked on selected code.&lt;/p&gt;
&lt;h3&gt;DataGrip Integration&lt;/h3&gt;
&lt;p&gt;If you use DataGrip or the Database plugin in IntelliJ, you can connect directly to Dremio as a JDBC data source. The AI Assistant then has access to your live schema through the IDE&apos;s built-in database tools, complementing the MCP-based approach.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/08/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official plugin&lt;/a&gt; for Claude Code users and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; is an official Dremio product. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository provides knowledge files and rules:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cd dremio-agent-skill
./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Copy the knowledge files into your project&apos;s &lt;code&gt;.aiassistant/rules/&lt;/code&gt; directory and reference them from your project rules.&lt;/p&gt;
&lt;h3&gt;dremio-agent-md (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository provides a protocol file and sitemaps:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference it in your project rules:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;For Dremio SQL validation, read DREMIO_AGENT.md in ./dremio-agent-md/.
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Approach 4: Build Your Own Project Rules&lt;/h2&gt;
&lt;p&gt;Create a comprehensive rules setup in &lt;code&gt;.aiassistant/rules/&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;.aiassistant/rules/
  dremio-sql.md           # SQL conventions
  dremio-python.md        # dremioframe patterns
  dremio-schemas.md       # Team table schemas
  dremio-api.md           # REST API patterns
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Export your actual schemas from Dremio and keep them as a rule file. The AI Assistant reads all files in the &lt;code&gt;rules/&lt;/code&gt; directory and applies them to relevant interactions.&lt;/p&gt;
&lt;h2&gt;Using Dremio with JetBrains AI: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, the AI Assistant becomes a data-aware coding partner across all JetBrains IDEs.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;In the AI Chat panel, ask questions about your lakehouse:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What were our top 10 accounts by contract value last quarter? Break down by industry vertical and show renewal rates.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The AI uses MCP to discover tables, writes the SQL, and returns results. In DataGrip, you can execute the generated SQL directly in the query console for additional exploration.&lt;/p&gt;
&lt;p&gt;Follow up:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;For accounts with renewal rates below 70%, pull their support ticket history and calculate average resolution time. Cross-reference with product usage metrics.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The AI maintains conversation context and generates multi-table joins.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Ask the AI to generate a dashboard project:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query Dremio gold-layer views for revenue, customer metrics, and churn data. Create an HTML dashboard with ECharts. Include date filters, dark theme, and regional drill-down. Generate separate files.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The AI generates the complete dashboard. In IntelliJ or WebStorm, you can preview the HTML directly in the IDE&apos;s built-in browser.&lt;/p&gt;
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Generate a data tool:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a Streamlit app connected to Dremio via dremioframe. Include schema browsing, SQL query editor with syntax highlighting, data preview with pagination, and CSV download. Generate requirements.txt.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In PyCharm, the AI generates the app and you can run it directly from the IDE with integrated debugging.&lt;/p&gt;
&lt;h3&gt;Generate Data Pipeline Scripts&lt;/h3&gt;
&lt;p&gt;Use the AI for data engineering:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Write a Medallion Architecture pipeline using dremioframe. Bronze: ingest raw data. Silver: deduplicate, validate, standardize timestamps. Gold: business metrics and KPIs. Include logging and error handling.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The AI generates the pipeline code following your project rules. PyCharm&apos;s debugger lets you step through the pipeline against live Dremio data.&lt;/p&gt;
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Scaffold backend services:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a FastAPI app that serves Dremio analytics through REST endpoints. Add customer segments, revenue by region, and product trends. Include Pydantic models and OpenAPI docs.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;IntelliJ&apos;s HTTP client lets you test the endpoints directly from the IDE.&lt;/p&gt;
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog exploration&lt;/td&gt;
&lt;td&gt;Data analysis, SQL generation, real-time access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Project Rules&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, persistent AI context&lt;/td&gt;
&lt;td&gt;Teams with specific standards per IDE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skills&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Comprehensive Dremio knowledge (CLI, SDK, SQL, API)&lt;/td&gt;
&lt;td&gt;Quick start with broad coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Rules&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Tailored schemas, custom prompts, team conventions&lt;/td&gt;
&lt;td&gt;Mature teams with DataGrip/PyCharm workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Start with the MCP server for live data access. Add project rules for conventions. Use DataGrip&apos;s native Dremio connection for schema exploration alongside MCP.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for a free Dremio Cloud trial&lt;/a&gt; (30 days, $400 in compute credits).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint in &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add it in &lt;strong&gt;Settings &amp;gt; Tools &amp;gt; AI Assistant &amp;gt; MCP&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Create &lt;code&gt;.aiassistant/rules/dremio.md&lt;/code&gt; with your SQL conventions.&lt;/li&gt;
&lt;li&gt;Open AI Chat and ask it to explore your Dremio catalog.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse gives the JetBrains AI Assistant accurate data context, and the IDE&apos;s native database tooling provides complementary schema exploration and SQL execution.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, check out the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or enroll in the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with Google Antigravity: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-google-antigravity/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-google-antigravity/</guid><description>
Google Antigravity is an agent-first IDE built by Google DeepMind. Its autonomous agents plan multi-step tasks, write code, browse documentation, and...</description><pubDate>Thu, 05 Mar 2026 15:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Google Antigravity is an agent-first IDE built by Google DeepMind. Its autonomous agents plan multi-step tasks, write code, browse documentation, and iterate without constant hand-holding. Dremio is a unified lakehouse platform that provides the business context, universal data access, and interactive query speed that AI agents need to produce accurate analytics.&lt;/p&gt;
&lt;p&gt;Connecting the two gives your Antigravity agents something most coding agents lack: direct access to your data catalog, table schemas, business logic encoded in views, and the correct SQL dialect for Dremio&apos;s query engine. Without it, the agent guesses at table names and hallucinates SQL functions. With it, the agent writes queries that actually run.&lt;/p&gt;
&lt;p&gt;Antigravity&apos;s skill system is a particularly strong fit for Dremio integration. Skills load on demand based on semantic matching, so Dremio knowledge enters the context only when the agent needs it. This keeps the context window efficient for tasks that have nothing to do with data, while still providing deep Dremio expertise when you shift to analytics work.&lt;/p&gt;
&lt;p&gt;This post walks through four integration approaches. Each one adds a different kind of context, and they combine well. You can start with the simplest option and layer in more approaches as your team&apos;s Dremio usage grows.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/07/antigravity-dremio-mcp-architecture.png&quot; alt=&quot;Google Antigravity IDE connecting to Dremio Agentic Lakehouse via MCP&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up Google Antigravity&lt;/h2&gt;
&lt;p&gt;If you do not already have Antigravity installed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Download Antigravity&lt;/strong&gt; from the &lt;a href=&quot;https://deepmind.google/tools/&quot;&gt;Google DeepMind tools page&lt;/a&gt; or your organization&apos;s approved software catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install it&lt;/strong&gt; by following the platform-specific instructions (available for macOS, Linux, and Windows).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open a project&lt;/strong&gt; by launching Antigravity and pointing it to your project directory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configure your AI model&lt;/strong&gt; by adding your API key or connecting your Google Cloud account in the IDE settings.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Antigravity&apos;s agent-first design means it can plan multi-step tasks, execute shell commands, browse documentation, and iterate autonomously. Its skill system and rules engine give you fine-grained control over how agents behave.&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;The Model Context Protocol (MCP) is an open standard for AI tools to call external services. Dremio Cloud includes a built-in MCP server in every project, and Antigravity supports MCP natively through its IDE settings.&lt;/p&gt;
&lt;p&gt;For Claude-based tools, Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude plugin&lt;/a&gt; with guided setup. Since Antigravity uses its own MCP configuration, you will configure the connection through the IDE settings instead.&lt;/p&gt;
&lt;h3&gt;Find Your Project&apos;s MCP Endpoint&lt;/h3&gt;
&lt;p&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and open your project. Go to &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;. The MCP server URL is displayed on the project overview page. Copy it.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;p&gt;The hosted MCP server uses OAuth to authenticate connections. Your existing Dremio access controls apply to every query your Antigravity agent runs.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt; and enter a name like &amp;quot;Antigravity MCP&amp;quot;.&lt;/li&gt;
&lt;li&gt;Add the appropriate redirect URI for your Antigravity setup.&lt;/li&gt;
&lt;li&gt;Save and copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure Antigravity&apos;s MCP Connection&lt;/h3&gt;
&lt;p&gt;In Antigravity, open the MCP settings panel from the IDE preferences. Add a new MCP server with your Dremio project URL and the OAuth client credentials. The agent will now have access to Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; returns available tables with descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column names, types, and metadata.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls wiki descriptions and labels from the catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetTableOrViewLineage&lt;/strong&gt; shows upstream data dependencies.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns results as JSON.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test the connection by asking your Antigravity agent: &amp;quot;What tables are available in Dremio?&amp;quot; The agent will call &lt;code&gt;GetUsefulSystemTableNames&lt;/code&gt; and return your catalog contents.&lt;/p&gt;
&lt;h3&gt;Self-Hosted Alternative&lt;/h3&gt;
&lt;p&gt;For Dremio Software deployments, use the open-source &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;dremio-mcp&lt;/a&gt; server. Clone the repo, configure it with your Dremio instance URL and a Personal Access Token (PAT), then point Antigravity&apos;s MCP settings to the local server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/dremio/dremio-mcp
cd dremio-mcp
uv run dremio-mcp-server config create dremioai \
  --uri https://your-dremio-instance.com \
  --pat YOUR_PERSONAL_ACCESS_TOKEN
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In Antigravity&apos;s MCP settings, configure the server to run via the local command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
  &amp;quot;args&amp;quot;: [
    &amp;quot;run&amp;quot;,
    &amp;quot;--directory&amp;quot;,
    &amp;quot;/path/to/dremio-mcp&amp;quot;,
    &amp;quot;dremio-mcp-server&amp;quot;,
    &amp;quot;run&amp;quot;
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The self-hosted server supports three modes: &lt;code&gt;FOR_DATA_PATTERNS&lt;/code&gt; for data exploration and SQL generation (default), &lt;code&gt;FOR_SELF&lt;/code&gt; for system performance analysis, and &lt;code&gt;FOR_PROMETHEUS&lt;/code&gt; for correlating Dremio metrics with your monitoring stack.&lt;/p&gt;
&lt;h2&gt;Approach 2: Use SKILL.md and Agent Rules for Dremio Context&lt;/h2&gt;
&lt;p&gt;Antigravity&apos;s defining feature is its skill system. Skills are reusable knowledge packages that agents discover and load on demand. A skill is a directory containing a &lt;code&gt;SKILL.md&lt;/code&gt; file with YAML frontmatter for discovery and markdown instructions for the agent.&lt;/p&gt;
&lt;p&gt;The key difference from context files in other tools: Antigravity skills are loaded only when relevant. The agent reads the skill&apos;s name and description from the YAML frontmatter, semantically matches them against your prompt, and activates the skill only when it is needed. This avoids wasting context tokens on instructions the agent does not need for the current task.&lt;/p&gt;
&lt;p&gt;This architecture is called progressive disclosure. A tool like Claude Code loads &lt;code&gt;CLAUDE.md&lt;/code&gt; into every session whether you need it or not. Antigravity loads skills selectively. For teams that use Dremio for some projects and not others, this means zero overhead on non-Dremio work.&lt;/p&gt;
&lt;h3&gt;How SKILL.md Works&lt;/h3&gt;
&lt;p&gt;A &lt;code&gt;SKILL.md&lt;/code&gt; file has two parts:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
name: Dremio Conventions
description: SQL syntax, REST API patterns, and credential handling for Dremio Cloud
---

# Dremio Conventions

## SQL Rules

- Use CREATE FOLDER IF NOT EXISTS (not CREATE NAMESPACE)
- Tables in the Open Catalog use folder.subfolder.table_name without a catalog prefix
- External sources use source_name.schema.table_name
- Cast DATE to TIMESTAMP for consistent joins

## Credentials

- Never hardcode PATs. Use environment variable DREMIO_PAT
- Dremio Cloud endpoint: environment variable DREMIO_URI

## Reference

- For SQL syntax validation, read knowledge/sql-reference.md
- For REST API endpoints, read knowledge/rest-api.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Place this in &lt;code&gt;.agent/skills/dremio/SKILL.md&lt;/code&gt; for workspace scope or &lt;code&gt;~/.agent/skills/dremio/SKILL.md&lt;/code&gt; for global scope.&lt;/p&gt;
&lt;h3&gt;Agent Rules for Always-On Guidance&lt;/h3&gt;
&lt;p&gt;Skills activate on demand. For instructions that should apply to every session regardless of the prompt, use Antigravity&apos;s rules system. Place markdown files in &lt;code&gt;.agent/rules/&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# .agent/rules/dremio-sql.md

When writing Dremio SQL:

- Never use CREATE SCHEMA or CREATE NAMESPACE. Dremio uses CREATE FOLDER IF NOT EXISTS.
- Always validate function names against the Dremio SQL reference before including them.
- Use TIMESTAMPDIFF for duration calculations, not DATEDIFF.
- Dremio is not a data warehouse. It is an Agentic Lakehouse platform.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Rules load at session start, similar to &lt;code&gt;CLAUDE.md&lt;/code&gt; in Claude Code. Use rules for hard constraints (like SQL dialect rules) and skills for reference knowledge (like API documentation).&lt;/p&gt;
&lt;h3&gt;Workflows for Repetitive Dremio Tasks&lt;/h3&gt;
&lt;p&gt;Antigravity also supports workflows in &lt;code&gt;.agent/workflows/&lt;/code&gt;. These are saved prompts the agent follows step by step. For example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# .agent/workflows/dremio-data-model.md

---

## description: Create a bronze-silver-gold data model in Dremio

1. Read the Dremio skill for SQL conventions
2. Create folders for bronze, silver, and gold layers
3. Create bronze views with column renames and TIMESTAMP casts
4. Create silver views joining bronze views with business logic
5. Create gold views with CASE WHEN classifications
6. Enable AI-generated wikis on gold views
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/07/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;p&gt;Two community-supported open-source repositories provide ready-made Dremio context. Antigravity has first-class support for the skill-based approach.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; Dremio offers an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude Code plugin&lt;/a&gt; for Claude-based tools, and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; is an official Dremio product. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill: Native Antigravity Skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository is designed for tools like Antigravity. It contains a complete &lt;code&gt;dremio-skill/&lt;/code&gt; directory with &lt;code&gt;SKILL.md&lt;/code&gt;, comprehensive &lt;code&gt;knowledge/&lt;/code&gt; files (CLI, Python SDK, SQL, REST API), and configuration files for other tools.&lt;/p&gt;
&lt;p&gt;Install it globally:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cd dremio-agent-skill
./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Choose &lt;strong&gt;Global Install (Symlink)&lt;/strong&gt; when prompted. This creates a symlink from the repo&apos;s &lt;code&gt;dremio-skill/&lt;/code&gt; directory to &lt;code&gt;~/.agent/skills/&lt;/code&gt;, making the skill available in every Antigravity workspace. When you pull updates to the repo, the skill updates automatically.&lt;/p&gt;
&lt;p&gt;After installation, start a new Antigravity session and ask it to scan for available skills. The agent will discover the Dremio skill by its name and description, and load it whenever you ask Dremio-related questions.&lt;/p&gt;
&lt;p&gt;For team projects, choose &lt;strong&gt;Local Project Install (Copy)&lt;/strong&gt; instead. This copies the skill into your project and sets up &lt;code&gt;.agent&lt;/code&gt; symlinks so every team member who clones the repo gets the same context.&lt;/p&gt;
&lt;h3&gt;dremio-agent-md: Documentation Protocol (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository provides a &lt;code&gt;DREMIO_AGENT.md&lt;/code&gt; protocol file and browsable sitemaps of the Dremio documentation.&lt;/p&gt;
&lt;p&gt;Clone it and tell your Antigravity agent to read it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then instruct the agent: &amp;quot;Read DREMIO_AGENT.md in the dremio-agent-md directory. Use the sitemaps in dremio_sitemaps/ to verify Dremio syntax before generating any SQL.&amp;quot;&lt;/p&gt;
&lt;p&gt;This approach is useful when you need the agent to cross-reference specific documentation pages rather than rely on pre-packaged knowledge files.&lt;/p&gt;
&lt;h2&gt;Approach 4: Build Your Own Dremio Skill&lt;/h2&gt;
&lt;p&gt;If the pre-built skill does not fit your workflow, build a custom one. Antigravity&apos;s skill system makes this straightforward.&lt;/p&gt;
&lt;h3&gt;Create the Skill Structure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;.agent/skills/my-dremio/
  SKILL.md
  knowledge/
    sql-conventions.md
    team-schemas.md
    dremioframe-patterns.md
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Write the SKILL.md&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
name: Team Dremio Skill
description: SQL conventions, table schemas, and dremioframe patterns for our analytics lakehouse
---

# Team Dremio Skill

## SQL Standards

- All tables are under the analytics namespace
- Bronze: analytics.bronze._, Silver: analytics.silver._, Gold: analytics.gold.\*
- Always use TIMESTAMP, never DATE
- Validate function names against knowledge/sql-conventions.md

## Authentication

- Use env var DREMIO_PAT for tokens
- Cloud endpoint: env var DREMIO_URI

## Common Tasks

- For bulk data operations, use dremioframe patterns in knowledge/dremioframe-patterns.md
- For table schemas, check knowledge/team-schemas.md
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Populate Knowledge Files&lt;/h3&gt;
&lt;p&gt;Export your actual table schemas from Dremio and save them as markdown in the &lt;code&gt;knowledge/&lt;/code&gt; directory. Include dremioframe code snippets your team uses frequently, REST API call patterns for your CI/CD pipeline, and SQL examples that follow your naming conventions.&lt;/p&gt;
&lt;p&gt;The advantage of a custom skill over a generic rules file: skills activate based on semantic matching. When you ask about a completely unrelated topic, the Dremio skill stays out of the context window. When you ask about data pipelines or SQL, the agent pulls it in automatically.&lt;/p&gt;
&lt;h2&gt;Using Dremio with Antigravity: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, Antigravity&apos;s agents can execute complete data projects autonomously. Here are detailed examples.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;Ask your Antigravity agent questions about your lakehouse in plain English:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What is the average order value by product category for the last 6 months? Show me which categories are trending up.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The agent uses MCP to discover relevant tables, writes and runs the SQL against Dremio, and returns formatted results with analysis. No SQL required.&lt;/p&gt;
&lt;p&gt;Take it further with multi-step analysis:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;For the categories trending up, pull the top 5 products in each and compare their margins. Are we making more revenue but at lower margins?&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Antigravity&apos;s skill system loads the Dremio conventions automatically when it detects a data-related question, so the SQL it generates follows your team&apos;s standards without you needing to remind it.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Give the agent a broader task:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query our gold-layer customer analytics views in Dremio. Build a local HTML dashboard with Plotly.js charts showing customer lifetime value distribution, churn rates by cohort, and retention curves. Include date range filters and a dark theme.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Antigravity will:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Activate the Dremio skill to understand your SQL conventions&lt;/li&gt;
&lt;li&gt;Use MCP to discover gold-layer views and their schemas&lt;/li&gt;
&lt;li&gt;Write and execute the SQL queries&lt;/li&gt;
&lt;li&gt;Generate an HTML file with Plotly.js interactive charts&lt;/li&gt;
&lt;li&gt;Add filter controls and a responsive layout&lt;/li&gt;
&lt;li&gt;Save it to your workspace&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Open the HTML file in a browser for a complete dashboard running from a local file. The Plotly.js charts support zoom, pan, hover tooltips, and export to PNG.&lt;/p&gt;
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Ask for an interactive tool:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a Streamlit app that connects to Dremio using dremioframe. Add a sidebar for browsing schemas and tables, a detail view showing table schemas and wiki descriptions, a SQL query editor with syntax highlighting, and a results panel with pagination and CSV download.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Antigravity writes the full Python application with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dremio catalog browser using the MCP connection for live schema data&lt;/li&gt;
&lt;li&gt;SQL editor with autocomplete based on discovered table names&lt;/li&gt;
&lt;li&gt;Paginated results display with export options&lt;/li&gt;
&lt;li&gt;Connection management using environment variables&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Run &lt;code&gt;streamlit run app.py&lt;/code&gt; and your team has a local data explorer without waiting for a BI tool deployment.&lt;/p&gt;
&lt;h3&gt;Automate Data Workflows with Antigravity Workflows&lt;/h3&gt;
&lt;p&gt;Use Antigravity&apos;s workflow system to create repeatable Dremio operations:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Using the Dremio skill, write a Python script that creates a bronze-silver-gold view hierarchy for our new user events table. Follow the Medallion Architecture patterns. Bronze should rename columns to snake_case and cast dates. Silver should deduplicate and validate required fields. Gold should aggregate daily active users and session duration by segment.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The agent references the Dremio skill for conventions and produces structured SQL and Python code. Save the prompt as an Antigravity workflow in &lt;code&gt;.agent/workflows/new-data-model.md&lt;/code&gt; so any team member can run it for new tables.&lt;/p&gt;
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Create backend services:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a Flask API that queries Dremio&apos;s gold-layer views. Create endpoints for customer segments, revenue trends, and product performance. Include caching with a 5-minute TTL and rate limiting. Generate OpenAPI docs.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Antigravity generates the full application with proper error handling, connection pooling via dremioframe, and production-ready configuration.&lt;/p&gt;
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog exploration&lt;/td&gt;
&lt;td&gt;Data analysis, SQL generation, real-time access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SKILL.md + Rules&lt;/td&gt;
&lt;td&gt;15 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, on-demand doc references&lt;/td&gt;
&lt;td&gt;Teams with specific SQL standards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skill&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Comprehensive Dremio knowledge (CLI, SDK, SQL, API)&lt;/td&gt;
&lt;td&gt;Quick start with broad coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Skill&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Tailored to your schemas, patterns, workflows&lt;/td&gt;
&lt;td&gt;Mature teams with specific needs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Combine them for the strongest setup. Use the MCP server for live data, a pre-built skill for general Dremio knowledge, rules for hard SQL constraints, and a custom skill for your team&apos;s specific schemas and patterns.&lt;/p&gt;
&lt;p&gt;If you are evaluating Dremio for the first time, start with the MCP server. It takes five minutes and gives you immediate querying capabilities. As you develop team conventions, add rules files for the constraints that should apply universally. Once you have a stable set of patterns, package them into a custom skill that your entire team can install.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for Dremio Cloud free for 30 days&lt;/a&gt; ($400 in compute credits included).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint under &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add it in Antigravity&apos;s MCP settings panel.&lt;/li&gt;
&lt;li&gt;Clone &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; and run &lt;code&gt;./install.sh&lt;/code&gt; with global symlink mode.&lt;/li&gt;
&lt;li&gt;Start a new Antigravity session and ask it to explore your Dremio catalog.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse provides the three things Antigravity agents need for accurate analytics: the semantic layer delivers business context, query federation delivers universal data access, and Reflections deliver interactive speed. The MCP server connects them, and skills teach the agent your team&apos;s conventions.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, see the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or take the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with GitHub Copilot: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-github-copilot/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-github-copilot/</guid><description>
GitHub Copilot is the most widely adopted AI coding assistant, integrated into VS Code, JetBrains IDEs, and the GitHub platform. Its agent mode allow...</description><pubDate>Thu, 05 Mar 2026 14:00:00 GMT</pubDate><content:encoded>&lt;p&gt;GitHub Copilot is the most widely adopted AI coding assistant, integrated into VS Code, JetBrains IDEs, and the GitHub platform. Its agent mode allows Copilot to plan and execute multi-step coding tasks, run terminal commands, and interact with external tools through MCP. The Copilot CLI extends agentic development to the terminal. Dremio is a unified lakehouse platform that provides business context through its semantic layer, universal data access through query federation, and interactive speed through Reflections and Apache Arrow.&lt;/p&gt;
&lt;p&gt;Connecting them gives Copilot&apos;s agent mode the context it needs to write accurate Dremio SQL, generate data pipelines, and build applications against your lakehouse. This is significant because of Copilot&apos;s massive user base: if you already use Copilot for code completion and chat, adding Dremio context turns it into a data-aware development partner without switching tools.&lt;/p&gt;
&lt;p&gt;Copilot&apos;s &lt;code&gt;copilot-instructions.md&lt;/code&gt; file and &lt;code&gt;.vscode/mcp.json&lt;/code&gt; configuration make it straightforward to integrate project-specific Dremio conventions and live data access into your workflow.&lt;/p&gt;
&lt;p&gt;This post covers four approaches, ordered from quickest setup to most customizable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/06/copilot-dremio-architecture.png&quot; alt=&quot;GitHub Copilot agent mode in VS Code connecting to Dremio Agentic Lakehouse via MCP&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up GitHub Copilot&lt;/h2&gt;
&lt;p&gt;If you do not already have GitHub Copilot:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Sign up for GitHub Copilot&lt;/strong&gt; at &lt;a href=&quot;https://github.com/features/copilot&quot;&gt;github.com/features/copilot&lt;/a&gt;. Individual ($10/month), Business ($19/user/month), and Enterprise ($39/user/month) plans are available. Free tier includes limited completions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install VS Code&lt;/strong&gt; from &lt;a href=&quot;https://code.visualstudio.com/&quot;&gt;code.visualstudio.com&lt;/a&gt; if not already installed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install the GitHub Copilot extension&lt;/strong&gt; from the VS Code marketplace (search &amp;quot;GitHub Copilot&amp;quot;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sign in&lt;/strong&gt; with your GitHub account when prompted.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enable agent mode&lt;/strong&gt; by clicking the Copilot chat icon and selecting &amp;quot;Agent&amp;quot; from the mode dropdown (available in VS Code 1.99+).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For terminal usage, install the &lt;strong&gt;Copilot CLI&lt;/strong&gt; (&lt;code&gt;gh copilot&lt;/code&gt;) through the GitHub CLI:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;gh extension install github/gh-copilot
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;Every Dremio Cloud project ships with a built-in MCP server. Copilot agent mode supports MCP natively through workspace configuration files.&lt;/p&gt;
&lt;p&gt;For Claude-based tools, Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude plugin&lt;/a&gt; with guided setup. For Copilot, you configure the MCP connection through &lt;code&gt;.vscode/mcp.json&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Find Your Project&apos;s MCP Endpoint&lt;/h3&gt;
&lt;p&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and navigate to &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;. Copy the MCP server URL.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt; and enter a name (e.g., &amp;quot;Copilot MCP&amp;quot;).&lt;/li&gt;
&lt;li&gt;Add the appropriate redirect URIs.&lt;/li&gt;
&lt;li&gt;Save and copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure Copilot&apos;s MCP Connection&lt;/h3&gt;
&lt;p&gt;Create &lt;code&gt;.vscode/mcp.json&lt;/code&gt; in your project root:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;servers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;type&amp;quot;: &amp;quot;http&amp;quot;,
      &amp;quot;url&amp;quot;: &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also configure MCP servers in your VS Code user settings:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcp&amp;quot;: {
    &amp;quot;servers&amp;quot;: {
      &amp;quot;dremio&amp;quot;: {
        &amp;quot;type&amp;quot;: &amp;quot;http&amp;quot;,
        &amp;quot;url&amp;quot;: &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reload VS Code. Copilot agent mode now has access to Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; returns available tables with descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column names, types, and metadata.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls wiki descriptions from the catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetTableOrViewLineage&lt;/strong&gt; shows upstream dependencies.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns results as JSON.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test the connection by opening Copilot Chat in agent mode and asking: &amp;quot;What tables are available in Dremio?&amp;quot; The agent will call &lt;code&gt;GetUsefulSystemTableNames&lt;/code&gt; and return your catalog contents.&lt;/p&gt;
&lt;h3&gt;Self-Hosted Alternative&lt;/h3&gt;
&lt;p&gt;For Dremio Software deployments, use the open-source &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;dremio-mcp&lt;/a&gt; server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;servers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;type&amp;quot;: &amp;quot;stdio&amp;quot;,
      &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
      &amp;quot;args&amp;quot;: [
        &amp;quot;run&amp;quot;,
        &amp;quot;--directory&amp;quot;,
        &amp;quot;/path/to/dremio-mcp&amp;quot;,
        &amp;quot;dremio-mcp-server&amp;quot;,
        &amp;quot;run&amp;quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Enterprise Policy Controls&lt;/h3&gt;
&lt;p&gt;For organizations, GitHub administrators can manage MCP server access through organization policies. This lets teams standardize on approved Dremio MCP connections while preventing unauthorized data access.&lt;/p&gt;
&lt;h2&gt;Approach 2: Use copilot-instructions.md for Dremio Context&lt;/h2&gt;
&lt;p&gt;Copilot reads custom instructions from &lt;code&gt;.github/copilot-instructions.md&lt;/code&gt; in your repository. This file is loaded into every Copilot interaction, providing persistent project context.&lt;/p&gt;
&lt;h3&gt;Repository-Level Instructions&lt;/h3&gt;
&lt;p&gt;Create &lt;code&gt;.github/copilot-instructions.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Dremio SQL Conventions

This project uses Dremio Cloud as its lakehouse platform.

## SQL Rules

- Use CREATE FOLDER IF NOT EXISTS (not CREATE NAMESPACE or CREATE SCHEMA)
- Tables in the Open Catalog use folder.subfolder.table_name
- External federated sources use source_name.schema.table_name
- Cast DATE to TIMESTAMP for consistent joins
- Use TIMESTAMPDIFF for duration calculations

## Credentials

- Never hardcode Personal Access Tokens. Use environment variable: DREMIO_PAT
- Cloud endpoint: environment variable DREMIO_URI

## Terminology

- Call it &amp;quot;Agentic Lakehouse&amp;quot;, not &amp;quot;data warehouse&amp;quot;
- &amp;quot;Reflections&amp;quot; are pre-computed optimizations, not &amp;quot;materialized views&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Pattern-Specific Instructions&lt;/h3&gt;
&lt;p&gt;Copilot also supports &lt;code&gt;.instructions&lt;/code&gt; files with YAML glob patterns for targeted application:&lt;/p&gt;
&lt;p&gt;Create &lt;code&gt;.github/instructions/dremio-sql.instructions.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
applyTo: &amp;quot;**/*.sql&amp;quot;
---

When writing SQL for Dremio:

- Validate function names against the Dremio SQL reference
- Use TIMESTAMPDIFF for duration calculations
- Cast DATE columns to TIMESTAMP before joins
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create &lt;code&gt;.github/instructions/dremio-python.instructions.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
applyTo: &amp;quot;**/*.py&amp;quot;
---

When writing Python code that uses dremioframe:

- Import as: from dremioframe import DremioConnection
- Use environment variables for credentials
- Always close connections in a finally block
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This scoping is similar to Cursor&apos;s &lt;code&gt;.cursor/rules/*.mdc&lt;/code&gt; pattern matching.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/06/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official plugin&lt;/a&gt; for Claude Code users and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; is an official Dremio product. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository provides knowledge files and a &lt;code&gt;.cursorrules&lt;/code&gt; file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cd dremio-agent-skill
./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference the knowledge files from your &lt;code&gt;copilot-instructions.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;For Dremio SQL conventions, read the knowledge files in dremio-skill/knowledge/.
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;dremio-agent-md (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository provides a protocol file and documentation sitemaps:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference it in &lt;code&gt;.github/copilot-instructions.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;For Dremio documentation, read DREMIO_AGENT.md in ./dremio-agent-md/.
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Approach 4: Build Your Own copilot-instructions.md&lt;/h2&gt;
&lt;p&gt;Create a comprehensive instruction file with your team&apos;s Dremio environment:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Team Dremio Context

## Environment

- Lakehouse: Dremio Cloud (analytics project)
- Catalog: Apache Polaris-based Open Catalog
- Architecture: Medallion (bronze → silver → gold)

## Table Schemas

For exact column definitions, read ./docs/table-schemas.md

## SQL Standards

- Bronze: raw._, Silver: cleaned._, Gold: analytics.\*
- Always use TIMESTAMP, never DATE
- Validate function names against ./docs/dremio-sql-reference.md

## Python SDK

- Use dremioframe for all Dremio connections
- Patterns: read ./docs/dremioframe-patterns.md
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Using Dremio with GitHub Copilot: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, Copilot agent mode can execute complete data projects. Here are detailed examples.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;In agent mode, ask Copilot questions about your data:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What were our top 10 customers by lifetime value? Show their order frequency and most recent purchase date.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Copilot agent mode uses MCP to discover tables, writes the SQL, runs it, and returns formatted results. Because it operates within VS Code, you can immediately use the results in your code.&lt;/p&gt;
&lt;p&gt;Follow up with analysis:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;For customers with declining order frequency, correlate with support ticket volume. Are our high-value customers churning?&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Copilot maintains context across the conversation and generates cross-table queries automatically.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Ask Copilot in agent mode:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query our Dremio gold-layer views for revenue metrics, then create an HTML dashboard with Chart.js. Include monthly trends, regional breakdown, and top product charts. Add date filters and a dark theme. Save as separate HTML, CSS, and JS files.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Copilot agent mode will:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Call MCP to discover views and schemas&lt;/li&gt;
&lt;li&gt;Execute queries and collect results&lt;/li&gt;
&lt;li&gt;Generate &lt;code&gt;index.html&lt;/code&gt;, &lt;code&gt;styles.css&lt;/code&gt;, and &lt;code&gt;app.js&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Wire everything together with Chart.js&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Open &lt;code&gt;index.html&lt;/code&gt; in a browser for a working dashboard. Since this all happens in VS Code, you can iterate on the design with inline edits.&lt;/p&gt;
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Build interactive tools:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a Streamlit app connected to Dremio via dremioframe. Include schema browsing, data preview with pagination, SQL query editor, and CSV export. Generate all files and a README.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Copilot generates the full application. Run &lt;code&gt;streamlit run app.py&lt;/code&gt; for a local data explorer.&lt;/p&gt;
&lt;h3&gt;Generate Data Pipeline Scripts&lt;/h3&gt;
&lt;p&gt;Use inline completions for data engineering:&lt;/p&gt;
&lt;p&gt;Write a comment: &lt;code&gt;# Medallion pipeline for product_events: bronze ingestion, silver cleaning, gold aggregation&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Copilot generates the complete pipeline following your &lt;code&gt;copilot-instructions.md&lt;/code&gt; conventions. Agent mode can also run the generated code against your Dremio instance to validate it.&lt;/p&gt;
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Create backend services:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a FastAPI app that serves Dremio gold-layer data through REST endpoints. Add customer analytics, revenue by region, and product performance. Include Pydantic models, caching, and OpenAPI docs.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Copilot generates the complete API. Run &lt;code&gt;uvicorn main:app --reload&lt;/code&gt; for a local server.&lt;/p&gt;
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog exploration&lt;/td&gt;
&lt;td&gt;Data analysis, SQL generation, real-time access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;copilot-instructions.md&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, pattern-specific rules&lt;/td&gt;
&lt;td&gt;Teams with repository-wide standards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skills&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Comprehensive Dremio knowledge (CLI, SDK, SQL, API)&lt;/td&gt;
&lt;td&gt;Quick start with broad coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Instructions&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Tailored schemas, patterns, and team conventions&lt;/td&gt;
&lt;td&gt;Mature teams with project-specific needs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Start with the MCP server for immediate value. Add &lt;code&gt;copilot-instructions.md&lt;/code&gt; for conventions. Use &lt;code&gt;.instructions&lt;/code&gt; files for pattern-specific rules.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for a free Dremio Cloud trial&lt;/a&gt; (30 days, $400 in compute credits).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint in &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Create &lt;code&gt;.vscode/mcp.json&lt;/code&gt; with your Dremio MCP server.&lt;/li&gt;
&lt;li&gt;Create &lt;code&gt;.github/copilot-instructions.md&lt;/code&gt; with Dremio conventions.&lt;/li&gt;
&lt;li&gt;Open Copilot in agent mode and ask it to explore your Dremio catalog.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse gives Copilot accurate data context. Combined with Copilot&apos;s massive user base and VS Code integration, this is the lowest-friction path to AI-powered data development for most teams.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, check out the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or enroll in the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with Gemini CLI: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-gemini-cli/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-gemini-cli/</guid><description>
Gemini CLI is Google&apos;s open-source terminal-based AI agent. It runs directly in your terminal, powered by Gemini models with a 1-million token contex...</description><pubDate>Thu, 05 Mar 2026 13:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Gemini CLI is Google&apos;s open-source terminal-based AI agent. It runs directly in your terminal, powered by Gemini models with a 1-million token context window. Dremio is a unified lakehouse platform that provides business context through its semantic layer, universal data access through query federation, and interactive speed through Reflections and Apache Arrow.&lt;/p&gt;
&lt;p&gt;Connecting them gives Gemini CLI the data context it needs to write accurate Dremio SQL, generate pipeline scripts, and build applications against your lakehouse. The 1-million token context window is a significant advantage: Gemini CLI can hold your entire project, documentation, and Dremio schema context simultaneously without the context limitations that constrain other agents.&lt;/p&gt;
&lt;p&gt;Gemini CLI&apos;s &lt;code&gt;GEMINI.md&lt;/code&gt; context file system is similar to &lt;code&gt;CLAUDE.md&lt;/code&gt; in Claude Code. It loads project-specific instructions at session start and supports hierarchical scoping from global defaults to project-specific overrides. The tool also supports MCP natively, Google Search grounding for real-time documentation lookups, and built-in file and shell tools.&lt;/p&gt;
&lt;p&gt;This post covers four approaches, ordered from quickest setup to most customizable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/05/gemini-cli-dremio-architecture.png&quot; alt=&quot;Gemini CLI terminal agent connecting to Dremio Agentic Lakehouse via MCP&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up Gemini CLI&lt;/h2&gt;
&lt;p&gt;If you do not already have Gemini CLI installed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Install Node.js&lt;/strong&gt; (version 18 or later) from &lt;a href=&quot;https://nodejs.org/&quot;&gt;nodejs.org&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install Gemini CLI&lt;/strong&gt; globally via npm:&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;npm install -g @anthropic-ai/gemini-cli
&lt;/code&gt;&lt;/pre&gt;
Or install from source via the &lt;a href=&quot;https://github.com/google-gemini/gemini-cli&quot;&gt;GitHub repository&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authenticate&lt;/strong&gt; by running &lt;code&gt;gemini&lt;/code&gt; in your terminal. On first launch, it will prompt you to sign in with your Google account. Gemini CLI is free to use with a Google account (rate-limited) or with a Gemini API key for higher throughput.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verify the installation&lt;/strong&gt; by asking a question: &lt;code&gt;gemini &amp;quot;What is Apache Iceberg?&amp;quot;&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Gemini CLI runs in your terminal and reads your project files for context. It can execute shell commands, edit files, browse the web via Google Search grounding, and interact with MCP servers.&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;The Model Context Protocol (MCP) is an open standard that lets AI tools call external services. Every Dremio Cloud project ships with a built-in MCP server. Gemini CLI supports MCP natively through its &lt;code&gt;settings.json&lt;/code&gt; configuration.&lt;/p&gt;
&lt;p&gt;For Claude-based tools, Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude plugin&lt;/a&gt; with guided setup. For Gemini CLI, you configure the MCP connection through &lt;code&gt;settings.json&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Find Your Project&apos;s MCP Endpoint&lt;/h3&gt;
&lt;p&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and open your project. Navigate to &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;. The MCP server URL is listed on the project overview page. Copy it.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt; and enter a name (e.g., &amp;quot;Gemini CLI MCP&amp;quot;).&lt;/li&gt;
&lt;li&gt;Add the appropriate redirect URI for your setup.&lt;/li&gt;
&lt;li&gt;Save and copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure Gemini CLI&apos;s MCP Connection&lt;/h3&gt;
&lt;p&gt;Gemini CLI reads MCP server definitions from &lt;code&gt;settings.json&lt;/code&gt;. You can configure this at two levels:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;User-level:&lt;/strong&gt; &lt;code&gt;~/.gemini/settings.json&lt;/code&gt; (applies to all projects)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Project-level:&lt;/strong&gt; &lt;code&gt;.gemini/settings.json&lt;/code&gt; (applies to the current project only)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Create or edit the settings file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;httpUrl&amp;quot;: &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also add MCP servers using the CLI command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;gemini mcp add dremio --httpUrl &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Restart Gemini CLI. The agent now has access to Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; returns available tables with descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column names, types, and metadata.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls wiki descriptions from the catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetTableOrViewLineage&lt;/strong&gt; shows upstream dependencies.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns results as JSON.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test the connection by asking: &amp;quot;What tables are available in Dremio?&amp;quot; Gemini CLI will call &lt;code&gt;GetUsefulSystemTableNames&lt;/code&gt; and return your catalog contents.&lt;/p&gt;
&lt;h3&gt;Self-Hosted Alternative&lt;/h3&gt;
&lt;p&gt;For Dremio Software deployments, use the open-source &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;dremio-mcp&lt;/a&gt; server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/dremio/dremio-mcp
cd dremio-mcp
uv run dremio-mcp-server config create dremioai \
  --uri https://your-dremio-instance.com \
  --pat YOUR_PERSONAL_ACCESS_TOKEN
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In your &lt;code&gt;settings.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
      &amp;quot;args&amp;quot;: [
        &amp;quot;run&amp;quot;,
        &amp;quot;--directory&amp;quot;,
        &amp;quot;/path/to/dremio-mcp&amp;quot;,
        &amp;quot;dremio-mcp-server&amp;quot;,
        &amp;quot;run&amp;quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The self-hosted server supports three modes: &lt;code&gt;FOR_DATA_PATTERNS&lt;/code&gt; for data exploration (default), &lt;code&gt;FOR_SELF&lt;/code&gt; for system analysis, and &lt;code&gt;FOR_PROMETHEUS&lt;/code&gt; for correlating metrics with monitoring.&lt;/p&gt;
&lt;h2&gt;Approach 2: Use GEMINI.md for Dremio Context&lt;/h2&gt;
&lt;p&gt;Gemini CLI auto-loads &lt;code&gt;GEMINI.md&lt;/code&gt; from your project root at the start of every session. It works similarly to &lt;code&gt;CLAUDE.md&lt;/code&gt; in Claude Code, providing persistent instructions that survive across conversations.&lt;/p&gt;
&lt;h3&gt;Hierarchical Context Loading&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;GEMINI.md&lt;/code&gt; supports hierarchical scoping:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Global:&lt;/strong&gt; &lt;code&gt;~/.gemini/GEMINI.md&lt;/code&gt; applies to every project.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Project:&lt;/strong&gt; &lt;code&gt;GEMINI.md&lt;/code&gt; in the project root applies to that specific repo.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Subdirectory:&lt;/strong&gt; &lt;code&gt;GEMINI.md&lt;/code&gt; files in subdirectories provide additional context when working in those folders.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Project-level files override global ones. Subdirectory files add to the project context rather than replacing it.&lt;/p&gt;
&lt;h3&gt;Writing a Dremio-Focused GEMINI.md&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Project Context

This project uses Dremio Cloud as its lakehouse platform.

## Dremio SQL Conventions

- Use `CREATE FOLDER IF NOT EXISTS` (not CREATE NAMESPACE or CREATE SCHEMA)
- Tables in the Open Catalog use `folder.subfolder.table_name` without a catalog prefix
- External federated sources use `source_name.schema.table_name`
- Cast DATE columns to TIMESTAMP for consistent joins
- Use TIMESTAMPDIFF for duration calculations

## Credentials

- Never hardcode Personal Access Tokens. Use environment variable: DREMIO_PAT
- Dremio Cloud endpoint is in environment variable: DREMIO_URI

## API Reference

- REST API docs: https://docs.dremio.com/current/reference/api/
- SQL reference: https://docs.dremio.com/current/reference/sql/
- For detailed SQL validation, read ./dremio-docs/sql-reference.md

## Terminology

- Call it &amp;quot;Agentic Lakehouse&amp;quot;, not &amp;quot;data warehouse&amp;quot;
- &amp;quot;Reflections&amp;quot; are pre-computed optimizations, not &amp;quot;materialized views&amp;quot;
- &amp;quot;Open Catalog&amp;quot; is built on Apache Polaris
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Using Protocol Blocks for Gated Instructions&lt;/h3&gt;
&lt;p&gt;Gemini CLI supports &lt;code&gt;&amp;lt;PROTOCOL&amp;gt;&lt;/code&gt; blocks within &lt;code&gt;GEMINI.md&lt;/code&gt; for instructions that should only activate when specific conditions are met. This prevents context bloat:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;&amp;lt;PROTOCOL&amp;gt;
When the user asks about Dremio SQL or data pipelines:
1. Read ./dremio-docs/sql-reference.md for syntax validation
2. Use Dremio SQL conventions defined above
3. Always verify function names exist in the reference before using them
&amp;lt;/PROTOCOL&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Protocol blocks are a form of delayed instructions. Gemini CLI reads the protocol definition but only executes the instructions when the triggering condition is met. This is more efficient than loading all reference files at session start.&lt;/p&gt;
&lt;h3&gt;Google Search Grounding&lt;/h3&gt;
&lt;p&gt;Gemini CLI has built-in Google Search grounding, meaning it can look up real-time Dremio documentation during a session. You can instruct it in &lt;code&gt;GEMINI.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;## Documentation Strategy

- Before writing any Dremio SQL, use Google Search to verify the syntax
  against the latest Dremio documentation at docs.dremio.com
- If a function name is uncertain, search for it before including it
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is a unique advantage over other agents. Instead of relying solely on pre-loaded context or training data, Gemini CLI can verify syntax against live documentation.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/05/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official plugin&lt;/a&gt; for Claude Code users and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; is an official Dremio product. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository contains a complete skill directory with &lt;code&gt;SKILL.md&lt;/code&gt;, knowledge files, and configuration files for multiple tools.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cd dremio-agent-skill
./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For Gemini CLI, tell the agent to read the skill at session start:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Read dremio-skill/SKILL.md and use the knowledge files in dremio-skill/knowledge/ for Dremio conventions.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The skill includes knowledge files covering Dremio CLI, Python SDK (dremioframe), SQL syntax, and REST API endpoints.&lt;/p&gt;
&lt;h3&gt;dremio-agent-md (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository provides a master protocol file and browsable documentation sitemaps:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference it in your &lt;code&gt;GEMINI.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;## Dremio Documentation

- Read DREMIO_AGENT.md in ./dremio-agent-md/ for the Dremio protocol
- Use sitemaps in dremio_sitemaps/ to verify SQL syntax
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This pairs well with Gemini CLI&apos;s Google Search grounding. The sitemaps provide structured offline references, while Search grounding provides real-time verification.&lt;/p&gt;
&lt;h2&gt;Approach 4: Build Your Own GEMINI.md Context&lt;/h2&gt;
&lt;p&gt;If the pre-built options do not fit your workflow, build a custom &lt;code&gt;GEMINI.md&lt;/code&gt; tailored to your team&apos;s Dremio environment.&lt;/p&gt;
&lt;h3&gt;Create Project Context Files&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;.gemini/
  GEMINI.md              # Points to the reference files below
project-docs/
  dremio-conventions.md  # Team SQL rules
  table-schemas.md       # Exported schemas from Dremio
  common-queries.md      # Frequently used query patterns
  dremioframe-patterns.md # Python SDK code snippets
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Write a Comprehensive GEMINI.md&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Team Dremio Context

## SQL Standards

- All tables are under the analytics namespace
- Bronze: analytics.bronze._, Silver: analytics.silver._, Gold: analytics.gold.\*
- Always use TIMESTAMP, never DATE
- Validate function names against project-docs/dremio-conventions.md

## Authentication

- Use env var DREMIO_PAT for tokens
- Cloud endpoint: env var DREMIO_URI

## Reference Files

- SQL conventions: project-docs/dremio-conventions.md
- Table schemas (updated weekly): project-docs/table-schemas.md
- Common queries: project-docs/common-queries.md
- Python SDK patterns: project-docs/dremioframe-patterns.md

&amp;lt;PROTOCOL&amp;gt;
When writing Dremio SQL:
1. Read project-docs/table-schemas.md to verify table and column names
2. Read project-docs/dremio-conventions.md to validate function names
3. Use Google Search to verify any Dremio function not in the reference
&amp;lt;/PROTOCOL&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The 1-million token context window means Gemini CLI can hold your entire schema reference, convention guide, and query library simultaneously without truncation.&lt;/p&gt;
&lt;h2&gt;Using Dremio with Gemini CLI: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, Gemini CLI becomes a powerful data engineering partner in your terminal. Here are detailed examples.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;Ask Gemini CLI questions about your lakehouse in plain English:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What were our top 10 customers by revenue last quarter? Show month-over-month trends and flag any with declining order frequency.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Gemini CLI uses the MCP connection to discover your tables, writes the SQL, runs it against Dremio, and returns formatted results with analysis. The 1-million token context window means it can hold large result sets and build on them across a session.&lt;/p&gt;
&lt;p&gt;Follow up with multi-step analysis:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;For the customers with declining frequency, pull their support ticket history and calculate the correlation between ticket volume and order decline.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Gemini CLI maintains the full conversation context, including previous query results, and generates the follow-up query with cross-table joins. If it is unsure about a table name or column, it can use Google Search grounding to verify against live Dremio documentation.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Ask Gemini CLI to create a complete dashboard:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query our gold-layer sales views in Dremio and build a local HTML dashboard with Chart.js. Include monthly revenue trends, top products by region, and customer acquisition metrics. Make it filterable by date range and add a dark theme with print-to-PDF.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Gemini CLI will:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use MCP to discover gold-layer views and their schemas&lt;/li&gt;
&lt;li&gt;Write and execute SQL queries for each visualization&lt;/li&gt;
&lt;li&gt;Generate an HTML file with embedded CSS, JavaScript, and Chart.js&lt;/li&gt;
&lt;li&gt;Add interactive filter controls and export buttons&lt;/li&gt;
&lt;li&gt;Save it to your project directory&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Open the HTML file in a browser for a complete dashboard running from a local file. No server or deployment needed.&lt;/p&gt;
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Build an interactive tool:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a Python Streamlit app that connects to Dremio using dremioframe. Include a schema browser sidebar with table counts, a data preview with pagination, a SQL query editor with syntax highlighting and execution, and CSV download. Generate requirements.txt and README.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Gemini CLI writes the full application:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;app.py&lt;/code&gt; with Streamlit layout, dremioframe connection, and query execution&lt;/li&gt;
&lt;li&gt;&lt;code&gt;requirements.txt&lt;/code&gt; with pinned dependencies&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.env.example&lt;/code&gt; with required environment variables&lt;/li&gt;
&lt;li&gt;&lt;code&gt;README.md&lt;/code&gt; with setup and run instructions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Run &lt;code&gt;streamlit run app.py&lt;/code&gt; and your team has a local data explorer connected to the lakehouse.&lt;/p&gt;
&lt;h3&gt;Generate Data Pipeline Scripts&lt;/h3&gt;
&lt;p&gt;Automate data engineering workflows:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Write a dremioframe script that implements a Medallion Architecture pipeline for our new product_events table. Bronze: ingest raw data with column renames and TIMESTAMP casts. Silver: deduplicate on event_id, validate required fields, apply business rules. Gold: aggregate daily active products, event counts by type, and conversion funnels. Include error handling, structured logging, and a dry-run mode.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Gemini CLI uses the GEMINI.md conventions and Dremio skill knowledge to produce production-quality pipeline code. Its Google Search grounding means it can verify Dremio function syntax in real time if the reference files do not cover a specific function.&lt;/p&gt;
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Create backend services:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a FastAPI application that connects to Dremio using dremioframe. Create endpoints for customer segments, revenue by geography, and product performance trends. Include Pydantic response models, request validation, caching with TTL, and auto-generated OpenAPI docs.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Gemini CLI generates the complete API server with proper error handling and connection management. Deploy it locally with &lt;code&gt;uvicorn main:app --reload&lt;/code&gt; or containerize for production.&lt;/p&gt;
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog exploration&lt;/td&gt;
&lt;td&gt;Data analysis, SQL generation, real-time access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GEMINI.md&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, protocol blocks, Search grounding&lt;/td&gt;
&lt;td&gt;Teams with specific SQL standards or project rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skills&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Comprehensive Dremio knowledge (CLI, SDK, SQL, API)&lt;/td&gt;
&lt;td&gt;Quick start with broad coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Context&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Tailored schemas, patterns, and team conventions&lt;/td&gt;
&lt;td&gt;Mature teams with project-specific needs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Combine them for the strongest setup. The MCP server gives live data access; GEMINI.md enforces conventions with protocol blocks; pre-built skills provide broad Dremio knowledge; and custom context files capture your team&apos;s schemas and patterns.&lt;/p&gt;
&lt;p&gt;Start with the MCP server for immediate value. Add a &lt;code&gt;GEMINI.md&lt;/code&gt; with your SQL conventions. Use Google Search grounding as a safety net for syntax verification. As your team develops patterns, build out the context files with schemas and query libraries.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for a free Dremio Cloud trial&lt;/a&gt; (30 days, $400 in compute credits).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint in &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add it to &lt;code&gt;~/.gemini/settings.json&lt;/code&gt; or &lt;code&gt;.gemini/settings.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Clone &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; and tell Gemini CLI to read the skill.&lt;/li&gt;
&lt;li&gt;Start a session and ask Gemini CLI to explore your Dremio catalog.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse gives Gemini CLI accurate data context: the semantic layer provides business meaning, query federation provides universal access, and Reflections provide interactive speed. Gemini CLI&apos;s massive context window holds it all, and Google Search grounding provides real-time verification as a safety net.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, check out the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or enroll in the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with Cursor: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-cursor/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-cursor/</guid><description>
Cursor is an AI-native code editor built as a fork of VS Code. It integrates AI directly into the editing experience with features like Chat, Compose...</description><pubDate>Thu, 05 Mar 2026 12:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Cursor is an AI-native code editor built as a fork of VS Code. It integrates AI directly into the editing experience with features like Chat, Composer (multi-file editing), and inline code generation. Dremio is a unified lakehouse platform that provides business context through its semantic layer, universal data access through query federation, and interactive speed through Reflections and Apache Arrow.&lt;/p&gt;
&lt;p&gt;Connecting them gives Cursor&apos;s AI the context it needs to write accurate Dremio SQL, generate data pipeline code, and build applications against your lakehouse. Without this connection, Cursor treats Dremio like a generic database and guesses at function names and table paths. With it, the AI knows your schemas, your business logic encoded in views, and the correct Dremio SQL dialect.&lt;/p&gt;
&lt;p&gt;Cursor&apos;s rules system is especially well-suited for Dremio integration. Rules files in &lt;code&gt;.cursor/rules/&lt;/code&gt; let you define granular, pattern-matched instructions that activate only when relevant. You can set Dremio SQL conventions to apply only when editing &lt;code&gt;.sql&lt;/code&gt; files, and dremioframe patterns to apply only in Python files that import the SDK.&lt;/p&gt;
&lt;p&gt;This post covers four approaches, ordered from quickest setup to most customizable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/04/cursor-dremio-architecture.png&quot; alt=&quot;Cursor AI code editor connecting to Dremio Agentic Lakehouse via MCP&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up Cursor&lt;/h2&gt;
&lt;p&gt;If you do not already have Cursor installed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Download Cursor&lt;/strong&gt; from &lt;a href=&quot;https://www.cursor.com/&quot;&gt;cursor.com&lt;/a&gt; (available for macOS, Linux, and Windows).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install it&lt;/strong&gt; by running the installer. Cursor replaces or runs alongside VS Code since it is a fork with the same extension ecosystem.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sign in&lt;/strong&gt; with a Cursor account. The free tier includes limited AI requests; Pro ($20/month) provides unlimited access.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open a project&lt;/strong&gt; by selecting File &amp;gt; Open Folder and pointing to your project directory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verify AI access&lt;/strong&gt; by pressing &lt;code&gt;Cmd+K&lt;/code&gt; (macOS) or &lt;code&gt;Ctrl+K&lt;/code&gt; (Windows/Linux) to open the inline AI prompt. Type a question to confirm the AI is responding.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Cursor supports all VS Code extensions, themes, and keybindings. If you are migrating from VS Code, your existing setup transfers automatically.&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;The Model Context Protocol (MCP) is an open standard that lets AI tools call external services. Every Dremio Cloud project ships with a built-in MCP server. Cursor supports MCP natively through its settings panel.&lt;/p&gt;
&lt;p&gt;For Claude-based tools like Claude Code, Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude plugin&lt;/a&gt; with guided setup. For Cursor, you configure the MCP connection through Cursor&apos;s built-in MCP settings.&lt;/p&gt;
&lt;h3&gt;Find Your Project&apos;s MCP Endpoint&lt;/h3&gt;
&lt;p&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and open your project. Navigate to &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;. The MCP server URL is listed on the project overview page. Copy it.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;p&gt;Dremio&apos;s hosted MCP server uses OAuth for authentication. Your existing access controls apply to every query the AI runs.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt; and enter a name (e.g., &amp;quot;Cursor MCP&amp;quot;).&lt;/li&gt;
&lt;li&gt;Add the redirect URIs for Claude:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;https://claude.ai/api/mcp/auth_callback&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;https://claude.com/api/mcp/auth_callback&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Save and copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure Cursor&apos;s MCP Connection&lt;/h3&gt;
&lt;p&gt;In Cursor, go to &lt;strong&gt;Settings &amp;gt; MCP&lt;/strong&gt;. Click &lt;strong&gt;Add new MCP server&lt;/strong&gt; and configure it with your Dremio project&apos;s MCP URL. You can also add the MCP server by creating a &lt;code&gt;.cursor/mcp.json&lt;/code&gt; file in your project root:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;url&amp;quot;: &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;,
      &amp;quot;auth&amp;quot;: {
        &amp;quot;type&amp;quot;: &amp;quot;oauth&amp;quot;,
        &amp;quot;clientId&amp;quot;: &amp;quot;YOUR_CLIENT_ID&amp;quot;
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Restart Cursor. The AI now has access to Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; returns an index of available tables with descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column names, types, and metadata.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls wiki descriptions and labels from the catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetTableOrViewLineage&lt;/strong&gt; shows upstream data dependencies.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns results as JSON.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test the connection by opening Cursor Chat (&lt;code&gt;Cmd+L&lt;/code&gt;) and asking: &amp;quot;What tables are available in Dremio?&amp;quot; The AI will call &lt;code&gt;GetUsefulSystemTableNames&lt;/code&gt; and return your catalog contents.&lt;/p&gt;
&lt;h3&gt;Self-Hosted Alternative&lt;/h3&gt;
&lt;p&gt;For Dremio Software deployments, use the open-source &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;dremio-mcp&lt;/a&gt; server. Clone the repo, configure it, then add it to Cursor&apos;s MCP settings:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/dremio/dremio-mcp
cd dremio-mcp
uv run dremio-mcp-server config create dremioai \
  --uri https://your-dremio-instance.com \
  --pat YOUR_PERSONAL_ACCESS_TOKEN
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In &lt;code&gt;.cursor/mcp.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
      &amp;quot;args&amp;quot;: [
        &amp;quot;run&amp;quot;,
        &amp;quot;--directory&amp;quot;,
        &amp;quot;/path/to/dremio-mcp&amp;quot;,
        &amp;quot;dremio-mcp-server&amp;quot;,
        &amp;quot;run&amp;quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The self-hosted server supports three modes: &lt;code&gt;FOR_DATA_PATTERNS&lt;/code&gt; for data exploration (default), &lt;code&gt;FOR_SELF&lt;/code&gt; for system analysis, and &lt;code&gt;FOR_PROMETHEUS&lt;/code&gt; for correlating metrics with monitoring.&lt;/p&gt;
&lt;h2&gt;Approach 2: Use Cursor Rules for Dremio Context&lt;/h2&gt;
&lt;p&gt;Cursor&apos;s rules system is one of its strongest differentiators. Rules are markdown files in &lt;code&gt;.cursor/rules/&lt;/code&gt; that provide persistent AI instructions. Unlike a single monolithic context file, Cursor rules support pattern matching, so you can scope instructions to specific file types or directories.&lt;/p&gt;
&lt;h3&gt;Project-Wide Rules with .cursorrules&lt;/h3&gt;
&lt;p&gt;The simplest approach is a &lt;code&gt;.cursorrules&lt;/code&gt; file in your project root. This loads into every AI interaction:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Dremio SQL Conventions

- Use CREATE FOLDER IF NOT EXISTS (not CREATE NAMESPACE or CREATE SCHEMA)
- Tables in the Open Catalog use folder.subfolder.table_name without a catalog prefix
- External federated sources use source_name.schema.table_name
- Cast DATE to TIMESTAMP for consistent joins
- Use TIMESTAMPDIFF for duration calculations

# Credentials

- Never hardcode Personal Access Tokens. Use environment variable: DREMIO_PAT
- Dremio Cloud endpoint: environment variable DREMIO_URI

# Terminology

- Call it &amp;quot;Agentic Lakehouse&amp;quot;, not &amp;quot;data warehouse&amp;quot;
- &amp;quot;Reflections&amp;quot; are pre-computed optimizations, not &amp;quot;materialized views&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Pattern-Matched Rules with .cursor/rules/&lt;/h3&gt;
&lt;p&gt;For more granular control, create rule files in &lt;code&gt;.cursor/rules/&lt;/code&gt; with &lt;code&gt;.mdc&lt;/code&gt; (Markdown Cursor) extension. These files support YAML-like frontmatter that tells Cursor when to activate the rule:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
description: Dremio SQL conventions for query files
globs: [&amp;quot;**/*.sql&amp;quot;, &amp;quot;**/queries/**&amp;quot;]
alwaysApply: false
---

# Dremio SQL Rules

When writing or modifying SQL files for Dremio:

- Use CREATE FOLDER IF NOT EXISTS, never CREATE SCHEMA
- Validate function names against the Dremio SQL reference
- Use TIMESTAMPDIFF for duration calculations, not DATEDIFF
- Cast DATE columns to TIMESTAMP before joins
- Reference tables as folder.subfolder.table_name
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create a separate rule for Python SDK usage:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
description: dremioframe Python SDK patterns
globs: [&amp;quot;**/*.py&amp;quot;]
alwaysApply: false
---

# dremioframe Conventions

When writing Python code that uses dremioframe:

- Import as: from dremioframe import DremioConnection
- Use environment variables for credentials: DREMIO_PAT, DREMIO_URI
- Always close connections in a finally block or use context managers
- For bulk operations, use df.to_dremio() with batch_size parameter
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;globs&lt;/code&gt; field ensures these rules only activate when editing matching files. The &lt;code&gt;alwaysApply: false&lt;/code&gt; setting means the AI loads them on demand rather than consuming context tokens on every interaction.&lt;/p&gt;
&lt;h3&gt;Referencing External Documentation&lt;/h3&gt;
&lt;p&gt;Keep rules files concise by pointing to reference documents:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
description: Dremio documentation references
globs: [&amp;quot;**/*.sql&amp;quot;, &amp;quot;**/*.py&amp;quot;]
alwaysApply: false
---

# Dremio Reference Docs

- For SQL syntax details, read `./docs/dremio-sql-reference.md`
- For Python SDK usage, read `./docs/dremioframe-guide.md`
- For REST API endpoints, read `./docs/dremio-rest-api.md`
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Cursor loads the referenced files only when the AI needs them, keeping the context window efficient.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/04/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official plugin&lt;/a&gt; for Claude Code users and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; is an official Dremio product. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository contains a comprehensive skill directory with knowledge files and a &lt;code&gt;.cursorrules&lt;/code&gt; file specifically designed for Cursor.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cd dremio-agent-skill
./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Choose &lt;strong&gt;Local Project Install (Copy)&lt;/strong&gt; to copy the &lt;code&gt;.cursorrules&lt;/code&gt; file and knowledge directory into your project. The &lt;code&gt;.cursorrules&lt;/code&gt; file provides Dremio conventions, and the &lt;code&gt;knowledge/&lt;/code&gt; directory contains detailed references for CLI, Python SDK, SQL syntax, and REST API.&lt;/p&gt;
&lt;p&gt;After installation, Cursor automatically picks up the &lt;code&gt;.cursorrules&lt;/code&gt; file and uses it for all AI interactions in the project.&lt;/p&gt;
&lt;h3&gt;dremio-agent-md (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository provides a &lt;code&gt;DREMIO_AGENT.md&lt;/code&gt; protocol file and browsable sitemaps of the Dremio documentation.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference it in your &lt;code&gt;.cursorrules&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;For Dremio SQL validation, read DREMIO_AGENT.md in the dremio-agent-md directory.
Use the sitemaps in dremio_sitemaps/ to verify syntax before generating SQL.
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Approach 4: Build Your Own Cursor Rules&lt;/h2&gt;
&lt;p&gt;If the pre-built options do not fit your workflow, create a custom rules setup tailored to your team&apos;s Dremio environment.&lt;/p&gt;
&lt;h3&gt;Create Rule Files&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;.cursor/rules/
  dremio-sql.mdc          # SQL conventions
  dremio-python.mdc       # dremioframe patterns
  dremio-schemas.mdc      # Team-specific table schemas
  dremio-api.mdc          # REST API patterns
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Populate with Team Context&lt;/h3&gt;
&lt;p&gt;Export your actual table schemas from Dremio and save them as a rule:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
description: Team Dremio table schemas
globs: [&amp;quot;**/*.sql&amp;quot;, &amp;quot;**/*.py&amp;quot;]
alwaysApply: false
---

# Team Table Schemas

## analytics.gold.customer_metrics

- customer_id: VARCHAR (primary key)
- lifetime_value: DECIMAL(10,2)
- segment: VARCHAR (values: &apos;enterprise&apos;, &apos;mid-market&apos;, &apos;smb&apos;)
- last_order_date: TIMESTAMP
- churn_risk_score: FLOAT

## analytics.gold.revenue_daily

- date_key: TIMESTAMP
- product_category: VARCHAR
- region: VARCHAR
- revenue: DECIMAL(12,2)
- orders: INT
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This gives Cursor exact schema knowledge for your project, so the AI generates SQL with correct column names and types instead of guessing.&lt;/p&gt;
&lt;h3&gt;Add Notepads for Reference Knowledge&lt;/h3&gt;
&lt;p&gt;Cursor also supports &lt;strong&gt;Notepads&lt;/strong&gt; for longer reference documents. Create a notepad in &lt;code&gt;.cursor/notepads/dremio-reference.md&lt;/code&gt; with comprehensive documentation. Notepads are available as &lt;code&gt;@notepad&lt;/code&gt; references in Chat and Composer but do not auto-load, keeping your context efficient.&lt;/p&gt;
&lt;h2&gt;Using Dremio with Cursor: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, Cursor becomes a powerful data development environment. Here are detailed examples.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;Open Cursor Chat (&lt;code&gt;Cmd+L&lt;/code&gt;) and ask questions in plain English:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What were our top 10 products by revenue last quarter? Break it down by region and show the trend.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cursor uses the MCP connection to discover your tables, writes the SQL in the chat, and can run it against Dremio to return results. You get answers without switching to the Dremio UI.&lt;/p&gt;
&lt;p&gt;Follow up with deeper analysis:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Which of those top products has declining margins? Pull cost and revenue data for the last 6 months and show the margin trend.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cursor maintains context across the chat session, building on previous results. This turns the editor into a conversational data analysis tool.&lt;/p&gt;
&lt;p&gt;For teams with non-SQL users, Cursor Chat provides a natural language interface to the lakehouse directly inside the development environment.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Use Cursor Composer (&lt;code&gt;Cmd+I&lt;/code&gt;) for multi-file generation:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query our gold-layer sales views in Dremio and build a local HTML dashboard with Chart.js. Include monthly revenue trends, top products by region, and customer acquisition metrics. Make it filterable by date range. Put the HTML, CSS, and JavaScript in separate files.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cursor Composer will:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use MCP to discover gold-layer views and their schemas&lt;/li&gt;
&lt;li&gt;Create &lt;code&gt;index.html&lt;/code&gt; with the dashboard layout&lt;/li&gt;
&lt;li&gt;Create &lt;code&gt;styles.css&lt;/code&gt; with the dark theme and responsive design&lt;/li&gt;
&lt;li&gt;Create &lt;code&gt;app.js&lt;/code&gt; with Chart.js configurations and data fetching&lt;/li&gt;
&lt;li&gt;Embed query results as JSON data files&lt;/li&gt;
&lt;li&gt;Add interactive filter controls&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Open &lt;code&gt;index.html&lt;/code&gt; in a browser for a complete dashboard running from local files. Cursor Composer excels at multi-file generation, making it ideal for this kind of project scaffolding.&lt;/p&gt;
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Build an interactive tool using Composer:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a Streamlit app that connects to Dremio using dremioframe. Include a schema browser sidebar, a data preview tab with pagination, a SQL query editor with syntax highlighting, and CSV download buttons. Generate requirements.txt and a README.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cursor generates the full application across multiple files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;app.py&lt;/code&gt; with Streamlit layout and dremioframe integration&lt;/li&gt;
&lt;li&gt;&lt;code&gt;requirements.txt&lt;/code&gt; with pinned dependencies&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.env.example&lt;/code&gt; with required environment variables&lt;/li&gt;
&lt;li&gt;&lt;code&gt;README.md&lt;/code&gt; with setup and run instructions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Run &lt;code&gt;pip install -r requirements.txt &amp;amp;&amp;amp; streamlit run app.py&lt;/code&gt; for a local data explorer connected to your lakehouse.&lt;/p&gt;
&lt;h3&gt;Generate Data Pipeline Scripts&lt;/h3&gt;
&lt;p&gt;Automate data engineering with inline AI:&lt;/p&gt;
&lt;p&gt;Highlight a comment in your Python file like &lt;code&gt;# Create bronze-silver-gold pipeline for user_events table&lt;/code&gt; and press &lt;code&gt;Cmd+K&lt;/code&gt;. Cursor generates the complete pipeline code inline:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Bronze: raw data ingestion with column renames and TIMESTAMP casts&lt;/li&gt;
&lt;li&gt;Silver: deduplication, null checks, and type validation&lt;/li&gt;
&lt;li&gt;Gold: business logic aggregations with CASE WHEN classifications&lt;/li&gt;
&lt;li&gt;Error handling with retry logic&lt;/li&gt;
&lt;li&gt;Structured logging for monitoring&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The inline generation respects your &lt;code&gt;.cursor/rules/&lt;/code&gt; Dremio conventions, so the SQL follows your team&apos;s standards automatically.&lt;/p&gt;
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Use Composer to scaffold a REST API:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a FastAPI application that connects to Dremio using dremioframe. Create endpoints for customer segments, revenue analytics, and product performance. Include Pydantic models, request validation, response caching, and auto-generated OpenAPI docs.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cursor generates the complete API across multiple files with proper project structure, ready for &lt;code&gt;uvicorn main:app --reload&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog exploration&lt;/td&gt;
&lt;td&gt;Data analysis, SQL generation, real-time access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cursor Rules&lt;/td&gt;
&lt;td&gt;15 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, pattern-matched context&lt;/td&gt;
&lt;td&gt;Teams with specific SQL standards per file type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skills&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Comprehensive Dremio knowledge (CLI, SDK, SQL, API)&lt;/td&gt;
&lt;td&gt;Quick start with broad coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Rules&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Tailored schemas, patterns, and team conventions&lt;/td&gt;
&lt;td&gt;Mature teams with project-specific needs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Combine them for the strongest setup. The MCP server gives live data access; Cursor rules enforce conventions scoped to relevant file types; pre-built skills provide broad Dremio knowledge; and custom rules capture your team&apos;s specific schemas and patterns.&lt;/p&gt;
&lt;p&gt;Start with the MCP server for immediate value. Add a &lt;code&gt;.cursorrules&lt;/code&gt; file for project-wide conventions. As your team develops specific patterns, create &lt;code&gt;.cursor/rules/*.mdc&lt;/code&gt; files with pattern matching for granular control.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for a free Dremio Cloud trial&lt;/a&gt; (30 days, $400 in compute credits).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint in &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add it in Cursor&apos;s &lt;strong&gt;Settings &amp;gt; MCP&lt;/strong&gt; or create &lt;code&gt;.cursor/mcp.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Clone &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; and run &lt;code&gt;./install.sh&lt;/code&gt; with local project install.&lt;/li&gt;
&lt;li&gt;Open Cursor Chat and ask it to explore your Dremio catalog.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse gives Cursor&apos;s AI accurate data context: the semantic layer provides business meaning, query federation provides universal access, and Reflections provide interactive speed. Cursor&apos;s rules system scopes that context intelligently, activating Dremio knowledge only when relevant.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, check out the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or enroll in the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with Claude CoWork: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-claude-cowork/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-claude-cowork/</guid><description>
Claude CoWork is Anthropic&apos;s desktop agentic assistant. Unlike Claude Code (a terminal coding agent), CoWork operates as a general-purpose autonomous...</description><pubDate>Thu, 05 Mar 2026 11:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Claude CoWork is Anthropic&apos;s desktop agentic assistant. Unlike Claude Code (a terminal coding agent), CoWork operates as a general-purpose autonomous agent that reads and writes files, browses the web, manages tasks, and generates complete project artifacts. Dremio is a unified lakehouse platform that provides business context through its semantic layer, universal data access through query federation, and interactive speed through Reflections.&lt;/p&gt;
&lt;p&gt;CoWork&apos;s strength is autonomous project execution. Give it a goal and grant it folder access, and it works through the steps independently. For data teams, this means CoWork can query your Dremio lakehouse, analyze the results, build a local dashboard, and write a summary report without you watching over every step.&lt;/p&gt;
&lt;p&gt;The context mechanism in CoWork differs from code editors. There is no &lt;code&gt;CLAUDE.md&lt;/code&gt; or &lt;code&gt;AGENTS.md&lt;/code&gt; file. CoWork uses folder instructions and global instructions configured through the Claude Desktop app. This makes the integration approach different, but the end result is the same: an agent that understands your Dremio environment.&lt;/p&gt;
&lt;p&gt;CoWork also has a unique advantage for Dremio users who are not developers. Because CoWork is a desktop assistant rather than a coding tool, analysts and business users can use it to ask natural language questions about their lakehouse data. The MCP connection handles the SQL generation and execution behind the scenes.&lt;/p&gt;
&lt;p&gt;This post covers four approaches, from the quickest MCP connection to building a full Dremio knowledge folder.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/03/cowork-dremio-architecture.png&quot; alt=&quot;Claude CoWork desktop assistant connecting to Dremio Agentic Lakehouse&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up Claude CoWork&lt;/h2&gt;
&lt;p&gt;If you do not already have CoWork set up:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Download Claude Desktop&lt;/strong&gt; from &lt;a href=&quot;https://claude.ai/download&quot;&gt;claude.ai/download&lt;/a&gt; (available for macOS and Windows).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sign in&lt;/strong&gt; with your Anthropic account (Pro, Team, or Enterprise subscription required for CoWork features).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enable CoWork&lt;/strong&gt; in the Claude Desktop app under &lt;strong&gt;Settings &amp;gt; Features &amp;gt; CoWork&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Grant folder access&lt;/strong&gt; by clicking &lt;strong&gt;Add Folder&lt;/strong&gt; and selecting your project directory.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;CoWork operates as a desktop assistant, not a terminal tool. You interact with it through the Claude Desktop interface, describe tasks in natural language, and it autonomously reads files, writes code, browses the web, and generates project artifacts.&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;Every Dremio Cloud project includes a built-in MCP server. CoWork supports MCP through Claude Desktop&apos;s connector system.&lt;/p&gt;
&lt;p&gt;Dremio also provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official plugin for Claude&lt;/a&gt; that streamlines setup. If you use Claude Code alongside CoWork, you can install the plugin directly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/plugin marketplace add dremio/claude-plugins
/plugin install dremio@dremio-plugins
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create a &lt;code&gt;.env&lt;/code&gt; file with your credentials:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;DREMIO_PAT=&amp;lt;your_personal_access_token&amp;gt;
DREMIO_PROJECT_ID=&amp;lt;your_project_id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then add the Dremio MCP server through the &lt;a href=&quot;https://claude.ai&quot;&gt;Claude web interface&lt;/a&gt; under &lt;strong&gt;Customize &amp;gt; Connectors &amp;gt; Add custom connector&lt;/strong&gt;. CoWork automatically inherits MCP connections configured through the Claude web interface. Run &lt;code&gt;/dremio-setup&lt;/code&gt; in Claude Code for step-by-step guidance.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and go to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt; and name it (e.g., &amp;quot;Claude CoWork&amp;quot;).&lt;/li&gt;
&lt;li&gt;Add the redirect URIs for Claude:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;https://claude.ai/api/mcp/auth_callback&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;https://claude.com/api/mcp/auth_callback&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Save and copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure the MCP Connector&lt;/h3&gt;
&lt;p&gt;In Claude Desktop, open &lt;strong&gt;Settings &amp;gt; Connectors&lt;/strong&gt;. Add a custom MCP connector with your Dremio project&apos;s MCP URL and the OAuth client ID. CoWork will now have access to Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; lists available tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column names and types.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls wiki descriptions from the catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns results.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test it by telling CoWork: &amp;quot;Connect to Dremio and list the available tables in my project.&amp;quot; The agent will use the MCP tools to browse your catalog.&lt;/p&gt;
&lt;h3&gt;Self-Hosted MCP&lt;/h3&gt;
&lt;p&gt;For Dremio Software, configure the open-source &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;dremio-mcp&lt;/a&gt; server in Claude Desktop&apos;s &lt;code&gt;claude_desktop_config.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
      &amp;quot;args&amp;quot;: [
        &amp;quot;run&amp;quot;,
        &amp;quot;--directory&amp;quot;,
        &amp;quot;/path/to/dremio-mcp&amp;quot;,
        &amp;quot;dremio-mcp-server&amp;quot;,
        &amp;quot;run&amp;quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Approach 2: Use Folder Instructions for Dremio Context&lt;/h2&gt;
&lt;p&gt;CoWork uses a folder-based context model. When you grant CoWork access to a folder, you can set instructions that apply whenever the agent works within that folder.&lt;/p&gt;
&lt;h3&gt;Setting Global Dremio Instructions&lt;/h3&gt;
&lt;p&gt;In Claude Desktop, go to &lt;strong&gt;Settings &amp;gt; CoWork &amp;gt; Global Instructions&lt;/strong&gt;. Add Dremio conventions that apply to every task:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;When working with Dremio:
- Use CREATE FOLDER IF NOT EXISTS, not CREATE NAMESPACE
- Tables in the Open Catalog use folder.subfolder.table_name without a catalog prefix
- External sources use source_name.schema.table_name
- Cast DATE to TIMESTAMP for consistent joins
- Never hardcode Personal Access Tokens; use environment variables
- Dremio is an Agentic Lakehouse, not a data warehouse
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Setting Folder-Specific Instructions&lt;/h3&gt;
&lt;p&gt;When you grant CoWork access to a project folder, add instructions specific to that project:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;This folder contains a Dremio analytics project.
- Read dremio-docs/sql-reference.md before writing any SQL
- All tables are under the analytics namespace
- Bronze: analytics.bronze.*, Silver: analytics.silver.*, Gold: analytics.gold.*
- Use environment variable DREMIO_PAT for authentication
- Use environment variable DREMIO_URI for the Dremio endpoint
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Folder instructions load whenever CoWork operates in that directory, giving it project-specific context on top of the global Dremio defaults.&lt;/p&gt;
&lt;h3&gt;Agentic Memories&lt;/h3&gt;
&lt;p&gt;CoWork creates &amp;quot;agentic memories&amp;quot; as it works. After a few sessions with Dremio, CoWork builds persistent knowledge about your table schemas, common query patterns, and the SQL conventions it should follow. These memories survive across sessions, so the agent improves over time.&lt;/p&gt;
&lt;p&gt;For example, after CoWork runs its first few Dremio queries in a project, it remembers which tables exist, which columns tend to be useful, and which SQL patterns work best. The next time you ask a question, CoWork draws on this accumulated knowledge to write better queries faster.&lt;/p&gt;
&lt;p&gt;This is equivalent to CLAUDE.md or AGENTS.md but generated automatically rather than written by hand. For teams that do not want to maintain context files manually, agentic memories provide a self-improving alternative.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/03/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Load Pre-Built Dremio Docs into CoWork&lt;/h2&gt;
&lt;p&gt;Two community-supported open-source repositories provide Dremio context that CoWork can read directly.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; The &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;dremio/claude-plugins&lt;/a&gt; plugin and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; are officially maintained by Dremio. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-md: Best Fit for CoWork (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository is the best fit for CoWork. It contains &lt;code&gt;DREMIO_AGENT.md&lt;/code&gt; (a master protocol file) and &lt;code&gt;dremio_sitemaps/&lt;/code&gt; (hierarchical documentation indices).&lt;/p&gt;
&lt;p&gt;Clone it and grant CoWork access to the folder:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Set the folder instructions to:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Before answering any Dremio questions, read DREMIO_AGENT.md in this folder.
Use the sitemaps in dremio_sitemaps/ to verify SQL syntax and find documentation.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;CoWork will read the protocol file, learn the SQL conventions, and use the sitemaps to validate any Dremio queries it generates.&lt;/p&gt;
&lt;h3&gt;dremio-agent-skill: Knowledge Files (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository provides a broader set of knowledge files covering CLI, Python SDK, SQL, and REST API:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Grant CoWork access to this folder and set instructions to: &amp;quot;Read dremio-skill/SKILL.md for Dremio capabilities. Reference the knowledge/ directory for SQL syntax, REST API, and Python SDK documentation.&amp;quot;&lt;/p&gt;
&lt;h2&gt;Approach 4: Build a Custom Dremio Knowledge Folder&lt;/h2&gt;
&lt;p&gt;Create a dedicated folder with everything CoWork needs for your Dremio project:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;dremio-context/
  README.md               # Overview and instructions
  sql-conventions.md       # Team SQL rules
  table-schemas.md         # Exported schemas from Dremio
  common-queries.md        # Frequently used query patterns
  dremioframe-examples.md  # Python SDK code snippets
  rest-api-patterns.md     # API call examples
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Write a &lt;code&gt;README.md&lt;/code&gt; that tells CoWork how to use the folder:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Dremio Project Context

Read this folder to understand our Dremio setup before working on data tasks.

## Quick Reference

- SQL conventions: sql-conventions.md
- Table schemas: table-schemas.md (updated weekly)
- Common queries: common-queries.md
- Python SDK: dremioframe-examples.md
- REST API: rest-api-patterns.md

## Rules

- Always use CREATE FOLDER IF NOT EXISTS
- Use TIMESTAMPDIFF for duration calculations
- Credentials are in environment variables, never hardcoded
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Grant CoWork access to this folder and set folder instructions to: &amp;quot;Before any Dremio task, read README.md in the dremio-context folder.&amp;quot;&lt;/p&gt;
&lt;p&gt;Export your actual table schemas from Dremio regularly and update &lt;code&gt;table-schemas.md&lt;/code&gt;. Include the queries your team runs most often in &lt;code&gt;common-queries.md&lt;/code&gt;. This grows into a living knowledge base that CoWork uses to generate increasingly accurate output.&lt;/p&gt;
&lt;h2&gt;Using Dremio with CoWork: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, CoWork can execute complete data projects autonomously. Here are detailed examples.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;Ask CoWork plain questions about your data:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What were our top 10 products by revenue last quarter? Break it down by region.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;CoWork uses the MCP connection to discover relevant tables, writes the SQL, runs the query against Dremio, and returns formatted results with analysis. No SQL knowledge required on your part.&lt;/p&gt;
&lt;p&gt;Take it further with follow-up questions:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Which of those top products had the highest return rates? Pull the return reasons and show the most common issues.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;CoWork remembers the previous results and builds on them. Its agentic memory system stores what it learns about your tables, so subsequent questions in the same project get faster, more accurate answers.&lt;/p&gt;
&lt;p&gt;This pattern is especially valuable for non-technical users. Business analysts, product managers, and executives can use CoWork to query the lakehouse without learning SQL or navigating the Dremio UI.&lt;/p&gt;
&lt;h3&gt;Build Locally Running Dashboards&lt;/h3&gt;
&lt;p&gt;Tell CoWork to build a complete dashboard:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query our gold-layer sales views in Dremio, then build me a local HTML dashboard with charts showing monthly revenue trends, top customers, and regional breakdowns. Use Chart.js for the visualizations. Add date filters and a dark theme.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;CoWork will:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use MCP to discover gold-layer views and their schemas&lt;/li&gt;
&lt;li&gt;Write and execute SQL queries for each visualization&lt;/li&gt;
&lt;li&gt;Generate an HTML file with embedded CSS, JavaScript, and Chart.js&lt;/li&gt;
&lt;li&gt;Add interactive filter controls for date range and region&lt;/li&gt;
&lt;li&gt;Save it to your project folder&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Open the HTML file in a browser, and you have a working dashboard running entirely from a local file. Share it with stakeholders by dropping it in Slack or email. No server or deployment needed.&lt;/p&gt;
&lt;p&gt;For recurring dashboards, tell CoWork to regenerate it weekly. Its agentic memory remembers the queries and file structure from the previous run.&lt;/p&gt;
&lt;h3&gt;Create Data Exploration Apps&lt;/h3&gt;
&lt;p&gt;Ask CoWork to build a more interactive tool:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a Python Flask app that connects to Dremio using dremioframe. It should let me type a table name and see the schema, preview 100 rows, and run custom SQL queries. Include a clean UI with syntax highlighting and CSV download.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;CoWork writes the Python code, creates the HTML templates, and generates a &lt;code&gt;requirements.txt&lt;/code&gt;. Run &lt;code&gt;pip install -r requirements.txt &amp;amp;&amp;amp; python app.py&lt;/code&gt; and you have a local data exploration app connected to your lakehouse.&lt;/p&gt;
&lt;p&gt;This is especially useful for teams who need quick internal tools without going through a formal development cycle.&lt;/p&gt;
&lt;h3&gt;Generate Automated Reports&lt;/h3&gt;
&lt;p&gt;Schedule CoWork to generate recurring analytical reports:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query this week&apos;s data quality metrics from Dremio&apos;s gold layer, compare them to last week, and write a markdown report with tables and recommendations. Include row count trends, null percentages by column, and any columns that exceeded the 5% null threshold. Save it to the reports/ folder.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;CoWork runs the queries, computes the comparisons, generates a formatted report with tables, and writes recommendations based on the data. The report is ready to share with stakeholders without any manual analysis.&lt;/p&gt;
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Create backend services that serve lakehouse data:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a FastAPI application that queries Dremio&apos;s gold-layer views and serves customer analytics through REST endpoints. Add endpoints for customer segments, revenue by geography, and cohort retention. Include request validation and JSON response formatting.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;CoWork generates the full application with proper error handling and dremioframe connection management. Deploy it locally or containerize it for production.&lt;/p&gt;
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Connector&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog access&lt;/td&gt;
&lt;td&gt;Natural language data exploration, ad-hoc analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Folder Instructions&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, project context&lt;/td&gt;
&lt;td&gt;Teams with specific SQL standards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Docs&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Comprehensive Dremio knowledge&lt;/td&gt;
&lt;td&gt;Quick setup with broad coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Knowledge Folder&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Tailored schemas, queries, and patterns&lt;/td&gt;
&lt;td&gt;Mature teams with specific data models&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Start with the MCP connector. It gives CoWork live data access in five minutes, and you can immediately start asking natural language questions. Add folder instructions and knowledge files as you develop team-specific conventions.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for Dremio Cloud free for 30 days&lt;/a&gt; ($400 in compute credits).&lt;/li&gt;
&lt;li&gt;Set up OAuth and configure the MCP connector in Claude Desktop.&lt;/li&gt;
&lt;li&gt;Clone &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; and grant CoWork folder access.&lt;/li&gt;
&lt;li&gt;Ask CoWork to explore your Dremio catalog.&lt;/li&gt;
&lt;li&gt;Try: &amp;quot;Query my sales data in Dremio and build a local dashboard with Chart.js.&amp;quot;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse gives CoWork the data foundation it needs: the semantic layer provides business context, query federation provides universal data access, and Reflections provide interactive speed. CoWork&apos;s autonomous execution model turns that data access into complete deliverables, from dashboards to reports to data apps.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, see the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or take the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with Claude Code: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-claude-code/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-claude-code/</guid><description>
Claude Code is Anthropic&apos;s terminal-based coding agent. It reads your files, writes code, runs commands, and maintains context across a session. Drem...</description><pubDate>Thu, 05 Mar 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Claude Code is Anthropic&apos;s terminal-based coding agent. It reads your files, writes code, runs commands, and maintains context across a session. Dremio is a unified lakehouse platform that gives AI agents three things they need to answer business questions accurately: deep business context through its semantic layer, universal data access through query federation, and interactive speed through Reflections and Apache Arrow.&lt;/p&gt;
&lt;p&gt;Connecting them means your coding agent can query live data, validate SQL against real schemas, and generate scripts that actually work against your lakehouse. Without this connection, Claude Code treats Dremio like any other database and often hallucinates function names or syntax. With it, the agent knows your table schemas, your business logic encoded in views, and the correct Dremio SQL dialect.&lt;/p&gt;
&lt;p&gt;This post covers four approaches, ordered from quickest setup to most customizable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/02/claude-code-dremio-mcp-architecture.png&quot; alt=&quot;Claude Code connecting to Dremio Agentic Lakehouse via MCP&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up Claude Code&lt;/h2&gt;
&lt;p&gt;If you do not already have Claude Code installed, here is how to get started:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Install Node.js&lt;/strong&gt; (version 18 or later) from &lt;a href=&quot;https://nodejs.org/&quot;&gt;nodejs.org&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install Claude Code&lt;/strong&gt; globally via npm:&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;npm install -g @anthropic-ai/claude-code
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Launch Claude Code&lt;/strong&gt; by running &lt;code&gt;claude&lt;/code&gt; in your terminal from any project directory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authenticate&lt;/strong&gt; with your Anthropic API key or Claude Pro/Team subscription on first launch.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Claude Code runs in your terminal and reads your project files for context. It can execute shell commands, edit files, and interact with MCP servers. No IDE or editor is required.&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;The Model Context Protocol (MCP) is an open standard that lets AI tools call external services. Every Dremio Cloud project ships with a built-in MCP server. Claude Code supports MCP natively. Connecting them takes about five minutes.&lt;/p&gt;
&lt;p&gt;The fastest path is the &lt;strong&gt;official Dremio plugin for Claude Code&lt;/strong&gt; from the &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;dremio/claude-plugins&lt;/a&gt; repository. This is maintained by Dremio and provides guided setup.&lt;/p&gt;
&lt;h3&gt;Find Your Project&apos;s MCP Endpoint&lt;/h3&gt;
&lt;p&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and open your project. Navigate to &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;. The MCP server URL is listed on the project overview page. Copy it.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;p&gt;Dremio&apos;s hosted MCP server uses OAuth for authentication. This means Claude Code connects with your identity and your existing access controls apply to every query the agent runs.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Enter an application name (e.g., &amp;quot;Claude Code MCP&amp;quot;).&lt;/li&gt;
&lt;li&gt;Add the redirect URIs for Claude:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;https://claude.ai/api/mcp/auth_callback&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;https://claude.com/api/mcp/auth_callback&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Save the application and copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure Claude Code&apos;s MCP Client&lt;/h3&gt;
&lt;p&gt;Claude Code reads MCP server definitions from a &lt;code&gt;.mcp.json&lt;/code&gt; file. Create one in your project root:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;url&amp;quot;: &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;,
      &amp;quot;auth&amp;quot;: {
        &amp;quot;type&amp;quot;: &amp;quot;oauth&amp;quot;,
        &amp;quot;clientId&amp;quot;: &amp;quot;YOUR_CLIENT_ID&amp;quot;
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For a global configuration that applies across all your projects, place the file at &lt;code&gt;~/.mcp.json&lt;/code&gt; instead.&lt;/p&gt;
&lt;p&gt;Restart Claude Code. The agent now has access to Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; returns an index of available tables with descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column names, types, and metadata for any table or view.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls wiki descriptions and labels you have set in the Dremio catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetTableOrViewLineage&lt;/strong&gt; shows upstream dependencies.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns results as JSON.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can verify the connection by asking Claude Code: &amp;quot;What tables are available in Dremio?&amp;quot; The agent will call &lt;code&gt;GetUsefulSystemTableNames&lt;/code&gt; and return your catalog contents.&lt;/p&gt;
&lt;h3&gt;Official Dremio Plugin for Claude Code&lt;/h3&gt;
&lt;p&gt;Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude Code plugin&lt;/a&gt; that streamlines setup. Install it from the plugin marketplace:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/plugin marketplace add dremio/claude-plugins
/plugin install dremio@dremio-plugins
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create a &lt;code&gt;.env&lt;/code&gt; file in your project directory with your credentials:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;DREMIO_PAT=&amp;lt;your_personal_access_token&amp;gt;
DREMIO_PROJECT_ID=&amp;lt;your_project_id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then add the Dremio MCP server through the &lt;a href=&quot;https://claude.ai&quot;&gt;Claude web interface&lt;/a&gt; under &lt;strong&gt;Customize &amp;gt; Connectors &amp;gt; Add custom connector&lt;/strong&gt;. Claude Code automatically inherits the connection.&lt;/p&gt;
&lt;p&gt;Run &lt;code&gt;/dremio-setup&lt;/code&gt; in Claude Code for step-by-step guidance. The plugin walks you through OAuth configuration, including setting the redirect URI to &lt;code&gt;http://localhost/callback,https://claude.ai/api/mcp/auth_callback&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This is the recommended starting point for Claude Code users because it is officially maintained by Dremio and handles the configuration details for you.&lt;/p&gt;
&lt;h3&gt;Self-Hosted Alternative&lt;/h3&gt;
&lt;p&gt;If you run Dremio Software instead of Dremio Cloud, use the open-source &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;dremio-mcp&lt;/a&gt; repository:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/dremio/dremio-mcp
cd dremio-mcp
uv run dremio-mcp-server config create dremioai \
  --uri https://your-dremio-instance.com \
  --pat YOUR_PERSONAL_ACCESS_TOKEN
uv run dremio-mcp-server config create claude
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The second command writes the MCP server entry directly into Claude&apos;s desktop config. For Claude Code (terminal), add the server to your &lt;code&gt;.mcp.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
      &amp;quot;args&amp;quot;: [
        &amp;quot;run&amp;quot;,
        &amp;quot;--directory&amp;quot;,
        &amp;quot;/path/to/dremio-mcp&amp;quot;,
        &amp;quot;dremio-mcp-server&amp;quot;,
        &amp;quot;run&amp;quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The self-hosted server supports three modes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;FOR_DATA_PATTERNS&lt;/code&gt; for exploring and querying data (default)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;FOR_SELF&lt;/code&gt; for system introspection and performance analysis&lt;/li&gt;
&lt;li&gt;&lt;code&gt;FOR_PROMETHEUS&lt;/code&gt; for correlating Dremio metrics with Prometheus&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Approach 2: Use CLAUDE.md for Dremio Context&lt;/h2&gt;
&lt;p&gt;MCP gives Claude Code live access to your data. But sometimes you need the agent to follow specific conventions, use the right SQL dialect, or know where to find documentation. The MCP connection tells Claude Code what data exists. Context files tell it how your team works with that data.&lt;/p&gt;
&lt;h3&gt;What CLAUDE.md Does&lt;/h3&gt;
&lt;p&gt;Claude Code auto-loads &lt;code&gt;CLAUDE.md&lt;/code&gt; from your project root at the start of every session. It acts as persistent instructions that survive across conversations. You do not need to re-explain your project every time you start a new session.&lt;/p&gt;
&lt;p&gt;The file supports three placement levels. A global &lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt; applies to every project you open. A project-root &lt;code&gt;CLAUDE.md&lt;/code&gt; applies to that specific repo. And &lt;code&gt;.claude/rules/*.md&lt;/code&gt; files let you split rules into focused modules that are loaded with the same priority. Project-level files override global ones, so you can set organizational defaults and override them per-repo.&lt;/p&gt;
&lt;h3&gt;Writing a Dremio-Focused CLAUDE.md&lt;/h3&gt;
&lt;p&gt;Here is an example &lt;code&gt;CLAUDE.md&lt;/code&gt; that teaches Claude Code how to work with Dremio:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Project Context

This project uses Dremio Cloud as its lakehouse platform.

## Dremio SQL Conventions

- Use `CREATE FOLDER IF NOT EXISTS` (not CREATE NAMESPACE or CREATE SCHEMA)
- Tables in the built-in Open Catalog use `folder.subfolder.table_name` without a catalog prefix
- External federated sources use `source_name.schema.table_name`
- Cast DATE columns to TIMESTAMP for consistent joins
- Use TIMESTAMPDIFF for duration calculations

## Credentials

- Never hardcode Personal Access Tokens. Use environment variable: DREMIO_PAT
- Dremio Cloud endpoint is in environment variable: DREMIO_URI

## API Reference

- REST API docs: https://docs.dremio.com/current/reference/api/
- SQL reference: https://docs.dremio.com/current/reference/sql/
- For detailed SQL validation, read ./dremio-docs/sql-reference.md

## Terminology

- Call it &amp;quot;Agentic Lakehouse&amp;quot;, not &amp;quot;data warehouse&amp;quot;
- &amp;quot;Reflections&amp;quot; are pre-computed optimizations, not &amp;quot;materialized views&amp;quot;
- &amp;quot;Open Catalog&amp;quot; is built on Apache Polaris
- The AI Agent is a co-pilot, not a chatbot
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Progressive Disclosure with Supplemental Files&lt;/h3&gt;
&lt;p&gt;Keep &lt;code&gt;CLAUDE.md&lt;/code&gt; under 300 lines. For detailed references, store them in separate files and tell Claude Code where to find them:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;## Documentation References

- For Dremio SQL syntax details, read `./docs/dremio-sql-reference.md`
- For Python SDK (dremioframe) usage, read `./docs/dremioframe-guide.md`
- For REST API endpoints, read `./docs/dremio-rest-api.md`
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Claude Code only loads these files when it needs them, keeping your context window efficient. You can also instruct the agent explicitly: &amp;quot;Before writing any Dremio SQL, read &lt;code&gt;./docs/dremio-sql-reference.md&lt;/code&gt; to verify syntax.&amp;quot;&lt;/p&gt;
&lt;p&gt;You can also place rule files in &lt;code&gt;.claude/rules/&lt;/code&gt; and they will be auto-loaded with the same priority as &lt;code&gt;CLAUDE.md&lt;/code&gt;. This is useful for separating concerns. For example, &lt;code&gt;.claude/rules/dremio-conventions.md&lt;/code&gt; for SQL rules and &lt;code&gt;.claude/rules/project-style.md&lt;/code&gt; for code style.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/02/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;p&gt;Beyond the official plugin, two community-supported open-source repositories provide ready-made Dremio context for coding agents. Both work with Claude Code.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; The &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;dremio/claude-plugins&lt;/a&gt; plugin is officially maintained by Dremio. The repositories below are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product. Libraries like dremioframe (the Dremio Python SDK referenced in the skill) are also community-supported.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill: Full Agent Skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository contains a complete skill directory that teaches AI assistants how to interact with Dremio.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What is included:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;dremio-skill/
  SKILL.md          # Entry point defining capabilities
  knowledge/        # Comprehensive docs for:
    cli/            #   Dremio CLI administration
    python/         #   dremioframe Python SDK
    sql/            #   SQL syntax, Iceberg DML, metadata
    rest-api/       #   REST API endpoints
  rules/
    .cursorrules    # Config for Cursor/VS Code
    AGENTS.md       # Config for OpenCode/Codex
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Installation:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Run the interactive installer:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cd dremio-agent-skill
./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The installer asks you to choose:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Global Install (Symlink)&lt;/strong&gt; symlinks the skill to &lt;code&gt;~/.claude/skills/&lt;/code&gt; so every Claude Code session discovers it automatically. Updates to the cloned repo are reflected immediately.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Local Project Install (Copy)&lt;/strong&gt; copies the skill into your project directory and sets up &lt;code&gt;.claude&lt;/code&gt; symlinks so Claude Code auto-detects it. The skill travels with your repo, so every team member gets the same context.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;After installation, start Claude Code and try: &amp;quot;Using the Dremio skill, write a dremioframe script to query my customer table.&amp;quot;&lt;/p&gt;
&lt;h3&gt;dremio-agent-md: Documentation Protocol (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository takes a different approach. Instead of a skill with structured knowledge files, it provides a master protocol file and a browsable sitemap of the entire Dremio documentation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What is included:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;DREMIO_AGENT.md&lt;/code&gt; defines how the agent should validate SQL, handle security (credentials via &lt;code&gt;.env&lt;/code&gt;), and navigate the documentation.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dremio_sitemaps/&lt;/code&gt; contains hierarchical markdown indices of official Dremio docs for both Cloud and Software versions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Usage with Claude Code:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Clone the repo into your project or a reference directory:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then tell Claude Code at the start of your session:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Read DREMIO_AGENT.md in the dremio-agent-md directory to understand Dremio protocols. Use the sitemaps in dremio_sitemaps/ to verify any Dremio features or SQL syntax before generating code.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The agent will navigate the sitemaps to find the correct documentation page for whatever feature you are working with, like looking up the right function signature before writing a query.&lt;/p&gt;
&lt;p&gt;This approach is especially useful when you need Claude Code to validate SQL against the official docs rather than rely on its training data.&lt;/p&gt;
&lt;h2&gt;Approach 4: Build Your Own Dremio Skill&lt;/h2&gt;
&lt;p&gt;If the pre-built options do not cover your specific workflow, build a custom skill. A skill is just a directory with a &lt;code&gt;SKILL.md&lt;/code&gt; file and optional supporting docs.&lt;/p&gt;
&lt;h3&gt;Create the Skill Directory&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;my-dremio-skill/
  SKILL.md
  knowledge/
    sql-conventions.md
    rest-api-endpoints.md
    project-schemas.md
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Write SKILL.md&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;SKILL.md&lt;/code&gt; file needs YAML frontmatter for discovery and markdown instructions for the agent:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
name: My Dremio Skill
description: Custom conventions and API patterns for our team&apos;s Dremio Cloud project
---

# My Dremio Skill

## When to Use

Use this skill when working with Dremio queries, dremioframe scripts,
or any code that interacts with our lakehouse.

## SQL Rules

- All tables live under the `analytics` namespace
- Use `analytics.bronze.*` for raw views, `analytics.silver.*` for joins,
  `analytics.gold.*` for final datasets
- Always use TIMESTAMP, never DATE
- Validate function names against `knowledge/sql-conventions.md`

## Authentication

- Use environment variable DREMIO_PAT for Personal Access Tokens
- Cloud endpoint: Use environment variable DREMIO_URI

## Reference Files

- SQL conventions: knowledge/sql-conventions.md
- REST API: knowledge/rest-api-endpoints.md
- Project schemas: knowledge/project-schemas.md
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Install the Skill&lt;/h3&gt;
&lt;p&gt;For Claude Code, place the skill in one of these locations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Global:&lt;/strong&gt; &lt;code&gt;~/.claude/skills/my-dremio-skill/&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Project-local:&lt;/strong&gt; &lt;code&gt;.claude/skills/my-dremio-skill/&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Claude Code discovers skills by reading their &lt;code&gt;SKILL.md&lt;/code&gt; files. When a user prompt matches the skill description, the agent loads the full instructions automatically.&lt;/p&gt;
&lt;h3&gt;Add Knowledge Files&lt;/h3&gt;
&lt;p&gt;Populate the &lt;code&gt;knowledge/&lt;/code&gt; directory with the specific references your team needs. You might include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your project&apos;s table schemas exported from Dremio&lt;/li&gt;
&lt;li&gt;SQL patterns that are specific to your data model&lt;/li&gt;
&lt;li&gt;dremioframe code snippets for common operations&lt;/li&gt;
&lt;li&gt;REST API call examples with your specific endpoints&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The advantage of a custom skill over a generic &lt;code&gt;CLAUDE.md&lt;/code&gt; is discoverability. Skills are loaded on demand based on semantic matching, so they do not consume context tokens until they are needed.&lt;/p&gt;
&lt;h2&gt;Using Dremio with Claude Code: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, Claude Code becomes a data engineering partner. Here are detailed examples you can try immediately.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;The simplest and most powerful use case. Ask Claude Code questions in plain English and get answers from production data:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What were our top 10 customers by revenue last quarter? Show month-over-month trends.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Claude Code uses the MCP connection to discover your tables, writes the SQL, runs it against Dremio, and returns formatted results with analysis. You get answers from production data in seconds without writing a single query yourself.&lt;/p&gt;
&lt;p&gt;You can go deeper with follow-up questions:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Which of those top 10 customers had declining order frequency? Pull their last 6 months of order data and calculate the trend.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Because Claude Code maintains context across the session, it remembers the previous query results and builds on them. The MCP connection gives it live access to run the follow-up query without you needing to re-explain the schema.&lt;/p&gt;
&lt;p&gt;This pattern turns Claude Code into a conversational analytics tool. Business analysts who are comfortable with English but not SQL can use it to explore data, test hypotheses, and generate insights directly from the lakehouse.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Ask Claude Code to create a complete, self-contained dashboard:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query our gold-layer sales views in Dremio and build a local HTML dashboard with Chart.js. Include monthly revenue trends, top products by region, and customer acquisition metrics. Make it filterable by date range and downloadable as PDF.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Claude Code will:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use MCP to discover your gold-layer views and their schemas&lt;/li&gt;
&lt;li&gt;Write and execute SQL queries to pull the relevant data&lt;/li&gt;
&lt;li&gt;Generate an HTML file with embedded CSS, JavaScript, and Chart.js configurations&lt;/li&gt;
&lt;li&gt;Embed the query results directly into the JavaScript as data arrays&lt;/li&gt;
&lt;li&gt;Add interactive filters and a print-to-PDF button&lt;/li&gt;
&lt;li&gt;Save everything to your project directory&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Open the HTML file in a browser and you have an interactive dashboard running from a local file. No server, no deployment, no infrastructure. Share it with your team by dropping it in Slack or email.&lt;/p&gt;
&lt;p&gt;For recurring dashboards, save the prompt in a script and re-run it weekly to regenerate the dashboard with fresh data from Dremio.&lt;/p&gt;
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Build a more interactive tool for ongoing data exploration:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a Python Streamlit app that uses dremioframe to connect to Dremio. Include a schema browser sidebar, a data preview tab with pagination, and a SQL query editor with results. Add download buttons for CSV export and a query history panel.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Claude Code writes the full Python application:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;app.py&lt;/code&gt; with Streamlit layout, dremioframe connection, and query execution&lt;/li&gt;
&lt;li&gt;&lt;code&gt;requirements.txt&lt;/code&gt; with pinned dependencies&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.env.example&lt;/code&gt; showing required environment variables&lt;/li&gt;
&lt;li&gt;&lt;code&gt;README.md&lt;/code&gt; with setup instructions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Run &lt;code&gt;pip install -r requirements.txt &amp;amp;&amp;amp; streamlit run app.py&lt;/code&gt; and you have a local data exploration tool connected to your lakehouse. Your team can use it for ad-hoc analysis without needing direct access to the Dremio UI.&lt;/p&gt;
&lt;p&gt;This pattern works well for creating internal tools quickly. Instead of waiting for a formal BI tool deployment, you can have a working data explorer in minutes.&lt;/p&gt;
&lt;h3&gt;Generate Data Pipeline Scripts&lt;/h3&gt;
&lt;p&gt;Automate data engineering workflows:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Write a dremioframe script that reads new CSV files from the staging folder, creates bronze views in Dremio, builds silver views with data quality validations (null checks, type casting, deduplication), and creates gold views with business logic aggregations. Include error handling, logging, and a dry-run mode.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Claude Code uses the Dremio skill to write production-quality pipeline code that follows Medallion Architecture conventions. The script includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Bronze layer: raw data ingestion with column renames and TIMESTAMP casts&lt;/li&gt;
&lt;li&gt;Silver layer: data quality rules, deduplication, and join logic&lt;/li&gt;
&lt;li&gt;Gold layer: business metric aggregations and CASE WHEN classifications&lt;/li&gt;
&lt;li&gt;Error handling with retry logic for transient Dremio connection issues&lt;/li&gt;
&lt;li&gt;Structured logging for pipeline monitoring&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Create a REST API that serves lakehouse data to other applications:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a FastAPI application that connects to Dremio using dremioframe. Create endpoints for: GET /api/customers (paginated), GET /api/customers/{id}/orders, GET /api/analytics/revenue?period=monthly. Add request validation, error handling, and OpenAPI documentation.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Claude Code generates a complete API server with typed request/response models, query parameterization to prevent SQL injection, and auto-generated Swagger docs. Deploy it locally or containerize it for production use.&lt;/p&gt;
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog exploration&lt;/td&gt;
&lt;td&gt;Data analysis, SQL generation, real-time data access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLAUDE.md&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, doc references, credential rules&lt;/td&gt;
&lt;td&gt;Teams with specific SQL standards or project conventions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skills&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Comprehensive Dremio knowledge (CLI, SDK, SQL, API)&lt;/td&gt;
&lt;td&gt;Getting started quickly with broad Dremio coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Skill&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Tailored to your exact schemas, patterns, and workflows&lt;/td&gt;
&lt;td&gt;Mature teams with project-specific conventions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;These approaches are not mutually exclusive. A common setup combines the MCP server for live data access with a custom &lt;code&gt;CLAUDE.md&lt;/code&gt; for project conventions. Or start with the pre-built &lt;code&gt;dremio-agent-skill&lt;/code&gt; and add a &lt;code&gt;CLAUDE.md&lt;/code&gt; for your team-specific overrides.&lt;/p&gt;
&lt;p&gt;The strongest configuration uses all four layers: MCP for live connectivity, CLAUDE.md for project rules, a pre-built skill for general Dremio knowledge, and custom knowledge files for your specific schemas and patterns.&lt;/p&gt;
&lt;p&gt;If you are evaluating Dremio for the first time, start with the MCP server alone. It takes five minutes and gives you immediate value. As your usage matures and you need the agent to follow team conventions or validate against specific documentation, layer in the context files and skills.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for a free Dremio Cloud trial&lt;/a&gt; (30 days, $400 in compute credits).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint in &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add it to Claude Code&apos;s &lt;code&gt;.mcp.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Clone &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; and run &lt;code&gt;./install.sh&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Start Claude Code and ask it to explore your Dremio catalog.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse gives Claude Code what it needs to write accurate SQL: the semantic layer provides business context, query federation provides universal data access, and Reflections provide interactive speed. The MCP server is the bridge that connects them.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, check out the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or enroll in the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with Amazon Kiro: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-amazon-kiro/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-amazon-kiro/</guid><description>
Amazon Kiro is an agentic AI IDE from AWS that introduces spec-driven development to the coding workflow. Instead of jumping straight to code, Kiro h...</description><pubDate>Thu, 05 Mar 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Amazon Kiro is an agentic AI IDE from AWS that introduces spec-driven development to the coding workflow. Instead of jumping straight to code, Kiro helps you define structured specifications :  requirements, technical designs, and task breakdowns ,  before writing a single line. It then generates code that follows those specs and keeps everything in sync as the project evolves. Dremio is a unified lakehouse platform that provides business context through its semantic layer, universal data access through query federation, and interactive speed through Reflections and Apache Arrow.&lt;/p&gt;
&lt;p&gt;Connecting them gives Kiro&apos;s agent the context it needs to write accurate Dremio SQL, generate data pipelines, and build applications against your lakehouse. Kiro&apos;s spec-driven approach is especially well-suited for data projects: you can define your data model requirements in plain language, let Kiro generate the technical design, and then have it build the implementation with full traceability back to the original requirements.&lt;/p&gt;
&lt;p&gt;Kiro&apos;s hooks system adds event-driven automation, so documentation, tests, and validation can update automatically as your Dremio code changes.&lt;/p&gt;
&lt;p&gt;This post covers four approaches, ordered from quickest setup to most customizable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/01/kiro-dremio-architecture.png&quot; alt=&quot;Amazon Kiro AI IDE connecting to Dremio Agentic Lakehouse via MCP&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up Amazon Kiro&lt;/h2&gt;
&lt;p&gt;If you do not already have Kiro installed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Download Kiro&lt;/strong&gt; from &lt;a href=&quot;https://kiro.dev/&quot;&gt;kiro.dev&lt;/a&gt; (available for macOS, Linux, and Windows).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sign in&lt;/strong&gt; with your AWS account, Google account, or GitHub account.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open a project&lt;/strong&gt; by selecting File &amp;gt; Open Folder and pointing to your project directory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Explore the interface&lt;/strong&gt; : Kiro includes a file explorer, an AI chat panel, a specs panel for viewing requirements/design/tasks, and a hooks panel for event-driven automations.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Kiro is built on the VS Code platform, so existing VS Code extensions and themes are compatible. It is free to use during the preview period.&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;Every Dremio Cloud project ships with a built-in MCP server. Kiro supports MCP natively and integrates deeply with AWS MCP servers.&lt;/p&gt;
&lt;p&gt;For Claude-based tools, Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude plugin&lt;/a&gt; with guided setup. For Kiro, you configure the MCP connection through the IDE settings or project configuration.&lt;/p&gt;
&lt;h3&gt;Find Your Project&apos;s MCP Endpoint&lt;/h3&gt;
&lt;p&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and navigate to &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;. Copy the MCP server URL.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt; and enter a name (e.g., &amp;quot;Kiro MCP&amp;quot;).&lt;/li&gt;
&lt;li&gt;Add the appropriate redirect URIs.&lt;/li&gt;
&lt;li&gt;Save and copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure Kiro&apos;s MCP Connection&lt;/h3&gt;
&lt;p&gt;In Kiro, open the MCP settings and add a new server. You can configure via the settings UI or create a &lt;code&gt;.kiro/mcp.json&lt;/code&gt; file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;url&amp;quot;: &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Kiro now has access to Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; returns available tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column definitions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls catalog descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetTableOrViewLineage&lt;/strong&gt; shows data lineage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns results.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test by asking the AI chat: &amp;quot;What tables are available in Dremio?&amp;quot;&lt;/p&gt;
&lt;h3&gt;Kiro Powers&lt;/h3&gt;
&lt;p&gt;Kiro supports &amp;quot;Powers&amp;quot; : curated bundles of MCP servers, steering files, and hooks for specific development workflows. If an AWS or community Dremio Power becomes available, you can install it from the Powers panel to get a pre-configured Dremio development environment.&lt;/p&gt;
&lt;h3&gt;Self-Hosted Alternative&lt;/h3&gt;
&lt;p&gt;For Dremio Software deployments, configure the dremio-mcp server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
      &amp;quot;args&amp;quot;: [
        &amp;quot;run&amp;quot;,
        &amp;quot;--directory&amp;quot;,
        &amp;quot;/path/to/dremio-mcp&amp;quot;,
        &amp;quot;dremio-mcp-server&amp;quot;,
        &amp;quot;run&amp;quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Approach 2: Use Kiro Specs for Dremio Context&lt;/h2&gt;
&lt;p&gt;Kiro&apos;s spec-driven development is its most distinctive feature. Instead of free-form context files, Kiro uses structured specification documents that the AI generates and maintains.&lt;/p&gt;
&lt;h3&gt;Generating Specs for a Dremio Project&lt;/h3&gt;
&lt;p&gt;Tell Kiro to create specs for your data project:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;I need a data analytics pipeline that reads from Dremio&apos;s lakehouse, transforms the data using a Medallion Architecture, and serves results through a REST API.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kiro generates three spec files in &lt;code&gt;.kiro/specs/&lt;/code&gt;:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;requirements.md&lt;/strong&gt; : User stories in structured format:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;1. As a data engineer, I want to ingest raw data from Dremio bronze tables
   so that I can process it through the pipeline.
2. As a data analyst, I want cleaned data in gold views
   so that I can run accurate business queries.
3. As an application developer, I want REST endpoints over gold data
   so that I can build dashboards and reports.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;design.md&lt;/strong&gt; : Technical design covering architecture, data flow, table schemas, and technology choices.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;tasks.md&lt;/strong&gt; : A breakdown of implementation tasks that Kiro tracks as you build.&lt;/p&gt;
&lt;h3&gt;Adding Dremio Conventions to Specs&lt;/h3&gt;
&lt;p&gt;You can refine the generated specs with Dremio-specific conventions:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Update the design to use Dremio SQL conventions: CREATE FOLDER IF NOT EXISTS, folder.subfolder.table_name paths, TIMESTAMPDIFF for durations. Use dremioframe for Python connections and environment variables for credentials.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kiro updates the design.md and tasks.md to reflect these conventions. All code generated from these specs will follow the conventions automatically.&lt;/p&gt;
&lt;h3&gt;Steering Files&lt;/h3&gt;
&lt;p&gt;Kiro also supports steering files : markdown documents that provide persistent context similar to &lt;code&gt;.cursorrules&lt;/code&gt; or &lt;code&gt;CLAUDE.md&lt;/code&gt;. Create a &lt;code&gt;.kiro/steering/dremio.md&lt;/code&gt; file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Dremio Conventions

## SQL

- Use CREATE FOLDER IF NOT EXISTS
- Tables: folder.subfolder.table_name
- Cast DATE to TIMESTAMP for joins
- Use TIMESTAMPDIFF for durations

## Credentials

- DREMIO_PAT and DREMIO_URI from environment variables
- Never hardcode tokens

## Terminology

- &amp;quot;Agentic Lakehouse&amp;quot; not &amp;quot;data warehouse&amp;quot;
- &amp;quot;Reflections&amp;quot; not &amp;quot;materialized views&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/01/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official plugin&lt;/a&gt; for Claude Code users and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; is an official Dremio product. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository provides knowledge files:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cd dremio-agent-skill
./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Copy the knowledge directory into your project and reference it in Kiro&apos;s steering files.&lt;/p&gt;
&lt;h3&gt;dremio-agent-md (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository provides documentation sitemaps:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference it in &lt;code&gt;.kiro/steering/dremio.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;For SQL validation, read DREMIO_AGENT.md in ./dremio-agent-md/.
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Approach 4: Build Custom Specs and Hooks&lt;/h2&gt;
&lt;p&gt;Kiro&apos;s hooks system offers a unique approach to maintaining data project consistency.&lt;/p&gt;
&lt;h3&gt;Creating Dremio Hooks&lt;/h3&gt;
&lt;p&gt;Hooks are event-driven automations that trigger when files change. Create hooks that automatically validate Dremio SQL:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;On SQL file save&lt;/strong&gt; : A hook that validates SQL syntax against Dremio conventions:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a hook that triggers when any .sql file is saved. It should read the file, validate that it uses CREATE FOLDER IF NOT EXISTS instead of CREATE SCHEMA, checks for proper table path formatting, and flags any deprecated function names.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;On pipeline code change&lt;/strong&gt; : A hook that updates tests:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a hook that triggers when any Python file in the pipelines/ directory changes. It should update the corresponding test file to match the new pipeline logic, using dremioframe mocking patterns.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Hooks keep your Dremio project self-maintaining. As code changes, documentation and tests update automatically.&lt;/p&gt;
&lt;h3&gt;Custom Steering Files&lt;/h3&gt;
&lt;p&gt;Create comprehensive steering files in &lt;code&gt;.kiro/steering/&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;.kiro/steering/
  dremio-sql.md        # SQL conventions
  dremio-python.md     # dremioframe patterns
  dremio-schemas.md    # Team table schemas
  dremio-pipeline.md   # Pipeline architecture rules
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These files are loaded into every Kiro interaction and ensure consistent code generation.&lt;/p&gt;
&lt;h2&gt;Using Dremio with Kiro: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, Kiro&apos;s spec-driven approach creates traceable, well-documented data projects.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;In the chat panel:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What were our top 10 products by revenue last quarter? Show growth rates and compare to the same period last year.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kiro uses MCP to discover tables, writes SQL, and returns results. Unlike other tools, Kiro can also generate a spec that documents the analysis methodology for reproducibility.&lt;/p&gt;
&lt;p&gt;Follow up:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;For the declining products, pull customer sentiment from support tickets. Is there a correlation between product issues and revenue decline?&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kiro tracks the analytical thread and can generate a formal analysis spec for the investigation.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Start with specs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;I need a self-contained HTML dashboard showing Dremio gold-layer metrics: revenue trends, customer acquisition, and regional performance. Spec it out first, then build it.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kiro generates the requirements, design, and tasks first, then builds the dashboard following the specs. Every file traces back to a requirement, making it easy to review and maintain.&lt;/p&gt;
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Spec-driven app development:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Spec and build a Streamlit app connected to Dremio via dremioframe. Requirements: schema browsing, SQL query editor, data preview, CSV export. Generate all files.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kiro creates the full spec, then generates the application. The tasks.md tracks progress, and hooks can keep tests updated as you iterate.&lt;/p&gt;
&lt;h3&gt;Generate Data Pipeline Scripts&lt;/h3&gt;
&lt;p&gt;Spec-driven data engineering:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Spec a Medallion Architecture pipeline for product_events. Requirements: bronze ingestion, silver cleaning, gold aggregation. Design should use dremioframe and follow our SQL conventions. Then implement it.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kiro generates the full spec suite (requirements, design, tasks), then writes the pipeline code. Every transformation traces back to a requirement, and hooks validate the SQL on every save.&lt;/p&gt;
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Spec-driven API development:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Spec and build a FastAPI service over Dremio gold-layer views. Requirements: customer analytics, revenue data, product metrics. Design should include Pydantic models and caching.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kiro generates the complete API with full traceability to the requirements.&lt;/p&gt;
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog exploration&lt;/td&gt;
&lt;td&gt;Data analysis, SQL generation, real-time access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kiro Specs&lt;/td&gt;
&lt;td&gt;15 minutes&lt;/td&gt;
&lt;td&gt;Structured requirements, design, traceable implementation&lt;/td&gt;
&lt;td&gt;Teams valuing documentation and traceability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skills&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Comprehensive Dremio knowledge (CLI, SDK, SQL, API)&lt;/td&gt;
&lt;td&gt;Quick start with broad coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Hooks&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Event-driven validation, auto-updating tests and docs&lt;/td&gt;
&lt;td&gt;Mature teams with CI-like automation needs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Start with the MCP server for live data access. Use Kiro&apos;s spec-driven flow for any project beyond a quick query. Add hooks for automated validation as your project matures.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for a free Dremio Cloud trial&lt;/a&gt; (30 days, $400 in compute credits).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint in &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add it in Kiro&apos;s MCP settings or &lt;code&gt;.kiro/mcp.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Tell Kiro to generate specs for your Dremio data project.&lt;/li&gt;
&lt;li&gt;Let Kiro build the implementation from the specs.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse gives Kiro accurate data context, and Kiro&apos;s spec-driven methodology ensures every line of generated code traces back to a documented requirement. This is especially valuable for data engineering, where auditability and traceability matter.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, check out the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or enroll in the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Apache Druid to Dremio Cloud: Add SQL Joins, AI, and Governance to Your Real-Time Analytics</title><link>https://iceberglakehouse.com/posts/2026-03-connector-apache-druid/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-apache-druid/</guid><description>
Apache Druid is a real-time analytics database designed for sub-second queries on high-ingestion-rate event data. Clickstream analytics, application ...</description><pubDate>Sun, 01 Mar 2026 23:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Apache Druid is a real-time analytics database designed for sub-second queries on high-ingestion-rate event data. Clickstream analytics, application monitoring, IoT telemetry, and ad-tech workloads rely on Druid&apos;s columnar storage and inverted indexes for instantaneous queries.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects to Druid as a federated data source, giving you the ability to join Druid event data with relational databases, data lakes, and cloud warehouses. Dremio adds governance (column masking, row-level filtering), Reflection-based acceleration, and AI capabilities (AI Agent, MCP Server, AI SQL Functions) that Druid doesn&apos;t provide natively.&lt;/p&gt;
&lt;p&gt;Druid excels at one thing: fast aggregation queries on time-series event data. But production analytics rarely involve just one data source. When a product manager asks &amp;quot;Show me user engagement metrics correlated with support ticket volume and revenue impact,&amp;quot; that query requires joining Druid&apos;s event data with a CRM database and a financial system. Druid can&apos;t do these joins natively : it doesn&apos;t support standard SQL JOINs. Dremio bridges this gap by reading Druid data and joining it with any other source in a single SQL query.&lt;/p&gt;
&lt;p&gt;But Druid has fundamental limitations that become painful as analytics needs grow. It doesn&apos;t support traditional SQL joins between datasources. It doesn&apos;t connect to external databases. Its query model is optimized for aggregations on its own ingested segments, not for the kind of cross-source, enriched analytics modern organizations need.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects to Apache Druid and queries it alongside relational databases, data lakes, and cloud warehouses. You get the speed of Druid for real-time aggregations combined with Dremio&apos;s ability to join that data with any other source, accelerate queries with Reflections, apply governance, and enable AI analytics.&lt;/p&gt;
&lt;h2&gt;Why Druid Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;SQL Joins with Real-Time Data&lt;/h3&gt;
&lt;p&gt;Druid doesn&apos;t support traditional SQL joins between datasources. If you want to answer &amp;quot;What is the conversion rate by customer segment in the last hour?&amp;quot; you need the real-time event data from Druid and the customer segment data from your CRM database. Without Dremio, you&apos;d need to either pre-join the data before ingesting into Druid (losing flexibility) or build application code that queries both systems and merges results in memory.&lt;/p&gt;
&lt;p&gt;Dremio queries Druid for its real-time aggregations and joins the results with PostgreSQL customer data, S3 behavior logs, Snowflake revenue data, or any other connected source : all in a single SQL query.&lt;/p&gt;
&lt;h3&gt;Enrich Real-Time Metrics with Business Context&lt;/h3&gt;
&lt;p&gt;Druid provides fast counts, averages, percentiles, and approximate distinct counts on event data. But enriching those metrics with customer names, product descriptions, geographic hierarchies, or organizational data requires joining with dimensional data that lives in other systems.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s federation provides that enrichment without duplicating dimensional data into Druid. Your Druid segments stay lean (just events), and Dremio handles the enrichment at query time.&lt;/p&gt;
&lt;h3&gt;Historical Analysis Across Time Ranges&lt;/h3&gt;
&lt;p&gt;Druid is optimized for recent data (hot segments). Historical analysis across months or years :  trend analysis, year-over-year comparisons ,  often hits cold segments that are slower to query. Dremio&apos;s Reflections cache aggregated historical results, providing fast access to time-series trends without depending on Druid&apos;s tiered storage.&lt;/p&gt;
&lt;h3&gt;Unified Governance&lt;/h3&gt;
&lt;p&gt;Druid has basic authentication but limited access control. There&apos;s no column masking, no row-level filtering, no consistent policy framework. Dremio&apos;s Fine-Grained Access Control adds these capabilities, ensuring that sensitive event data (user IDs, IP addresses, location data) is properly governed across Druid and every other connected source.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Druid Broker hostname or IP address&lt;/strong&gt; : the Broker node handles query routing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port&lt;/strong&gt; : typically &lt;code&gt;8082&lt;/code&gt; for the Broker HTTP API&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; from Dremio Cloud to the Druid Broker&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; : &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-apache-druid-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect Druid to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;In the Dremio console, click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; and select &lt;strong&gt;Apache Druid&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection Details&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; A descriptive identifier (e.g., &lt;code&gt;druid-realtime&lt;/code&gt; or &lt;code&gt;event-analytics&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; Druid Broker hostname or IP.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port:&lt;/strong&gt; Default &lt;code&gt;8082&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Configure credentials if your Druid deployment requires authentication.&lt;/p&gt;
&lt;h3&gt;4. Configure Advanced Settings&lt;/h3&gt;
&lt;p&gt;Set Reflection Refresh, Metadata refresh intervals, and any connection properties.&lt;/p&gt;
&lt;h3&gt;5. Set Privileges and Save&lt;/h3&gt;
&lt;h2&gt;Query Real-Time Druid Data&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Real-time page view metrics
SELECT
  DATE_TRUNC(&apos;hour&apos;, __time) AS event_hour,
  page,
  COUNT(*) AS page_views,
  COUNT(DISTINCT user_id) AS unique_visitors,
  ROUND(COUNT(*) * 1.0 / COUNT(DISTINCT user_id), 2) AS views_per_visitor
FROM &amp;quot;druid-realtime&amp;quot;.druid.pageviews
WHERE __time &amp;gt; CURRENT_TIMESTAMP - INTERVAL &apos;24&apos; HOUR
GROUP BY 1, 2
ORDER BY page_views DESC
LIMIT 20;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate: Enrich Real-Time Data with Business Context&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join Druid real-time events with PostgreSQL user segments and S3 product data
SELECT
  d.event_hour,
  c.user_segment,
  p.product_category,
  SUM(d.page_views) AS total_views,
  COUNT(DISTINCT d.user_id) AS unique_users,
  CASE
    WHEN c.user_segment = &apos;Enterprise&apos; THEN ROUND(SUM(d.page_views) * 2.5, 2)
    WHEN c.user_segment = &apos;Pro&apos; THEN ROUND(SUM(d.page_views) * 1.5, 2)
    ELSE ROUND(SUM(d.page_views) * 0.5, 2)
  END AS estimated_value
FROM (
  SELECT
    DATE_TRUNC(&apos;hour&apos;, __time) AS event_hour,
    user_id,
    page,
    COUNT(*) AS page_views
  FROM &amp;quot;druid-realtime&amp;quot;.druid.pageviews
  WHERE __time &amp;gt; CURRENT_TIMESTAMP - INTERVAL &apos;24&apos; HOUR
  GROUP BY 1, 2, 3
) d
LEFT JOIN &amp;quot;postgres-crm&amp;quot;.public.users c ON d.user_id = c.user_id
LEFT JOIN &amp;quot;s3-catalog&amp;quot;.products.page_mappings p ON d.page = p.page_url
GROUP BY d.event_hour, c.user_segment, p.product_category
ORDER BY estimated_value DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Druid handles the real-time event aggregation, PostgreSQL provides user context, S3 maps pages to products, and Dremio joins everything.&lt;/p&gt;
&lt;h2&gt;Build a Semantic Layer Over Real-Time Data&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.realtime_engagement AS
SELECT
  DATE_TRUNC(&apos;hour&apos;, __time) AS event_hour,
  page,
  COUNT(*) AS page_views,
  COUNT(DISTINCT user_id) AS unique_visitors,
  ROUND(COUNT(*) * 1.0 / COUNT(DISTINCT user_id), 2) AS views_per_visitor,
  CASE
    WHEN COUNT(*) &amp;gt; 10000 THEN &apos;Trending&apos;
    WHEN COUNT(*) &amp;gt; 1000 THEN &apos;Active&apos;
    WHEN COUNT(*) &amp;gt; 100 THEN &apos;Normal&apos;
    ELSE &apos;Low Traffic&apos;
  END AS traffic_tier
FROM &amp;quot;druid-realtime&amp;quot;.druid.pageviews
WHERE __time &amp;gt; CURRENT_TIMESTAMP - INTERVAL &apos;7&apos; DAY
GROUP BY 1, 2;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Navigate to the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; (pencil icon), and &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;. Create descriptions like &amp;quot;realtime_engagement: Hourly page view metrics from the real-time clickstream, classified by traffic tier.&amp;quot;&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on Real-Time Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent lets users ask questions about real-time event data in plain English. Instead of writing complex time-window SQL, a product manager asks &amp;quot;Which pages are trending in the last 6 hours?&amp;quot; or &amp;quot;What&apos;s the average engagement per visitor for enterprise users today?&amp;quot; The Agent reads your wiki descriptions and generates accurate SQL.&lt;/p&gt;
&lt;p&gt;This is particularly valuable for Druid data because time-series queries can be complex : date truncation, windowing, and aggregation syntax varies. The AI Agent handles this complexity automatically.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; connects Claude, ChatGPT, and other AI clients to your Dremio data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth application in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A marketing team lead can ask Claude &amp;quot;Show me our highest-traffic pages from Druid data in the last 24 hours, broken down by user segment&amp;quot; and get real-time insights without writing SQL.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;p&gt;Use AI to classify and analyze real-time event patterns:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify page traffic patterns with AI
SELECT
  page,
  page_views,
  unique_visitors,
  AI_CLASSIFY(
    &apos;Based on this web traffic pattern, classify the likely content type&apos;,
    &apos;Page: &apos; || page || &apos;, Views: &apos; || CAST(page_views AS VARCHAR) || &apos;, Unique visitors: &apos; || CAST(unique_visitors AS VARCHAR) || &apos;, Views per visitor: &apos; || CAST(views_per_visitor AS VARCHAR),
    ARRAY[&apos;Product Page&apos;, &apos;Blog Content&apos;, &apos;Landing Page&apos;, &apos;Documentation&apos;, &apos;Support&apos;]
  ) AS inferred_content_type
FROM analytics.gold.realtime_engagement
WHERE traffic_tier = &apos;Trending&apos;;

-- Generate real-time traffic summaries
SELECT
  event_hour,
  AI_GENERATE(
    &apos;Write a brief traffic summary for this hour&apos;,
    &apos;Hour: &apos; || CAST(event_hour AS VARCHAR) || &apos;, Total Views: &apos; || CAST(SUM(page_views) AS VARCHAR) || &apos;, Unique Visitors: &apos; || CAST(SUM(unique_visitors) AS VARCHAR) || &apos;, Trending Pages: &apos; || CAST(COUNT(CASE WHEN traffic_tier = &apos;Trending&apos; THEN 1 END) AS VARCHAR)
  ) AS hourly_summary
FROM analytics.gold.realtime_engagement
GROUP BY event_hour
ORDER BY event_hour DESC
LIMIT 24;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Accelerate with Reflections&lt;/h2&gt;
&lt;p&gt;For historical aggregations over Druid data, create Reflections:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Build a view that aggregates Druid data by day/hour/week&lt;/li&gt;
&lt;li&gt;Navigate to the view in the &lt;strong&gt;Catalog&lt;/strong&gt; and click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Select columns and set the &lt;strong&gt;Refresh Interval&lt;/strong&gt; : for real-time data, hourly; for historical trends, daily&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dashboard queries for &amp;quot;last 30 days&amp;quot; or &amp;quot;year-over-year&amp;quot; hit the Reflection instead of scanning Druid&apos;s cold segments. Real-time queries for &amp;quot;last hour&amp;quot; still go directly to Druid for sub-second latency.&lt;/p&gt;
&lt;h2&gt;Governance on Real-Time Data&lt;/h2&gt;
&lt;p&gt;Druid has basic authentication but no column masking or row-level filtering. Dremio&apos;s Fine-Grained Access Control (FGAC) adds these capabilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask user IDs, IP addresses, and location data from specific roles. A product manager sees engagement metrics but not individual user data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Restrict real-time data access by team or region. A regional marketing team sees only their region&apos;s clickstream.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies across Druid, PostgreSQL, S3, and all other sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, and MCP Server.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector for direct Arrow Flight access to real-time dashboards&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic access to event data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; for SQL-based transformations on event data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query Druid data from their IDE. Ask Copilot &amp;quot;Show me trending pages from Druid in the last 6 hours&amp;quot; and get SQL generated from your semantic layer.&lt;/p&gt;
&lt;h2&gt;When to Keep Data in Druid vs. Migrate&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in Druid:&lt;/strong&gt; Real-time event streams that need sub-second query latency, high-ingestion-rate data (thousands of events per second), data that powers real-time operational dashboards with sub-second SLAs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical event archives older than 30-90 days, data that needs SQL joins (Druid can&apos;t do them natively), analytics that combine events with dimensional data, data consumed by BI tools that expect standard SQL, archival data for compliance and auditing.&lt;/p&gt;
&lt;p&gt;For active Druid data, create manual Reflections with refresh schedules that balance freshness and performance. For migrated Iceberg data in Dremio&apos;s Open Catalog, you get automated compaction, Autonomous Reflections, and significantly lower storage costs.&lt;/p&gt;
&lt;h2&gt;Real-Time Tiering Strategy&lt;/h2&gt;
&lt;p&gt;Combine Druid&apos;s real-time capabilities with Dremio&apos;s historical analysis:&lt;/p&gt;
&lt;h3&gt;Tier 1: Real-Time (Druid : 0 to 24 hours)&lt;/h3&gt;
&lt;p&gt;Druid ingests and serves sub-second queries on live event data. Dremio queries Druid directly for &amp;quot;last hour&amp;quot; or &amp;quot;last 6 hours&amp;quot; dashboards.&lt;/p&gt;
&lt;h3&gt;Tier 2: Recent Historical (Iceberg : 1 to 90 days)&lt;/h3&gt;
&lt;p&gt;Daily batch jobs move yesterday&apos;s data from Druid into Iceberg tables in Dremio&apos;s Open Catalog. Analytical queries for &amp;quot;last 30 days&amp;quot; hit Iceberg tables with Autonomous Reflections.&lt;/p&gt;
&lt;h3&gt;Tier 3: Long-Term Archive (Iceberg : 90+ days)&lt;/h3&gt;
&lt;p&gt;Older data stays in Iceberg cold storage (S3 Infrequent Access). Compliance and audit queries use time travel against archived snapshots.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Dremio view that combines real-time and historical data
CREATE VIEW analytics.gold.unified_events AS
SELECT event_type, user_id, event_timestamp, &apos;real-time&apos; AS data_tier
FROM &amp;quot;druid-cluster&amp;quot;.clickstream.events
WHERE event_timestamp &amp;gt;= CURRENT_TIMESTAMP - INTERVAL &apos;24&apos; HOUR
UNION ALL
SELECT event_type, user_id, event_timestamp, &apos;historical&apos; AS data_tier
FROM analytics.silver.events_archive
WHERE event_timestamp &amp;lt; CURRENT_TIMESTAMP - INTERVAL &apos;24&apos; HOUR
  AND event_timestamp &amp;gt;= CURRENT_TIMESTAMP - INTERVAL &apos;90&apos; DAY;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Event Pipeline Integration&lt;/h2&gt;
&lt;p&gt;Common Druid deployment patterns that work with Dremio:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Kafka → Druid → Dremio:&lt;/strong&gt; Real-time events flow through Kafka into Druid. Dremio queries Druid for analytics and joins with slow-changing dimensional data from PostgreSQL or S3.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kafka → Druid + S3:&lt;/strong&gt; Events land in both Druid (real-time) and S3 (archive). Dremio queries both seamlessly through federation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kinesis → Druid → Dremio:&lt;/strong&gt; AWS-native pattern where Kinesis streams feed Druid, and Dremio provides multi-source analytics over streamed data.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Apache Druid users can extend their real-time analytics with cross-source joins, AI-powered insights, enterprise governance, and Reflection-based acceleration : all through Dremio Cloud.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-apache-druid-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your Druid cluster.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect MongoDB to Dremio Cloud: SQL Analytics on Document Data</title><link>https://iceberglakehouse.com/posts/2026-03-connector-mongodb/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-mongodb/</guid><description>
MongoDB is the most popular NoSQL document database. It stores data in flexible JSON-like documents, making it ideal for applications with evolving s...</description><pubDate>Sun, 01 Mar 2026 22:00:00 GMT</pubDate><content:encoded>&lt;p&gt;MongoDB is the most popular NoSQL document database. It stores data in flexible JSON-like documents, making it ideal for applications with evolving schemas : user profiles, product catalogs, IoT sensor data, and content management systems. But MongoDB&apos;s document model creates analytics challenges: you can&apos;t run SQL joins natively, aggregation pipelines are complex, and connecting MongoDB data to relational sources requires custom application code or ETL.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects to MongoDB and exposes its collections as SQL-queryable tables. Nested documents appear as structured columns, and you can join MongoDB data with relational databases, data lakes, and cloud warehouses using standard SQL.&lt;/p&gt;
&lt;h2&gt;Why MongoDB Users Need Dremio&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;SQL on documents.&lt;/strong&gt; MongoDB&apos;s query language (MQL) is powerful but different from SQL. Your analysts know SQL. Dremio transforms MongoDB collections into SQL-queryable tables, so analysts don&apos;t need to learn MQL or write aggregation pipelines.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Join documents with relational data.&lt;/strong&gt; Your user profiles are in MongoDB, your order data is in PostgreSQL, and your marketing data is in S3. Without Dremio, combining these requires application code that queries each system separately and merges results in memory. Dremio federates all three in a single SQL query.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Flatten nested structures.&lt;/strong&gt; MongoDB documents often contain nested objects and arrays. Dremio&apos;s &lt;code&gt;FLATTEN&lt;/code&gt; function expands arrays into rows, and nested objects become addressable columns (e.g., &lt;code&gt;address.city&lt;/code&gt;, &lt;code&gt;preferences.theme&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Consistent governance.&lt;/strong&gt; MongoDB has authentication and roles, but they don&apos;t extend to other data sources. Dremio&apos;s FGAC applies consistent column masking and row filtering across MongoDB and all other connected sources.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AI analytics.&lt;/strong&gt; MongoDB&apos;s unstructured nature makes it difficult for AI tools to query directly. Dremio&apos;s semantic layer creates structured views with business context, enabling the AI Agent to answer questions about MongoDB data.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;MongoDB hostname or IP address&lt;/strong&gt; (or MongoDB Atlas connection string)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port&lt;/strong&gt; : default &lt;code&gt;27017&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database name(s)&lt;/strong&gt; : MongoDB databases you want to query&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Username and password&lt;/strong&gt; (with &lt;code&gt;read&lt;/code&gt; role on target databases)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; : port 27017 open to Dremio Cloud. For MongoDB Atlas, add Dremio&apos;s IP range to the Atlas IP Access List&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; : &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-mongodb-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Connect MongoDB to Dremio Cloud&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; and select &lt;strong&gt;MongoDB&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Enter &lt;strong&gt;Name&lt;/strong&gt;, &lt;strong&gt;Host&lt;/strong&gt;, &lt;strong&gt;Port&lt;/strong&gt; (27017).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authentication Type:&lt;/strong&gt; Choose Standard (username/password) or No Authentication.&lt;/li&gt;
&lt;li&gt;Configure &lt;strong&gt;Advanced Options&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Use SSL:&lt;/strong&gt; Enable for MongoDB Atlas or SSL-configured instances.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Auth Database:&lt;/strong&gt; The database used for authentication (default: &lt;code&gt;admin&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Read preference:&lt;/strong&gt; Control whether queries hit primary or secondary replicas (&lt;code&gt;primary&lt;/code&gt;, &lt;code&gt;primaryPreferred&lt;/code&gt;, &lt;code&gt;secondary&lt;/code&gt;, &lt;code&gt;secondaryPreferred&lt;/code&gt;, &lt;code&gt;nearest&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Subpartition size:&lt;/strong&gt; Controls how Dremio partitions large collections for parallel reads.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Configure &lt;strong&gt;Reflection Refresh&lt;/strong&gt; and &lt;strong&gt;Metadata&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Set &lt;strong&gt;Privileges&lt;/strong&gt; and &lt;strong&gt;Save&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Query MongoDB Data with SQL&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query a MongoDB collection as a SQL table
SELECT user_id, name, email, signup_date
FROM &amp;quot;mongo-users&amp;quot;.app.users
WHERE signup_date &amp;gt; &apos;2024-01-01&apos;
ORDER BY signup_date DESC;

-- Access nested fields
SELECT
  user_id,
  name,
  address.city AS city,
  address.state AS state,
  preferences.theme AS ui_theme
FROM &amp;quot;mongo-users&amp;quot;.app.users
WHERE address.state = &apos;CA&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Flatten Nested Arrays&lt;/h2&gt;
&lt;p&gt;MongoDB documents frequently contain arrays. Use &lt;code&gt;FLATTEN&lt;/code&gt; to expand them into rows:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- If each user document has an orders array
SELECT
  u.user_id,
  u.name,
  o.order_id,
  o.total_amount,
  o.order_date
FROM &amp;quot;mongo-users&amp;quot;.app.users u,
  FLATTEN(u.orders) AS t(o)
WHERE o.total_amount &amp;gt; 100
ORDER BY o.order_date DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate MongoDB with Relational Sources&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join MongoDB user profiles with PostgreSQL orders and S3 analytics
SELECT
  m.name AS customer_name,
  m.address.city AS city,
  COUNT(pg.order_id) AS total_orders,
  SUM(pg.amount) AS total_spent,
  COUNT(s3.event_id) AS engagement_events
FROM &amp;quot;mongo-users&amp;quot;.app.users m
LEFT JOIN &amp;quot;postgres-orders&amp;quot;.public.orders pg ON m.user_id = pg.customer_id
LEFT JOIN &amp;quot;s3-events&amp;quot;.clickstream.events s3 ON m.user_id = s3.user_id
GROUP BY m.name, m.address.city
ORDER BY total_spent DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.customer_profile AS
SELECT
  m.user_id,
  m.name,
  m.email,
  m.address.city AS city,
  m.address.state AS state,
  m.signup_date,
  CASE
    WHEN m.subscription.tier = &apos;premium&apos; THEN &apos;Premium&apos;
    WHEN m.subscription.tier = &apos;pro&apos; THEN &apos;Pro&apos;
    ELSE &apos;Free&apos;
  END AS subscription_tier
FROM &amp;quot;mongo-users&amp;quot;.app.users m;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Navigate to the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; (pencil icon), go to the &lt;strong&gt;Details&lt;/strong&gt; tab, and click &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;. Dremio&apos;s generative AI samples the view schema and data to produce descriptions like: &amp;quot;customer_profile: Contains one row per user combining profile data from MongoDB with subscription tier classification.&amp;quot; Review and refine these descriptions : add business context like &amp;quot;Premium subscribers qualify for the dedicated support tier and priority feature access.&amp;quot;&lt;/p&gt;
&lt;p&gt;These wikis and labels are the context that powers Dremio&apos;s AI capabilities.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on MongoDB Data&lt;/h2&gt;
&lt;p&gt;MongoDB&apos;s flexible document model makes it notoriously difficult for AI tools to query directly : nested objects, variable schemas, and BSON types create barriers. Dremio&apos;s semantic layer solves this by creating structured, well-documented views over MongoDB data that AI tools can understand and query accurately.&lt;/p&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent lets business users ask questions about MongoDB data in plain English. Instead of learning MongoDB&apos;s aggregation framework or SQL with nested field syntax, a product manager asks &amp;quot;How many Premium subscribers are in California?&amp;quot; and the Agent generates the correct SQL using your semantic layer.&lt;/p&gt;
&lt;p&gt;The Agent reads the wiki descriptions you attached to views to understand what &amp;quot;Premium&amp;quot; means in your data (subscription.tier = &apos;premium&apos;), what &amp;quot;California&amp;quot; maps to (address.state = &apos;CA&apos;), and which view to query. Better wikis produce more accurate AI responses.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; extends AI capabilities to external chat clients. Connect Claude or ChatGPT to your MongoDB data through the hosted MCP Server with OAuth authentication:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth application in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs (e.g., &lt;code&gt;https://claude.ai/api/mcp/auth_callback&lt;/code&gt; for Claude, &lt;code&gt;https://chatgpt.com/connector_platform_oauth_redirect&lt;/code&gt; for ChatGPT)&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt; (US) or &lt;code&gt;mcp.eu.dremio.cloud/mcp/{project_id}&lt;/code&gt; (EU)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Now your team can ask Claude &amp;quot;Show me user growth trends by subscription tier from MongoDB data&amp;quot; and get governed, accurate results : without knowing MongoDB query syntax or SQL.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;p&gt;Use Dremio&apos;s built-in AI SQL functions to enrich MongoDB data directly in queries:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify users based on their MongoDB profile data
SELECT
  name,
  subscription_tier,
  city,
  state,
  AI_CLASSIFY(
    &apos;Based on this user profile, classify their likely engagement level&apos;,
    &apos;Name: &apos; || name || &apos;, Subscription: &apos; || subscription_tier || &apos;, City: &apos; || city || &apos;, State: &apos; || state,
    ARRAY[&apos;Highly Engaged&apos;, &apos;Active&apos;, &apos;At Risk&apos;, &apos;Churned&apos;]
  ) AS engagement_prediction
FROM analytics.gold.customer_profile
WHERE subscription_tier IN (&apos;Premium&apos;, &apos;Pro&apos;);

-- Generate personalized outreach messages
SELECT
  name,
  subscription_tier,
  AI_GENERATE(
    &apos;Write a one-sentence personalized upgrade message for this user&apos;,
    &apos;User: &apos; || name || &apos;, Current Tier: &apos; || subscription_tier || &apos;, Location: &apos; || city || &apos;, &apos; || state
  ) AS upgrade_message
FROM analytics.gold.customer_profile
WHERE subscription_tier = &apos;Free&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;AI_CLASSIFY&lt;/code&gt; categorizes users based on profile attributes. &lt;code&gt;AI_GENERATE&lt;/code&gt; creates personalized text. Both run inline in your SQL queries, enriching MongoDB data with AI in real time.&lt;/p&gt;
&lt;h2&gt;Accelerate MongoDB Analytics with Reflections&lt;/h2&gt;
&lt;p&gt;MongoDB isn&apos;t designed for heavy analytical workloads. Running 50 dashboard queries per hour against MongoDB competes with your application&apos;s read/write operations. Create Reflections on your MongoDB views to cache results:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the view in the Catalog&lt;/li&gt;
&lt;li&gt;Create a Reflection with the columns and aggregations used most&lt;/li&gt;
&lt;li&gt;Set the refresh interval (e.g., every 30 minutes for near-real-time, hourly for daily reporting)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;BI tools connected to Dremio via Arrow Flight or ODBC get sub-second response times from Reflections : MongoDB handles zero analytical load.&lt;/p&gt;
&lt;h2&gt;MongoDB-Specific Considerations&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Schema sampling.&lt;/strong&gt; MongoDB is schema-less : each document can have different fields. Dremio samples documents to infer the schema. If your documents have highly variable schemas, some fields might not appear until more documents are sampled. You can increase the sample size in the source configuration.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Read preference.&lt;/strong&gt; For MongoDB replica sets, use &lt;code&gt;secondaryPreferred&lt;/code&gt; to route analytical queries to secondary replicas, avoiding impact on your primary node&apos;s CRUD operations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data types.&lt;/strong&gt; MongoDB&apos;s BSON types map to Dremio types: &lt;code&gt;ObjectID&lt;/code&gt; → &lt;code&gt;VARCHAR&lt;/code&gt;, &lt;code&gt;NumberLong&lt;/code&gt; → &lt;code&gt;BIGINT&lt;/code&gt;, &lt;code&gt;NumberInt&lt;/code&gt; → &lt;code&gt;INT&lt;/code&gt;, &lt;code&gt;Date&lt;/code&gt; → &lt;code&gt;TIMESTAMP&lt;/code&gt;. Nested objects become structured columns addressable with dot notation. Arrays can be flattened with &lt;code&gt;FLATTEN&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MongoDB Atlas.&lt;/strong&gt; Add Dremio Cloud&apos;s IP range to your Atlas IP Access List. Enable SSL in the Dremio connection settings. Use the standard connection string hostname (not the SRV hostname).&lt;/p&gt;
&lt;h2&gt;When to Keep Data in MongoDB vs. Migrate to Iceberg&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in MongoDB:&lt;/strong&gt; Data your application actively reads and writes, documents with evolving schemas that benefit from MongoDB&apos;s flexibility, operational data where real-time updates matter, data where document-level transactions are important.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical user data, analytics-heavy aggregations, datasets that need SQL joins with relational sources, time-series data you query in aggregate, data consumed primarily by BI tools or AI agents.&lt;/p&gt;
&lt;p&gt;For data that stays in MongoDB, create manual Reflections with refresh schedules matching your data freshness needs. This offloads analytical load from MongoDB while keeping data current. For migrated Iceberg data, Dremio provides automated compaction, time travel, results caching, and Autonomous Reflections.&lt;/p&gt;
&lt;h2&gt;Governance on MongoDB Data&lt;/h2&gt;
&lt;p&gt;MongoDB has database-level and collection-level access control, but no column masking or row-level filtering. Dremio&apos;s Fine-Grained Access Control (FGAC) adds these capabilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask user emails, phone numbers, or payment details from specific roles. A product analyst sees user behavior patterns but not PII.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Restrict data by user role. A regional team sees only their region&apos;s user data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies across MongoDB, PostgreSQL, S3, Snowflake, and all other sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, and MCP Server.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector : turns MongoDB documents into tabular data for Tableau&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic access to flattened MongoDB data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; for SQL-based transformations on MongoDB data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query MongoDB data from their IDE. Ask Copilot &amp;quot;Show me user signup trends from MongoDB&amp;quot; and get SQL generated using your semantic layer : no aggregation pipeline knowledge needed.&lt;/p&gt;
&lt;h2&gt;Schema Flattening and Nested Documents&lt;/h2&gt;
&lt;p&gt;MongoDB stores data as nested JSON documents. Dremio automatically converts nested structures into queryable columns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Top-level fields&lt;/strong&gt; map directly to columns (&lt;code&gt;name&lt;/code&gt;, &lt;code&gt;email&lt;/code&gt;, &lt;code&gt;created_at&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Nested objects&lt;/strong&gt; use dot notation (&lt;code&gt;address.city&lt;/code&gt;, &lt;code&gt;address.state&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Arrays&lt;/strong&gt; can be flattened using &lt;code&gt;FLATTEN()&lt;/code&gt; to create one row per array element&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Flatten nested order items from MongoDB documents
SELECT
  o.customer_id,
  o.order_date,
  f.item_name,
  f.quantity,
  f.unit_price
FROM &amp;quot;mongodb-app&amp;quot;.ecommerce.orders o,
LATERAL FLATTEN(o.items) AS f(item_name, quantity, unit_price);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This SQL approach is simpler than MongoDB&apos;s aggregation pipeline (&lt;code&gt;$unwind&lt;/code&gt;, &lt;code&gt;$lookup&lt;/code&gt;, &lt;code&gt;$group&lt;/code&gt;) for most analytical queries.&lt;/p&gt;
&lt;h2&gt;Dremio vs. MongoDB Atlas Data Federation&lt;/h2&gt;
&lt;p&gt;MongoDB Atlas Data Federation provides SQL-like access to MongoDB data. Key differences:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Dremio Cloud&lt;/th&gt;
&lt;th&gt;Atlas Data Federation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cross-source joins&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PostgreSQL, S3, Snowflake, etc.&lt;/td&gt;
&lt;td&gt;MongoDB + S3 only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reflections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Cache results&lt;/td&gt;
&lt;td&gt;❌ Query every time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Natural language queries&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Governance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Column masking + row filtering&lt;/td&gt;
&lt;td&gt;MongoDB role-based access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BI connectivity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Arrow Flight (10-100x faster)&lt;/td&gt;
&lt;td&gt;ODBC/JDBC only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semantic layer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Views with wiki + tags&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Dremio provides a broader analytical platform, while Atlas Data Federation is specific to the MongoDB ecosystem.&lt;/p&gt;
&lt;h2&gt;Document-to-Analytics Pipeline&lt;/h2&gt;
&lt;p&gt;Optimize how MongoDB data flows into analytics:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Source layer:&lt;/strong&gt; Dremio reads MongoDB collections directly : no ETL&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flattened views:&lt;/strong&gt; Create SQL views that flatten nested documents into tabular format&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enrichment:&lt;/strong&gt; Join flattened MongoDB data with relational and data lake sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Semantic layer:&lt;/strong&gt; Create business-ready views with wiki descriptions for AI&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This pipeline runs entirely in SQL, eliminating the need for custom Python/Node.js ETL scripts to extract and transform MongoDB data.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;MongoDB users can query their document data with SQL, flatten nested structures, join MongoDB with relational databases and data lakes, and enable AI analytics : all without ETL pipelines or learning MongoDB&apos;s aggregation framework.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-mongodb-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your MongoDB instances.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Vertica to Dremio Cloud: Federation for Analytics-Optimized Data</title><link>https://iceberglakehouse.com/posts/2026-03-connector-vertica/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-vertica/</guid><description>
Vertica is a columnar analytics database engineered for fast aggregate queries on large datasets. It was built from the ground up for analytical work...</description><pubDate>Sun, 01 Mar 2026 21:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Vertica is a columnar analytics database engineered for fast aggregate queries on large datasets. It was built from the ground up for analytical workloads : column-oriented storage, massively parallel processing, and automatic database design optimization. Organizations running Vertica typically have years of investment in analytics infrastructure: curated schemas, optimized projections, and sophisticated workloads that depend on Vertica&apos;s high-performance query engine.&lt;/p&gt;
&lt;p&gt;But Vertica has limitations that become more painful as data ecosystems grow. Licensing costs scale with data volume. Federation with non-Vertica sources requires complex ETL. And connecting Vertica data to modern cloud tools, AI platforms, and cross-cloud architectures requires exporting data or building custom connectors.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects to Vertica and queries it alongside your other data sources. Dremio&apos;s predicate pushdowns leverage Vertica&apos;s columnar engine for filtering and aggregation, while Reflections cache results to reduce ongoing Vertica compute load. You keep Vertica for what it does well and extend its reach to every other system in your organization.&lt;/p&gt;
&lt;h2&gt;Why Vertica Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Reduce Vertica License Costs&lt;/h3&gt;
&lt;p&gt;Vertica&apos;s licensing model ties cost to data volume and node count. Every analytical query consumes cluster resources. As your data grows and more teams want access, the cost of scaling Vertica becomes significant. Dremio&apos;s Reflections provide an alternative: pre-compute the results of your most common queries and serve them from Dremio&apos;s cache instead of hitting Vertica on every request. Dashboard queries, scheduled reports, and ad-hoc exploration can all be served from Reflections, reducing the compute pressure on your Vertica cluster.&lt;/p&gt;
&lt;h3&gt;Federate with Cloud Sources&lt;/h3&gt;
&lt;p&gt;Vertica excels at analytical queries on its own data, but your organization&apos;s data lives in many places: S3 data lakes, PostgreSQL application databases, Snowflake cloud warehouses, MongoDB document stores. Without a federation layer, combining these with Vertica data requires ETL pipelines that extract from each source, transform, and load into Vertica. Dremio queries each source in place and joins the results : no data movement needed.&lt;/p&gt;
&lt;h3&gt;Modernize Without a Big-Bang Migration&lt;/h3&gt;
&lt;p&gt;Migrating away from Vertica is a large, risky project. Dremio lets you gradually shift analytical workloads. Start by querying Vertica through Dremio alongside new cloud-native sources (Apache Iceberg tables, S3 data lakes). As confidence grows, migrate specific datasets from Vertica to Iceberg tables in Dremio&apos;s Open Catalog, where they benefit from automated maintenance and lower storage costs. The migration happens incrementally, and Vertica continues serving critical workloads throughout.&lt;/p&gt;
&lt;h3&gt;Unified Governance&lt;/h3&gt;
&lt;p&gt;Vertica has its own access control, but it doesn&apos;t extend to your other data sources. Dremio&apos;s Fine-Grained Access Control applies consistent column masking and row-level filtering across Vertica, PostgreSQL, S3, and every other connected source from a single governance layer.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Vertica hostname or IP address&lt;/strong&gt; : the coordinator node of your Vertica cluster&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port&lt;/strong&gt; : Vertica defaults to &lt;code&gt;5433&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database name&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Username and password&lt;/strong&gt; : a Vertica user with &lt;code&gt;SELECT&lt;/code&gt; privileges on the tables you want to query&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; : port 5433 must be reachable from Dremio Cloud&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; : &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-vertica-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect Vertica to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Vertica Source&lt;/h3&gt;
&lt;p&gt;In the Dremio console, click the &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; button in the left sidebar and select &lt;strong&gt;Vertica&lt;/strong&gt; from the database source types.&lt;/p&gt;
&lt;h3&gt;2. Configure General Settings&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; A descriptive identifier (e.g., &lt;code&gt;analytics-vertica&lt;/code&gt; or &lt;code&gt;web-analytics&lt;/code&gt;). This name appears in SQL queries as the source prefix.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; Your Vertica coordinator host.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port:&lt;/strong&gt; Default &lt;code&gt;5433&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database:&lt;/strong&gt; The Vertica database name.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Provide the username and password for a Vertica user with read access. You can also use Secret Resource URL for password management through AWS Secrets Manager.&lt;/p&gt;
&lt;h3&gt;4. Configure Advanced Options&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Record fetch size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rows per batch from Vertica&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maximum Idle Connections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Idle connection pool size&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Idle Time (s)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds before idle connections close&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom JDBC parameters&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;5. Set Reflection and Metadata Refresh&lt;/h3&gt;
&lt;p&gt;Configure how often Reflections refresh (re-query Vertica) and how often Dremio checks for new tables or schema changes. Click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Query Vertica Data from Dremio&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT device_type, COUNT(*) AS sessions, AVG(session_duration_seconds) AS avg_duration,
  SUM(page_views) AS total_page_views
FROM &amp;quot;analytics-vertica&amp;quot;.web.sessions
WHERE session_date &amp;gt;= &apos;2024-01-01&apos; AND session_date &amp;lt; &apos;2024-07-01&apos;
GROUP BY device_type
ORDER BY sessions DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Dremio pushes the date filter and aggregation to Vertica&apos;s columnar engine, which processes them efficiently against its compressed, column-oriented storage.&lt;/p&gt;
&lt;h2&gt;Federate Vertica with Other Sources&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join Vertica web analytics with PostgreSQL CRM and S3 marketing data
SELECT
  c.customer_segment,
  COUNT(v.session_id) AS total_sessions,
  AVG(v.session_duration_seconds) AS avg_session_duration,
  COUNT(DISTINCT v.user_id) AS unique_visitors,
  SUM(s3.ad_spend) AS marketing_spend,
  ROUND(COUNT(v.session_id) / NULLIF(SUM(s3.ad_spend), 0) * 1000, 2) AS sessions_per_thousand_dollars
FROM &amp;quot;analytics-vertica&amp;quot;.web.sessions v
JOIN &amp;quot;postgres-crm&amp;quot;.public.customers c ON v.user_id = c.customer_id
LEFT JOIN &amp;quot;s3-marketing&amp;quot;.campaigns.spend_by_segment s3 ON c.customer_segment = s3.segment
WHERE v.session_date &amp;gt;= &apos;2024-01-01&apos;
GROUP BY c.customer_segment
ORDER BY total_sessions DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Vertica handles the session aggregation, PostgreSQL handles the customer lookup, and Dremio handles the cross-source join.&lt;/p&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.web_performance AS
SELECT
  v.device_type,
  v.session_date,
  COUNT(*) AS sessions,
  AVG(v.session_duration_seconds) AS avg_duration_seconds,
  SUM(v.page_views) AS total_page_views,
  SUM(CASE WHEN v.converted = true THEN 1 ELSE 0 END) AS conversions,
  ROUND(SUM(CASE WHEN v.converted = true THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS conversion_rate_pct,
  CASE
    WHEN AVG(v.session_duration_seconds) &amp;gt; 300 THEN &apos;High Engagement&apos;
    WHEN AVG(v.session_duration_seconds) &amp;gt; 120 THEN &apos;Moderate Engagement&apos;
    ELSE &apos;Low Engagement&apos;
  END AS engagement_tier
FROM &amp;quot;analytics-vertica&amp;quot;.web.sessions v
GROUP BY v.device_type, v.session_date;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Navigate to the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; (pencil icon) on the view, and &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;. This creates the business context that powers AI features.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on Vertica Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The built-in AI Agent lets users ask questions about your Vertica data in plain English. Instead of writing complex analytical SQL, a marketing manager can ask &amp;quot;What&apos;s our conversion rate on mobile this quarter?&amp;quot; The Agent reads the wiki descriptions attached to your views, understands what &amp;quot;conversion rate&amp;quot; and &amp;quot;mobile&amp;quot; mean in your data, and generates the correct SQL.&lt;/p&gt;
&lt;p&gt;The quality of the AI Agent&apos;s responses depends directly on the quality of your semantic layer. Wikis that explain &amp;quot;conversion_rate_pct is the percentage of web sessions that resulted in a purchase&amp;quot; produce better results than technical column names alone.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; extends AI capabilities to external chat clients. Connect Claude or ChatGPT to your Dremio data through the hosted MCP Server with OAuth authentication:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth application in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Now your team can ask Claude &amp;quot;Analyze our web engagement trends from Vertica data this quarter&amp;quot; and get accurate, governed results : without writing SQL or accessing Vertica directly.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;p&gt;Use AI SQL functions directly in queries to enrich Vertica data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify web sessions by potential value
SELECT
  session_id,
  device_type,
  page_views,
  session_duration_seconds,
  AI_CLASSIFY(
    &apos;Based on this browsing behavior, classify the user intent&apos;,
    &apos;Device: &apos; || device_type || &apos;, Pages: &apos; || CAST(page_views AS VARCHAR) || &apos;, Duration: &apos; || CAST(session_duration_seconds AS VARCHAR) || &apos;s&apos;,
    ARRAY[&apos;Purchase Intent&apos;, &apos;Research&apos;, &apos;Browsing&apos;, &apos;Bounced&apos;]
  ) AS predicted_intent
FROM &amp;quot;analytics-vertica&amp;quot;.web.sessions
WHERE session_date = CURRENT_DATE;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;AI_CLASSIFY&lt;/code&gt; runs LLM inference inside your SQL query, classifying each web session from Vertica data into intent categories. &lt;code&gt;AI_GENERATE&lt;/code&gt; can produce narrative summaries, and &lt;code&gt;AI_SIMILARITY&lt;/code&gt; can find semantic matches between text fields.&lt;/p&gt;
&lt;h2&gt;Accelerate Vertica Queries with Reflections&lt;/h2&gt;
&lt;p&gt;Create Reflections on your most frequently queried views:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the view in the &lt;strong&gt;Catalog&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; (full cache) or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt; (pre-computed metrics)&lt;/li&gt;
&lt;li&gt;Select columns and aggregations&lt;/li&gt;
&lt;li&gt;Set the &lt;strong&gt;Refresh Interval&lt;/strong&gt; : for Vertica data that updates daily, daily refresh works; for real-time dashboards, match the refresh to your SLA&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;BI tools connected via Arrow Flight or ODBC get sub-second response times from Reflections, even though the underlying data lives in Vertica. A conversion analytics dashboard that queries Vertica 96 times per day with a daily Reflection refresh consumes Vertica resources only once : a 99% reduction in cluster load.&lt;/p&gt;
&lt;h2&gt;Governance Across Vertica and Other Sources&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) provides governance that extends beyond Vertica to every connected source:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask conversion rates, revenue data, or user identifiers from specific roles. A product manager sees engagement metrics but not raw revenue.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Restrict data visibility based on user roles. Regional teams see only their region&apos;s data automatically.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies across Vertica, PostgreSQL, S3, BigQuery, and all other sources : no per-database policy management.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, MCP Server, and Arrow Flight clients.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic data access to Vertica analytics&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; for SQL-based transformations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query Vertica data from their IDE. Ask Copilot &amp;quot;Show me conversion rates by device type from web analytics&amp;quot; and get SQL generated from your semantic layer : without switching to the Dremio console.&lt;/p&gt;
&lt;h2&gt;When to Keep Data in Vertica vs. Migrate to Iceberg&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in Vertica:&lt;/strong&gt; Active analytical workloads optimized with Vertica projections, data with complex Vertica-specific features (database designer optimizations, flex tables), workloads that depend on Vertica&apos;s sub-second response times for real-time dashboards.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical data and archives, datasets consumed by non-Vertica tools, data where Vertica licensing cost per TB exceeds the analytical value, datasets that benefit from time travel and automated compaction.&lt;/p&gt;
&lt;p&gt;For data that stays in Vertica, create manual Reflections to reduce query load. For migrated data, Dremio&apos;s Open Catalog provides automated compaction, time travel, and Autonomous Reflections at a fraction of the per-TB cost.&lt;/p&gt;
&lt;h2&gt;Vertica Deployment Modes and Dremio&lt;/h2&gt;
&lt;p&gt;Vertica has two deployment modes, both compatible with Dremio:&lt;/p&gt;
&lt;h3&gt;Enterprise Mode (On-Premises)&lt;/h3&gt;
&lt;p&gt;Traditional deployment with local storage. Dremio connects via JDBC and pushes SQL operations to Vertica&apos;s engine when possible. Reflections are particularly valuable here : they offload analytical queries and reduce the on-premises compute needed.&lt;/p&gt;
&lt;h3&gt;EON Mode (Cloud-Optimized)&lt;/h3&gt;
&lt;p&gt;Vertica&apos;s compute-storage separation architecture on AWS, Azure, or GCP. Dremio connects the same way, but EON mode&apos;s elastic compute makes Reflections&apos; cost-saving impact even more significant , when Dremio serves cached results, EON subclusters can scale down.&lt;/p&gt;
&lt;h2&gt;Vertica-Specific SQL Considerations&lt;/h2&gt;
&lt;p&gt;Dremio handles most Vertica SQL natively. For Vertica-specific syntax:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Projections:&lt;/strong&gt; Vertica projections are transparent to Dremio : Vertica automatically uses optimal projections for queries pushed down&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flex tables:&lt;/strong&gt; Dremio reads flex table columns as VARCHAR : cast to appropriate types in your Dremio views&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;COPY LOCAL:&lt;/strong&gt; Not available through Dremio : use Dremio&apos;s own CREATE TABLE AS SELECT for data loading&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vertica ML functions:&lt;/strong&gt; Use external queries for Vertica-specific ML functions: &lt;code&gt;SELECT * FROM TABLE(&amp;quot;vertica-analytics&amp;quot;.EXTERNAL_QUERY(&apos;SELECT PREDICT_LINEAR...&apos;))&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Migration ROI Example&lt;/h2&gt;
&lt;p&gt;A mid-sized organization with 50TB in Vertica Enterprise:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Current cost:&lt;/strong&gt; ~$500K/year in Vertica licensing (per-TB pricing)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Migrate 30TB of historical data to Iceberg:&lt;/strong&gt; Eliminates 60% of licensed data volume&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Remaining 20TB in Vertica:&lt;/strong&gt; Active analytical workloads, protected by Reflections&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Net result:&lt;/strong&gt; Potential 40-60% reduction in Vertica licensing costs, with improved analytics capabilities (AI, federation, governance) on all data&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Vertica users can reduce licensing pressure, federate with cloud sources, modernize incrementally, and add AI analytics : all through Dremio Cloud. Connect your Vertica cluster to Dremio, create Reflections on your most-queried tables, and start tracking the reduction in Vertica query load as Dremio serves cached results.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-vertica-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your Vertica cluster alongside your other data sources.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Azure Synapse Analytics to Dremio Cloud: Multi-Cloud Data Warehouse Federation</title><link>https://iceberglakehouse.com/posts/2026-03-connector-azure-synapse/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-azure-synapse/</guid><description>
Microsoft Azure Synapse Analytics combines big data analytics and enterprise data warehousing into a single Azure-integrated platform. If your organi...</description><pubDate>Sun, 01 Mar 2026 20:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Microsoft Azure Synapse Analytics combines big data analytics and enterprise data warehousing into a single Azure-integrated platform. If your organization has chosen the Microsoft cloud ecosystem, your cleaned and modeled analytical data likely lives in Synapse dedicated SQL pools or serverless SQL pools. Synapse works well within Azure, but it creates challenges when you need to connect that data with AWS, Google Cloud, or on-premises databases. Azure Data Factory pipelines handle some of this, but they add cost, latency, and engineering complexity.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects to Azure Synapse and federates it with every other data source in your organization. Synapse queries push down to Synapse&apos;s engine for processing, and Dremio handles cross-source joins, query acceleration with Reflections, unified governance, and AI-powered analytics. You keep your investment in Synapse while extending its reach beyond the Azure ecosystem.&lt;/p&gt;
&lt;h2&gt;Why Azure Synapse Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Multi-Cloud Analytics Without Data Movement&lt;/h3&gt;
&lt;p&gt;Your Azure Synapse workspace holds curated sales and finance data, but your application database runs on Amazon RDS (PostgreSQL), your marketing attribution data is in Google BigQuery, and your raw event logs sit in Amazon S3. Without a federation layer, joining these datasets requires Azure Data Factory to extract data from non-Azure sources, transform it, and load it into Synapse : a process that can take hours and costs real money in compute and data egress.&lt;/p&gt;
&lt;p&gt;Dremio eliminates this entirely. Connect Synapse, PostgreSQL, BigQuery, and S3 as separate sources in Dremio, and write a single SQL query that joins across all four. Dremio&apos;s query optimizer pushes filtering and aggregation to each source (predicate pushdown), transfers only the results, and handles the cross-source join in its own engine. No pipelines. No data movement.&lt;/p&gt;
&lt;h3&gt;Cost Optimization Through Reflections&lt;/h3&gt;
&lt;p&gt;Synapse dedicated SQL pools charge based on the Data Warehouse Units (DWUs) provisioned, and serverless pools charge per TB of data processed. Dashboard queries that run every 15 minutes, ad-hoc exploration by analysts, and scheduled reports all consume Synapse compute resources.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s Reflections create pre-computed materializations of your most frequently run queries. After the initial execution, subsequent queries that match the Reflection pattern are served from Dremio&apos;s cache : not from Synapse. This can reduce Synapse compute consumption by 50-80% for dashboard and reporting workloads, directly lowering your Azure bill.&lt;/p&gt;
&lt;h3&gt;Unified Governance Across Clouds&lt;/h3&gt;
&lt;p&gt;Azure Synapse has role-based access control and Azure Active Directory integration within the Azure ecosystem. But those policies don&apos;t extend to your AWS databases or Google Cloud data. Dremio&apos;s Fine-Grained Access Control (FGAC) applies consistent column masking (hiding Social Security numbers, email addresses) and row-level filtering (restricting data by region or department) across Synapse and every other connected source. One governance policy, applied everywhere.&lt;/p&gt;
&lt;h3&gt;The Semantic Layer for Business Context&lt;/h3&gt;
&lt;p&gt;Raw Synapse tables have technical column names and no business context. Dremio lets you create views that encapsulate business logic (what &amp;quot;active customer&amp;quot; or &amp;quot;quarterly revenue&amp;quot; means), then attach wiki descriptions and labels to those views. This semantic layer makes your data self-documenting and powers Dremio&apos;s AI capabilities.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;p&gt;Before connecting Azure Synapse to Dremio Cloud, confirm you have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Synapse SQL endpoint&lt;/strong&gt; : the fully qualified server name from your Synapse workspace (e.g., &lt;code&gt;myworkspace.sql.azuresynapse.net&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port number&lt;/strong&gt; : default &lt;code&gt;1433&lt;/code&gt; (Synapse uses the same port as SQL Server)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database name&lt;/strong&gt; : the specific SQL pool (dedicated or serverless) you want to connect&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Username and password&lt;/strong&gt; : SQL authentication credentials with read access to the tables you want to query&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; : Synapse&apos;s firewall must allow connections from Dremio Cloud&apos;s IP addresses. Configure this in the Synapse workspace&apos;s networking settings&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; : &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-azure-synapse-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect Azure Synapse to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Synapse Source&lt;/h3&gt;
&lt;p&gt;In the Dremio console, click the &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; button in the left sidebar and select &lt;strong&gt;Microsoft Azure Synapse Analytics&lt;/strong&gt; from the database source types. Alternatively, navigate to &lt;strong&gt;Databases&lt;/strong&gt; and click &lt;strong&gt;Add database&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure General Settings&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; A descriptive identifier for this source (e.g., &lt;code&gt;synapse-analytics&lt;/code&gt; or &lt;code&gt;azure-sales-warehouse&lt;/code&gt;). This name appears in your SQL queries as the source prefix. Cannot include &lt;code&gt;/&lt;/code&gt;, &lt;code&gt;:&lt;/code&gt;, &lt;code&gt;[&lt;/code&gt;, or &lt;code&gt;]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; Your Synapse SQL endpoint (e.g., &lt;code&gt;myworkspace.sql.azuresynapse.net&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port:&lt;/strong&gt; Default &lt;code&gt;1433&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database:&lt;/strong&gt; The SQL pool name you want to connect to.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Master Credentials:&lt;/strong&gt; Enter the SQL authentication username and password with &lt;code&gt;SELECT&lt;/code&gt; permissions on the schemas and tables you want to query.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secret Resource URL:&lt;/strong&gt; Store the password in AWS Secrets Manager and provide the ARN. Dremio fetches the password at connection time for centralized credential management.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Configure Advanced Options&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Record fetch size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rows per batch from Synapse&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maximum Idle Connections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Idle connection pool size&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Idle Time (s)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds before idle connections close&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Encrypt Connection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enable SSL/TLS encryption&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom JDBC parameters&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;5. Set Reflection Refresh and Metadata&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reflection Refresh:&lt;/strong&gt; How often Dremio re-queries Synapse to update cached materializations. For dashboards with hourly data, set to 1-4 hours. For stable reporting data, daily or weekly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metadata Refresh:&lt;/strong&gt; How often Dremio checks for new tables or schema changes. Default 1 hour for discovery, 1 hour for details.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;6. Set Privileges and Save&lt;/h3&gt;
&lt;p&gt;Optionally restrict which Dremio users or roles can access this Synapse source. Click &lt;strong&gt;Save&lt;/strong&gt; to create the connection.&lt;/p&gt;
&lt;h2&gt;Query Azure Synapse Data from Dremio&lt;/h2&gt;
&lt;p&gt;Once connected, browse your Synapse schemas and tables in the SQL Runner:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT region, product_line, SUM(revenue) AS total_revenue, COUNT(order_id) AS order_count
FROM &amp;quot;synapse-analytics&amp;quot;.dbo.sales_summary
WHERE fiscal_year = 2024 AND region IN (&apos;EMEA&apos;, &apos;APAC&apos;, &apos;Americas&apos;)
GROUP BY region, product_line
ORDER BY total_revenue DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Dremio pushes the &lt;code&gt;WHERE&lt;/code&gt; clause and aggregation to Synapse : only the summarized result crosses the network.&lt;/p&gt;
&lt;h2&gt;Federate Azure Synapse with Other Sources&lt;/h2&gt;
&lt;p&gt;The real power emerges when you combine Synapse data with non-Azure sources:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join Synapse sales data with AWS-hosted CRM and S3 marketing data
SELECT
  syn.region,
  syn.product_line,
  syn.total_revenue AS synapse_revenue,
  pg.customer_count,
  s3.marketing_spend,
  ROUND(syn.total_revenue / NULLIF(s3.marketing_spend, 0), 2) AS revenue_per_marketing_dollar
FROM (
  SELECT region, product_line, SUM(revenue) AS total_revenue
  FROM &amp;quot;synapse-analytics&amp;quot;.dbo.sales_summary
  WHERE fiscal_year = 2024
  GROUP BY region, product_line
) syn
LEFT JOIN (
  SELECT region, COUNT(DISTINCT customer_id) AS customer_count
  FROM &amp;quot;postgres-crm&amp;quot;.public.customers
  GROUP BY region
) pg ON syn.region = pg.region
LEFT JOIN &amp;quot;s3-marketing&amp;quot;.campaigns.regional_spend s3
  ON syn.region = s3.region
ORDER BY revenue_per_marketing_dollar DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Three clouds (Azure, AWS, S3), one query, no ETL pipelines.&lt;/p&gt;
&lt;h2&gt;Build a Semantic Layer Over Synapse Data&lt;/h2&gt;
&lt;p&gt;Create views that translate technical Synapse schemas into business-friendly analytics:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.regional_performance AS
SELECT
  s.region,
  s.product_line,
  SUM(s.revenue) AS total_revenue,
  SUM(s.cost) AS total_cost,
  SUM(s.revenue) - SUM(s.cost) AS gross_profit,
  ROUND((SUM(s.revenue) - SUM(s.cost)) / NULLIF(SUM(s.revenue), 0) * 100, 1) AS profit_margin_pct,
  CASE
    WHEN SUM(s.revenue) &amp;gt; 1000000 THEN &apos;Major Market&apos;
    WHEN SUM(s.revenue) &amp;gt; 250000 THEN &apos;Growth Market&apos;
    ELSE &apos;Emerging Market&apos;
  END AS market_tier
FROM &amp;quot;synapse-analytics&amp;quot;.dbo.sales_summary s
WHERE s.fiscal_year = 2024
GROUP BY s.region, s.product_line;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Navigate to the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; (pencil icon) on this view, go to the &lt;strong&gt;Details&lt;/strong&gt; tab, and click &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;. Dremio&apos;s generative AI samples the view schema and data to produce descriptions that help analysts and AI tools understand the dataset.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on Synapse Data&lt;/h2&gt;
&lt;p&gt;Dremio provides three AI capabilities that transform how you work with Synapse data:&lt;/p&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The built-in AI Agent lets users ask questions about your Synapse data in plain English. Instead of writing SQL, a business user can ask &amp;quot;What&apos;s our profit margin by region?&amp;quot; and the AI Agent generates the correct SQL based on the semantic layer (wikis, labels, view definitions) you&apos;ve built.&lt;/p&gt;
&lt;p&gt;The AI Agent reads the wiki descriptions you attached to your views to understand what columns mean in business terms. This is why the semantic layer matters : better metadata produces more accurate AI-generated queries. For example, if your &lt;code&gt;regional_performance&lt;/code&gt; view has a wiki that says &amp;quot;profit_margin_pct represents the gross profit margin after cost of goods sold,&amp;quot; the Agent uses that context to correctly answer &amp;quot;Which regions are most profitable?&amp;quot;&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; extends AI capabilities beyond Dremio&apos;s own interface. It&apos;s an open-source project that enables AI chat clients like Claude and ChatGPT to securely interact with your Dremio data using natural language.&lt;/p&gt;
&lt;p&gt;The Dremio-hosted MCP Server provides OAuth support, which guarantees user identity, authentication, and authorization for all interactions. Once connected, you can use natural language in Claude or ChatGPT to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Explore your Synapse data schemas and tables&lt;/li&gt;
&lt;li&gt;Run analytical queries and get results&lt;/li&gt;
&lt;li&gt;Create visualizations from query results&lt;/li&gt;
&lt;li&gt;Build and save views&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Setup is straightforward:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth application in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure the redirect URLs for your AI chat client&lt;/li&gt;
&lt;li&gt;Connect using the MCP endpoint: &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt; (US) or &lt;code&gt;mcp.eu.dremio.cloud/mcp/{project_id}&lt;/code&gt; (EU)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This means a marketing manager can ask Claude &amp;quot;Show me our top 5 regions by profit margin from the Synapse sales data&amp;quot; and get accurate, governed results : without knowing SQL or having direct Synapse access.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;p&gt;Dremio provides built-in AI SQL functions that you can use directly in queries against any connected data, including Synapse:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify products based on their Synapse metadata
SELECT
  product_line,
  total_revenue,
  AI_CLASSIFY(
    &apos;Based on this revenue and growth pattern, classify the product health&apos;,
    product_line || &apos;: $&apos; || CAST(total_revenue AS VARCHAR) || &apos; revenue&apos;,
    ARRAY[&apos;Thriving&apos;, &apos;Stable&apos;, &apos;Declining&apos;, &apos;At Risk&apos;]
  ) AS product_health
FROM &amp;quot;synapse-analytics&amp;quot;.dbo.product_summary;

-- Generate summaries from Synapse data
SELECT
  region,
  AI_GENERATE(
    &apos;Write a one-sentence business summary for this regional performance&apos;,
    &apos;Region: &apos; || region || &apos;, Revenue: $&apos; || CAST(revenue AS VARCHAR) || &apos;, Growth: &apos; || CAST(yoy_growth AS VARCHAR) || &apos;%&apos;
  ) AS executive_summary
FROM &amp;quot;synapse-analytics&amp;quot;.dbo.regional_metrics;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These functions run LLM inference directly in your SQL queries, turning raw Synapse data into AI-enriched insights.&lt;/p&gt;
&lt;h2&gt;Accelerate Synapse Queries with Reflections&lt;/h2&gt;
&lt;p&gt;For queries that run repeatedly (dashboard refreshes, scheduled reports):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Build a view over your Synapse data (like &lt;code&gt;regional_performance&lt;/code&gt; above).&lt;/li&gt;
&lt;li&gt;In the Catalog, select the view and create a &lt;strong&gt;Reflection&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Choose the columns and aggregations to include.&lt;/li&gt;
&lt;li&gt;Set the refresh interval (how often Dremio re-queries Synapse to update the Reflection).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;After the Reflection is built, Dremio&apos;s query optimizer automatically routes matching queries to the Reflection. Your BI tools (Power BI, Tableau) connected via Arrow Flight or ODBC get sub-second responses from the Reflection instead of waiting for Synapse to process the query. The acceleration is completely transparent : users write the same SQL and see the same data, just faster.&lt;/p&gt;
&lt;h2&gt;When to Keep Data in Synapse vs. Migrate to Iceberg&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in Synapse:&lt;/strong&gt; Data actively consumed by Azure-native tools (Power BI with DirectQuery, Azure Machine Learning), data with complex Synapse-specific transformations, data shared through Azure Data Share.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical archive data that&apos;s rarely updated, large analytical datasets that would benefit from automated compaction and manifest optimization, datasets that need time travel (query as of any past timestamp), data that other teams access through non-Azure tools.&lt;/p&gt;
&lt;p&gt;For data that stays in Synapse, create manual Reflections with refresh schedules matching your data freshness requirements. For migrated Iceberg data, Dremio&apos;s Open Catalog provides automated compaction, time travel, and Autonomous Reflections.&lt;/p&gt;
&lt;h2&gt;Governance Across Azure Synapse and Other Sources&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) extends Synapse&apos;s Azure AD-based security to every connected source:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask revenue, cost, and margin data from specific roles. A marketing analyst sees conversion counts but not financial details.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Regional managers see only their region&apos;s data automatically across all sources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies to Synapse, PostgreSQL, S3, BigQuery, and all other sources : no per-service security configuration.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, MCP Server, and Arrow Flight clients.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio native connector : ideal for Azure-centric organizations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic data access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; adapter for SQL-based transformations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query Synapse data from their IDE. Ask Copilot &amp;quot;Show me regional profit margins from Azure Synapse&amp;quot; and get SQL generated from your semantic layer.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Azure Synapse users can extend their warehouse beyond the Azure ecosystem, reduce compute costs with Reflections, and enable AI-powered analytics across all their data sources : all through Dremio Cloud.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-azure-synapse-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your Azure Synapse workspace alongside your other data sources.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Snowflake to Dremio Cloud: Federate, Govern, and Accelerate Beyond Snowflake</title><link>https://iceberglakehouse.com/posts/2026-03-connector-snowflake/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-snowflake/</guid><description>
Snowflake is a popular cloud data warehouse known for its separation of storage and compute, near-zero maintenance, and broad ecosystem. Many organiz...</description><pubDate>Sun, 01 Mar 2026 19:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Snowflake is a popular cloud data warehouse known for its separation of storage and compute, near-zero maintenance, and broad ecosystem. Many organizations have made Snowflake their primary analytics platform. But as data ecosystems mature, limitations emerge: Snowflake credits are consumed on every query, connecting Snowflake data to non-Snowflake sources requires data sharing agreements or ETL, and running all workloads in Snowflake means paying Snowflake prices for everything : including repetitive dashboard queries and ad-hoc exploration.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects to Snowflake as a federated data source. You can query Snowflake tables directly, join them with PostgreSQL, S3, MongoDB, BigQuery, and any other connected source in a single SQL query, and accelerate repeated queries with Reflections so they don&apos;t burn Snowflake credits on every execution.&lt;/p&gt;
&lt;p&gt;Snowflake&apos;s native Iceberg Tables feature allows managing Iceberg-formatted data within Snowflake. However, this still keeps your compute costs within Snowflake&apos;s pricing model. By combining Dremio Cloud with Snowflake (and potentially Snowflake&apos;s Open Catalog for shared Iceberg access), organizations can use Snowflake for data engineering while leveraging Dremio for cost-optimized analytical serving. This hybrid approach gives you Snowflake&apos;s data engineering strengths without paying Snowflake credit rates for every analytical query.&lt;/p&gt;
&lt;p&gt;The cost concern is real: organizations regularly report that 40-60% of their Snowflake spend comes from dashboards, scheduled reports, and ad-hoc queries : workloads that are fundamentally repetitive and ideal for Reflection-based caching.&lt;/p&gt;
&lt;h2&gt;Why Snowflake Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Reduce Snowflake Credit Consumption&lt;/h3&gt;
&lt;p&gt;Every query in Snowflake consumes credits based on the warehouse size and query runtime. Dashboard queries that run every 15 minutes, analytics training sessions, ad-hoc data exploration by 50 analysts, and nightly scheduled reports all consume credits.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s Reflections create pre-computed materializations of frequently executed queries. After the initial run, matching queries are served from Dremio&apos;s cache instead of Snowflake. For organizations spending over $100K/year on Snowflake compute, routing read-heavy analytical and dashboard workloads through Dremio can reduce credit consumption by 30-70% on those workloads.&lt;/p&gt;
&lt;h3&gt;Federation Beyond Snowflake&lt;/h3&gt;
&lt;p&gt;Snowflake&apos;s data sharing works between Snowflake accounts. But what about your PostgreSQL application database, your S3 data lake, your MongoDB user profiles, or your on-premises Oracle ERP? Joining these with Snowflake data requires ETL pipelines : extracting from each source, transforming, and loading into Snowflake. Dremio queries each source in place and joins the results in its own engine. No data movement, no Snowflake ingestion costs.&lt;/p&gt;
&lt;h3&gt;Unified Governance&lt;/h3&gt;
&lt;p&gt;Snowflake has robust access controls within Snowflake. But governing data across Snowflake, PostgreSQL, S3, and MongoDB requires separate policies in each system. Dremio&apos;s Fine-Grained Access Control applies consistent column masking and row-level filtering across all connected sources from a single interface.&lt;/p&gt;
&lt;h3&gt;AI Analytics Across All Sources&lt;/h3&gt;
&lt;p&gt;Snowflake has AI/ML features within its ecosystem (Cortex). Dremio adds AI capabilities that span your entire data estate, not just Snowflake : including an AI Agent for natural language queries, an MCP Server for external AI tools, and SQL-level AI functions.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Snowflake account URL&lt;/strong&gt; (e.g., &lt;code&gt;myaccount.snowflakecomputing.com&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Username and password&lt;/strong&gt; (or OAuth/key pair authentication)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Warehouse name&lt;/strong&gt; : the compute resource Snowflake uses for queries&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database name&lt;/strong&gt; : the Snowflake database you want to connect&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; from Dremio Cloud to your Snowflake instance&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; : &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-snowflake-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect Snowflake to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Snowflake Source&lt;/h3&gt;
&lt;p&gt;In the Dremio console, click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; in the left sidebar and select &lt;strong&gt;Snowflake&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection Details&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; A descriptive identifier (e.g., &lt;code&gt;snowflake-warehouse&lt;/code&gt; or &lt;code&gt;analytics-snowflake&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Account URL:&lt;/strong&gt; Your Snowflake account URL.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Warehouse:&lt;/strong&gt; The Snowflake virtual warehouse to use for queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database:&lt;/strong&gt; The Snowflake database to connect to.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Choose from Master Credentials (username/password), OAuth, or key pair authentication.&lt;/p&gt;
&lt;h3&gt;4. Configure Advanced Settings&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Record fetch size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rows per batch from Snowflake&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maximum Idle Connections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Connection pool management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom Snowflake connection parameters&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;5. Set Reflection and Metadata Refresh&lt;/h3&gt;
&lt;p&gt;Configure how often Reflections refresh and how often Dremio checks for schema changes. Click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Query Snowflake Data from Dremio&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  product_category,
  SUM(sales_amount) AS total_sales,
  COUNT(DISTINCT customer_id) AS unique_buyers,
  ROUND(SUM(sales_amount) / COUNT(DISTINCT customer_id), 2) AS avg_spend_per_customer
FROM &amp;quot;snowflake-warehouse&amp;quot;.PUBLIC.SALES_FACT
WHERE sale_date &amp;gt;= &apos;2024-01-01&apos; AND sale_date &amp;lt; &apos;2024-07-01&apos;
GROUP BY product_category
ORDER BY total_sales DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate Snowflake with Other Sources&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join Snowflake sales with PostgreSQL reviews and S3 return data
SELECT
  sf.product_category,
  sf.total_sales,
  sf.unique_buyers,
  pg.avg_review_score,
  pg.review_count,
  s3.return_rate,
  ROUND(sf.total_sales * (1 - s3.return_rate), 2) AS net_revenue
FROM (
  SELECT product_category, SUM(sales_amount) AS total_sales, COUNT(DISTINCT customer_id) AS unique_buyers
  FROM &amp;quot;snowflake-warehouse&amp;quot;.PUBLIC.SALES_FACT
  WHERE sale_date &amp;gt;= &apos;2024-01-01&apos;
  GROUP BY product_category
) sf
LEFT JOIN &amp;quot;postgres-reviews&amp;quot;.public.product_reviews pg ON sf.product_category = pg.category
LEFT JOIN &amp;quot;s3-analytics&amp;quot;.returns.category_return_rates s3 ON sf.product_category = s3.category
ORDER BY net_revenue DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.product_health AS
SELECT
  sf.product_category,
  SUM(sf.sales_amount) AS total_revenue,
  COUNT(DISTINCT sf.customer_id) AS unique_customers,
  ROUND(SUM(sf.sales_amount) / COUNT(DISTINCT sf.customer_id), 2) AS customer_value,
  CASE
    WHEN SUM(sf.sales_amount) &amp;gt; 1000000 THEN &apos;Category Leader&apos;
    WHEN SUM(sf.sales_amount) &amp;gt; 250000 THEN &apos;Growth Category&apos;
    ELSE &apos;Emerging&apos;
  END AS category_tier
FROM &amp;quot;snowflake-warehouse&amp;quot;.PUBLIC.SALES_FACT sf
GROUP BY sf.product_category;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Click &lt;strong&gt;Edit&lt;/strong&gt; (pencil icon) in the Catalog, then &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt; to create AI-readable business context.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on Snowflake Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;Users ask questions in plain English: &amp;quot;Which product categories are growing fastest?&amp;quot; The AI Agent reads your wiki descriptions and generates accurate SQL. The semantic layer you&apos;ve built is the foundation : better descriptions mean better AI responses.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; connects external AI tools (Claude, ChatGPT) to your Dremio data with OAuth authentication:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth application in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A product manager can ask ChatGPT &amp;quot;What are our top 5 product categories by net revenue from Snowflake?&amp;quot; and get governed, accurate results.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Generate product insights with AI
SELECT
  product_category,
  total_revenue,
  customer_value,
  AI_GENERATE(
    &apos;Write a one-sentence product strategy recommendation&apos;,
    &apos;Category: &apos; || product_category || &apos;, Revenue: $&apos; || CAST(total_revenue AS VARCHAR) || &apos;, Customer Value: $&apos; || CAST(customer_value AS VARCHAR) || &apos;, Tier: &apos; || category_tier
  ) AS strategy_recommendation
FROM analytics.gold.product_health;

-- Classify product categories
SELECT
  product_category,
  AI_CLASSIFY(
    &apos;Based on these metrics, classify the investment priority&apos;,
    &apos;Revenue: $&apos; || CAST(total_revenue AS VARCHAR) || &apos;, Customers: &apos; || CAST(unique_customers AS VARCHAR),
    ARRAY[&apos;High Priority&apos;, &apos;Medium Priority&apos;, &apos;Low Priority&apos;, &apos;Divest&apos;]
  ) AS investment_priority
FROM analytics.gold.product_health;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Accelerate with Reflections&lt;/h2&gt;
&lt;p&gt;Create Reflections on frequently queried Snowflake views to offload repeated queries from Snowflake credits:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, navigate to the view you want to accelerate&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; (full dataset cache) or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt; (pre-computed SUM/COUNT/AVG)&lt;/li&gt;
&lt;li&gt;Select the columns and aggregations to include&lt;/li&gt;
&lt;li&gt;Set the &lt;strong&gt;Refresh Interval&lt;/strong&gt; : balance between data freshness and Snowflake credit consumption&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;After creation, Dremio&apos;s query optimizer automatically routes matching queries to the Reflection. Dashboard queries and scheduled reports hit the cache instead of consuming Snowflake credits. BI tools connected via Arrow Flight get sub-second response times.&lt;/p&gt;
&lt;h3&gt;Example: Dashboard Acceleration&lt;/h3&gt;
&lt;p&gt;A Tableau dashboard that refreshes every 15 minutes queries &lt;code&gt;product_health&lt;/code&gt;. Without Reflections, that&apos;s 96 Snowflake queries per day. With a Reflection that refreshes every 2 hours, Dremio serves 84 of those queries from cache : an 87.5% reduction in Snowflake credit consumption for that dashboard alone. Multiply that across 50 dashboards and the savings become significant.&lt;/p&gt;
&lt;h2&gt;Governance Across Snowflake and Other Sources&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) provides governance capabilities that work across Snowflake and every other connected source:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask sensitive customer data (PII, financial details) from specific user roles. A marketing analyst sees &lt;code&gt;customer_name&lt;/code&gt; but not &lt;code&gt;social_security_number&lt;/code&gt;. An auditor sees both.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Automatically filter data based on the querying user&apos;s role. A regional manager sees only their region&apos;s data across all sources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; The same governance rules apply whether data comes from Snowflake, PostgreSQL, S3, or any other source : no per-source policy management.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across all access methods: SQL Runner, BI tools, AI Agent, MCP Server, and Arrow Flight clients.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer compared to JDBC/ODBC. After building views over Snowflake data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Use the Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Use Dremio&apos;s ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; Use &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic data access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC with Dremio&apos;s driver&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; Use &lt;code&gt;dbt-dremio&lt;/code&gt; for SQL-based transformations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and semantic layer context.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query Snowflake data from their IDE. Ask Copilot &amp;quot;Show me product health metrics from Snowflake&amp;quot; and it generates SQL using your semantic layer : without switching to the Dremio console or Snowflake&apos;s Worksheets.&lt;/p&gt;
&lt;h2&gt;External Queries&lt;/h2&gt;
&lt;p&gt;For Snowflake-specific functions not natively supported in Dremio&apos;s SQL, use external queries:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM TABLE(
  &amp;quot;snowflake-warehouse&amp;quot;.EXTERNAL_QUERY(
    &apos;SELECT APPROX_COUNT_DISTINCT(customer_id), MEDIAN(sales_amount) FROM PUBLIC.SALES_FACT WHERE sale_date &amp;gt;= &apos;&apos;2024-01-01&apos;&apos;&apos;
  )
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;External queries pass raw SQL to Snowflake for execution, returning results through Dremio. This is useful for functions like &lt;code&gt;APPROX_COUNT_DISTINCT&lt;/code&gt;, &lt;code&gt;QUALIFY&lt;/code&gt;, or Snowflake-specific window functions.&lt;/p&gt;
&lt;h2&gt;When to Keep Data in Snowflake vs. Migrate&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in Snowflake:&lt;/strong&gt; Data consumed by Snowflake-native tools (Snowpipe, Streams, Tasks), data shared through Snowflake Data Sharing, workloads with Snowflake-specific features (materialized views, dynamic tables), datasets actively managed by Snowflake-based ETL.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical data that rarely changes, archival tables, datasets consumed primarily through non-Snowflake tools, workloads where Snowflake credit costs exceed the analytical value delivered. Migrated Iceberg tables benefit from Dremio&apos;s automatic compaction, time travel, Autonomous Reflections, and zero per-query storage costs.&lt;/p&gt;
&lt;p&gt;For data that stays in Snowflake, create manual Reflections to reduce credit consumption. For migrated Iceberg data, Dremio handles optimization automatically.&lt;/p&gt;
&lt;h2&gt;Snowflake Credit Optimization with Dremio&lt;/h2&gt;
&lt;h3&gt;Credit Consumption by Warehouse Size&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Warehouse Size&lt;/th&gt;
&lt;th&gt;Credits/Hour&lt;/th&gt;
&lt;th&gt;Dremio Reflection Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;X-Small&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Reflections serve cached queries : warehouse suspends faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Small&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Same pattern : faster auto-suspend reduces credit burn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Dashboard workloads offloaded : downsize to Small&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Interactive + scheduled workloads offloaded : significant savings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;X-Large&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;Heavy analytical workloads cached : potential 50%+ reduction&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Quantifying Credit Savings&lt;/h3&gt;
&lt;p&gt;Example calculation for a medium-sized analytics team:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Without Dremio:&lt;/strong&gt; 50 analysts + 20 dashboards consume ~$15,000/month in Snowflake credits&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;With Dremio Reflections:&lt;/strong&gt; Dashboard queries (60% of total) served from cache → ~$6,000/month savings&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Net impact:&lt;/strong&gt; $9,000/month Snowflake bill + Dremio costs, typically netting 20-40% total savings&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Snowflake Data Cloud Integration&lt;/h3&gt;
&lt;p&gt;Dremio doesn&apos;t replace Snowflake&apos;s Data Cloud capabilities : it complements them:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data Sharing:&lt;/strong&gt; Continue sharing datasets via Snowflake Data Sharing with other Snowflake accounts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Marketplace:&lt;/strong&gt; Access Snowflake Marketplace datasets alongside your own, but federate them with non-Snowflake sources through Dremio&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snowpark:&lt;/strong&gt; Continue using Snowpark for Python/Java/Scala processing within Snowflake&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio&apos;s role:&lt;/strong&gt; Federation with non-Snowflake data, AI analytics, Reflection-based BI serving, and unified governance&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Snowflake users can reduce credit consumption, federate beyond Snowflake&apos;s ecosystem, and add AI analytics : all through Dremio Cloud. The combination of Reflections (offloading repetitive dashboard and report queries), federation (joining Snowflake with PostgreSQL, S3, MongoDB, and other sources without ETL), and AI capabilities (Agent, MCP Server, SQL Functions) makes Dremio a natural complement to any Snowflake deployment.&lt;/p&gt;
&lt;p&gt;Start by connecting Snowflake to Dremio Cloud, creating Reflections on your most-queried views, and monitoring the reduction in Snowflake credit consumption. Most organizations see measurable savings within the first week as dashboard queries shift to Dremio&apos;s Reflection cache. The setup takes minutes and the ROI is immediate.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-snowflake-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your Snowflake warehouse.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Google BigQuery to Dremio Cloud: Cross-Cloud Analytics Without Data Movement</title><link>https://iceberglakehouse.com/posts/2026-03-connector-google-bigquery/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-google-bigquery/</guid><description>
Google BigQuery is Google Cloud&apos;s serverless data warehouse. If your organization uses Google Cloud Platform, BigQuery is where your analytics data, ...</description><pubDate>Sun, 01 Mar 2026 18:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Google BigQuery is Google Cloud&apos;s serverless data warehouse. If your organization uses Google Cloud Platform, BigQuery is where your analytics data, marketing attribution, Google Analytics exports, and machine learning model outputs live. BigQuery is powerful within Google&apos;s ecosystem, but it creates challenges when your data spans multiple clouds or when costs grow with usage.&lt;/p&gt;
&lt;p&gt;BigQuery&apos;s on-demand pricing charges per terabyte scanned. For organizations with large datasets queried frequently :  especially by dashboards that refresh automatically ,  this can result in monthly bills that grow unpredictably. And connecting BigQuery data to non-Google tools and other cloud providers requires data exports, cross-cloud networking, or third-party ETL platforms.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects to BigQuery and queries it alongside data from AWS, Azure, on-premises databases, and any other connected source. You get multi-cloud federation without data movement, AI-powered analytics, and cost optimization through Reflections.&lt;/p&gt;
&lt;p&gt;Data gravity is a real challenge for BigQuery users. Once data lands in BigQuery, Google&apos;s ecosystem encourages keeping everything there : Looker for BI, Vertex AI for ML, Cloud Dataflow for processing. But most enterprises aren&apos;t all-Google. They have data in AWS RDS, Azure SQL, S3 data lakes, and on-premises systems. Moving all that data into BigQuery is expensive (ingestion costs, ongoing storage) and creates vendor lock-in. Dremio&apos;s federation approach queries each source in place, avoiding the data gravity trap while still giving you unified analytics across your entire data estate.&lt;/p&gt;
&lt;h2&gt;Why BigQuery Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Control BigQuery Costs with Reflections&lt;/h3&gt;
&lt;p&gt;BigQuery&apos;s on-demand pricing charges per terabyte scanned, regardless of whether you&apos;ve run the same query before. A dashboard that refreshes every 15 minutes, querying the same 500GB table, generates substantial costs. Dremio&apos;s Reflections solve this: after the first query execution, Dremio caches the results as a pre-computed materialization. Subsequent queries that match the Reflection pattern are served from cache : no BigQuery scan, no per-TB charge.&lt;/p&gt;
&lt;p&gt;For organizations with heavy dashboard and reporting workloads, this can reduce BigQuery costs by 50-80% on those specific query patterns.&lt;/p&gt;
&lt;h3&gt;Multi-Cloud Analytics&lt;/h3&gt;
&lt;p&gt;Your Google Analytics data is in BigQuery, your application database is in PostgreSQL (running on AWS RDS), your product catalog is in SQL Server (on Azure), and your raw event logs are in Amazon S3. Without a federation layer, joining these datasets requires building ETL pipelines for each source-destination pair. Dremio eliminates this: connect all four as sources and write a single SQL query that joins across them.&lt;/p&gt;
&lt;h3&gt;Unified Governance Across Clouds&lt;/h3&gt;
&lt;p&gt;BigQuery has IAM policies and column-level security within Google Cloud. But those policies don&apos;t extend to your PostgreSQL database, S3 data lake, or Snowflake warehouse. Dremio&apos;s Fine-Grained Access Control (FGAC) applies consistent row-level security and column masking across BigQuery and every other connected source. One governance policy, everywhere.&lt;/p&gt;
&lt;h3&gt;The Semantic Layer for AI&lt;/h3&gt;
&lt;p&gt;Raw BigQuery tables have technical column names and fragmented schemas. Dremio lets you create views that consolidate and rename these into business-friendly structures, then attach wiki descriptions and labels. This semantic layer makes your BigQuery data queryable by AI tools : both Dremio&apos;s built-in AI Agent and external AI clients through the MCP Server.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Google Cloud project ID&lt;/strong&gt; : the GCP project containing your BigQuery datasets&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Service Account JSON key&lt;/strong&gt; : a GCP service account with the BigQuery Data Viewer role (or custom role with &lt;code&gt;bigquery.tables.getData&lt;/code&gt;, &lt;code&gt;bigquery.jobs.create&lt;/code&gt; permissions)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; : Dremio Cloud connects to Google Cloud APIs over HTTPS&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; : &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-google-bigquery-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect BigQuery to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the BigQuery Source&lt;/h3&gt;
&lt;p&gt;In the Dremio console, click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; in the left sidebar and select &lt;strong&gt;Google BigQuery&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection Details&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; A descriptive identifier (e.g., &lt;code&gt;bigquery-marketing&lt;/code&gt; or &lt;code&gt;gcp-analytics&lt;/code&gt;). This appears in SQL queries as the source prefix.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Project ID:&lt;/strong&gt; Your Google Cloud project ID (e.g., &lt;code&gt;my-company-analytics-123456&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Service Account Key:&lt;/strong&gt; Upload or paste the JSON key file for your service account.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Configure Advanced Settings&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Caching Enabled&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cache BigQuery metadata locally for faster schema browsing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Billing Project&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Specify which GCP project is billed for queries (important for cross-project access)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom parameters for the BigQuery connection&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;4. Set Reflection and Metadata Refresh&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reflection Refresh:&lt;/strong&gt; How often Dremio re-queries BigQuery to update cached Reflections. Balance between data freshness and BigQuery scan costs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metadata Refresh:&lt;/strong&gt; How often Dremio checks for new datasets or schema changes.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. Set Privileges and Save&lt;/h3&gt;
&lt;p&gt;Optionally restrict access, then click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Query BigQuery Data from Dremio&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query BigQuery marketing data
SELECT
  campaign_name,
  SUM(clicks) AS total_clicks,
  SUM(conversions) AS total_conversions,
  ROUND(SUM(conversions) * 100.0 / NULLIF(SUM(clicks), 0), 2) AS conversion_rate
FROM &amp;quot;bigquery-marketing&amp;quot;.analytics.campaign_metrics
WHERE date &amp;gt;= &apos;2024-01-01&apos; AND date &amp;lt; &apos;2024-07-01&apos;
GROUP BY campaign_name
ORDER BY total_conversions DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate BigQuery with Other Clouds&lt;/h2&gt;
&lt;p&gt;Join BigQuery marketing data with AWS-hosted application data and Azure revenue:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  bq.campaign_name,
  bq.total_clicks,
  bq.total_conversions,
  SUM(pg.order_total) AS attributed_revenue,
  ROUND(SUM(pg.order_total) / NULLIF(bq.total_conversions, 0), 2) AS revenue_per_conversion
FROM (
  SELECT campaign_name, user_id, SUM(clicks) AS total_clicks, SUM(conversions) AS total_conversions
  FROM &amp;quot;bigquery-marketing&amp;quot;.analytics.campaign_clicks
  WHERE date &amp;gt;= &apos;2024-01-01&apos;
  GROUP BY campaign_name, user_id
) bq
JOIN &amp;quot;postgres-orders&amp;quot;.public.orders pg
  ON bq.user_id = pg.customer_id
  AND pg.order_date &amp;gt;= &apos;2024-01-01&apos;
GROUP BY bq.campaign_name, bq.total_clicks, bq.total_conversions
ORDER BY attributed_revenue DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Three clouds, one query, zero ETL.&lt;/p&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.campaign_performance AS
SELECT
  bq.campaign_name,
  SUM(bq.clicks) AS total_clicks,
  SUM(bq.conversions) AS total_conversions,
  ROUND(SUM(bq.conversions) * 100.0 / NULLIF(SUM(bq.clicks), 0), 2) AS conversion_rate_pct,
  SUM(bq.cost) AS total_ad_spend,
  CASE
    WHEN SUM(bq.conversions) * 100.0 / NULLIF(SUM(bq.clicks), 0) &amp;gt; 5 THEN &apos;High Performer&apos;
    WHEN SUM(bq.conversions) * 100.0 / NULLIF(SUM(bq.clicks), 0) &amp;gt; 2 THEN &apos;Average&apos;
    ELSE &apos;Underperforming&apos;
  END AS campaign_grade
FROM &amp;quot;bigquery-marketing&amp;quot;.analytics.campaign_metrics bq
GROUP BY bq.campaign_name;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Navigate to the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; (pencil icon), and &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt; to create business context for AI features.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on BigQuery Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The built-in AI Agent lets users ask questions in plain English: &amp;quot;Which campaigns had the highest conversion rate this quarter?&amp;quot; The Agent reads your wiki descriptions to understand what &amp;quot;conversion rate&amp;quot; and &amp;quot;high performer&amp;quot; mean, then generates accurate SQL.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; connects Claude, ChatGPT, and other AI clients to your Dremio data. Setup:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth application in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs (e.g., &lt;code&gt;https://claude.ai/api/mcp/auth_callback&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A marketing executive can ask Claude &amp;quot;Compare our Q1 campaign performance against Q2 using the BigQuery data&amp;quot; and get governed, accurate results : no SQL required.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;p&gt;Use AI directly in queries against BigQuery data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify campaign performance with AI
SELECT
  campaign_name,
  total_clicks,
  conversion_rate_pct,
  AI_CLASSIFY(
    &apos;Based on these marketing metrics, recommend a budget action&apos;,
    &apos;Campaign: &apos; || campaign_name || &apos;, Clicks: &apos; || CAST(total_clicks AS VARCHAR) || &apos;, Conversion Rate: &apos; || CAST(conversion_rate_pct AS VARCHAR) || &apos;%&apos;,
    ARRAY[&apos;Increase Budget&apos;, &apos;Maintain Budget&apos;, &apos;Decrease Budget&apos;, &apos;Pause Campaign&apos;]
  ) AS budget_recommendation
FROM analytics.gold.campaign_performance;

-- Generate executive summaries
SELECT
  campaign_name,
  AI_GENERATE(
    &apos;Write a brief performance summary for this marketing campaign&apos;,
    &apos;Campaign: &apos; || campaign_name || &apos;, Clicks: &apos; || CAST(total_clicks AS VARCHAR) || &apos;, Conversions: &apos; || CAST(total_conversions AS VARCHAR) || &apos;, Spend: $&apos; || CAST(total_ad_spend AS VARCHAR)
  ) AS performance_summary
FROM analytics.gold.campaign_performance
WHERE campaign_grade = &apos;High Performer&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;AI_CLASSIFY&lt;/code&gt; categorizes data with AI. &lt;code&gt;AI_GENERATE&lt;/code&gt; produces narrative text. Both run inside your SQL query.&lt;/p&gt;
&lt;h2&gt;Accelerate with Reflections&lt;/h2&gt;
&lt;p&gt;For dashboard queries that run repeatedly against BigQuery:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Build a view over your BigQuery data&lt;/li&gt;
&lt;li&gt;Navigate to the view in the &lt;strong&gt;Catalog&lt;/strong&gt; and click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; (full cache) or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt; (pre-computed metrics)&lt;/li&gt;
&lt;li&gt;Select columns and set the refresh interval&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Subsequent matching queries hit the Reflection instead of scanning BigQuery. This is particularly valuable for BigQuery&apos;s on-demand pricing, where every scan costs money. A dashboard with 10 widgets refreshing every 15 minutes would generate 960 BigQuery scans per day; with Reflections refreshing hourly, Dremio serves 936 of those from cache.&lt;/p&gt;
&lt;h2&gt;Governance Across BigQuery and Other Sources&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) provides governance that works across BigQuery and every other source:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask ad spend or conversion data from specific roles. A content creator sees campaign impressions but not revenue.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Regional marketers see only campaigns in their territory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same rules apply to BigQuery, PostgreSQL, S3, and all connected sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools, AI Agent, MCP Server, and Arrow Flight clients.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Ideal for Google Cloud environments : connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic data access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; adapter for transformations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query BigQuery data from their IDE. Ask Copilot &amp;quot;Show me campaign conversion rates from BigQuery&amp;quot; and get SQL generated from your semantic layer.&lt;/p&gt;
&lt;h2&gt;When to Keep Data in BigQuery vs. Migrate&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in BigQuery:&lt;/strong&gt; Data consumed by Google-native tools (Looker, Google Data Studio, Vertex AI), data pipelines managed by Cloud Dataflow or Dataproc, datasets with BigQuery ML models, data shared via BigQuery analytics hub.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical archive data, datasets queried by non-Google tools, data that benefits from Iceberg&apos;s time travel and automated compaction, workloads where BigQuery per-TB costs exceed value. Migrated Iceberg tables get Dremio&apos;s automatic maintenance and Autonomous Reflections.&lt;/p&gt;
&lt;p&gt;For data staying in BigQuery, create manual Reflections to eliminate per-TB scan costs for repeated queries.&lt;/p&gt;
&lt;h2&gt;BigQuery Cost Optimization with Dremio&lt;/h2&gt;
&lt;h3&gt;BigQuery Pricing Models&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;How It&apos;s Priced&lt;/th&gt;
&lt;th&gt;Dremio&apos;s Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;On-Demand&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$6.25 per TB scanned&lt;/td&gt;
&lt;td&gt;Reflections eliminate repeat scans : 50-80% cost reduction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Editions (Standard/Enterprise/Enterprise Plus)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Slot reservations (autoscaling)&lt;/td&gt;
&lt;td&gt;Reflections reduce slot utilization, enabling lower commitments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flat Rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fixed slot reservations&lt;/td&gt;
&lt;td&gt;Reflections free up slots for other workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Google Analytics 4 (GA4) Integration&lt;/h3&gt;
&lt;p&gt;BigQuery is the default export destination for Google Analytics 4 data. GA4 exports create daily event tables (&lt;code&gt;events_YYYYMMDD&lt;/code&gt;) with nested schemas. Dremio handles this pattern:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query GA4 events from BigQuery through Dremio
SELECT
  event_name,
  COUNT(*) AS event_count,
  COUNT(DISTINCT user_pseudo_id) AS unique_users,
  DATE_TRUNC(&apos;day&apos;, CAST(event_timestamp AS TIMESTAMP)) AS event_day
FROM &amp;quot;bigquery-analytics&amp;quot;.analytics_12345678.events_*
WHERE event_name IN (&apos;page_view&apos;, &apos;purchase&apos;, &apos;add_to_cart&apos;)
GROUP BY 1, 4
ORDER BY event_day DESC, event_count DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By creating Reflections on GA4 views, you can serve real-time marketing dashboards without accumulating BigQuery scan costs.&lt;/p&gt;
&lt;h3&gt;Multi-Cloud Analytics Strategy&lt;/h3&gt;
&lt;p&gt;For organizations with data across Google Cloud, AWS, and Azure:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;BigQuery&lt;/strong&gt; holds your Google-native data (GA4, Google Ads, Cloud Storage exports)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;S3&lt;/strong&gt; holds your AWS data lake (application logs, IoT telemetry)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Azure Storage&lt;/strong&gt; holds your Microsoft ecosystem data (Power Platform exports, Azure services)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PostgreSQL/MySQL&lt;/strong&gt; hold operational application data&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio federates across all four clouds, applies unified governance, and serves all BI tools from a single connection. This eliminates the need for cross-cloud ETL pipelines.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;BigQuery users can break out of Google Cloud&apos;s walled garden, reduce per-TB scan costs with Reflections, and enable AI analytics across their entire data estate. Whether you&apos;re running a single BigQuery project or managing data across dozens of GCP projects alongside AWS and Azure resources, Dremio provides the federation layer that makes multi-cloud analytics practical.&lt;/p&gt;
&lt;p&gt;The combination of Reflections (eliminating repetitive per-TB charges), federation (joining BigQuery with non-Google sources without ETL), and AI capabilities (Agent, MCP Server, SQL Functions) transforms BigQuery from an isolated Google Cloud analytics tool into a connected node in your broader data ecosystem. Your marketing team asks the AI Agent questions about campaign performance and gets accurate answers drawn from BigQuery data enriched with context from your semantic layer.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-google-bigquery-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your BigQuery projects.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Amazon Redshift to Dremio Cloud: Extend Your Warehouse with Federation and AI Analytics</title><link>https://iceberglakehouse.com/posts/2026-03-connector-amazon-redshift/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-amazon-redshift/</guid><description>
Amazon Redshift is AWS&apos;s managed data warehouse, designed for petabyte-scale analytics. If your organization chose Redshift for analytical workloads,...</description><pubDate>Sun, 01 Mar 2026 17:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Amazon Redshift is AWS&apos;s managed data warehouse, designed for petabyte-scale analytics. If your organization chose Redshift for analytical workloads, you&apos;ve built data pipelines, ETL jobs, and dashboards around it. But as data ecosystems grow, Redshift&apos;s limitations become painfully clear: connecting data outside Redshift requires ETL or Redshift Spectrum (additional cost per TB scanned), sharing Redshift data with non-AWS tools means exporting to S3, and Redshift&apos;s concurrency limits constrain how many dashboards and users can query simultaneously.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects to Redshift and queries it alongside every other data source in your organization. Instead of moving all your data into Redshift, or exporting Redshift data out, Dremio federates across sources and accelerates repeated queries with Reflections so your Redshift cluster handles less load.&lt;/p&gt;
&lt;p&gt;Redshift&apos;s concurrency scaling feature helps handle burst query volumes, but it charges per-second of additional cluster time. By routing repeated dashboard queries through Dremio Reflections, you reduce the need for concurrency scaling entirely : cached results are served without any Redshift cluster involvement. This difference is particularly impactful for organizations running dozens of auto-refreshing dashboards.&lt;/p&gt;
&lt;h3&gt;Redshift Data Sharing vs. Dremio Federation&lt;/h3&gt;
&lt;p&gt;Redshift Data Sharing allows sharing data between Redshift clusters. But it only works within the Redshift ecosystem : you can&apos;t share Redshift data with Snowflake, BigQuery, or PostgreSQL through Data Sharing. Dremio&apos;s federation provides a broader solution: join Redshift data with any connected source. Data Sharing works for Redshift-to-Redshift use cases; Dremio handles everything else.&lt;/p&gt;
&lt;h3&gt;Redshift Serverless Consideration&lt;/h3&gt;
&lt;p&gt;With Redshift Serverless, you pay per RPU-second consumed. Every query, including repeated dashboard queries, consumes RPUs. Dremio Reflections eliminate RPU consumption for cached queries : a direct and measurable cost reduction. For Serverless users, the ROI from Reflections is immediately visible in the AWS billing dashboard.&lt;/p&gt;
&lt;p&gt;Redshift&apos;s RA3 instances introduced compute-storage separation using Managed Storage backed by S3. While this improved scalability, all queries still consume RA3 compute resources. Dremio provides a complementary compute layer: Reflections handle repetitive analytical workloads while RA3 focuses on the data transformations and ingestion pipelines that require Redshift&apos;s native capabilities. This architectural separation :  Redshift for data engineering, Dremio for analytics serving ,  maximizes the value of both platforms.&lt;/p&gt;
&lt;h2&gt;Why Redshift Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Extend Redshift Without Spectrum Costs&lt;/h3&gt;
&lt;p&gt;Redshift Spectrum charges per TB scanned against S3. Dremio&apos;s federation queries S3 data directly through its own engine without per-TB charges. You still get SQL joins between Redshift and S3 data : Dremio handles the federation transparently.&lt;/p&gt;
&lt;h3&gt;Reduce Redshift Cluster Costs&lt;/h3&gt;
&lt;p&gt;Redshift pricing scales with cluster size (RA3, DC2, or Serverless credits). Analytical dashboards that run the same queries repeatedly consume cluster resources on every refresh. Dremio&apos;s Reflections serve cached results for matching queries, offloading that load from Redshift. For organizations with heavy dashboard workloads, this can reduce the Redshift cluster size needed.&lt;/p&gt;
&lt;h3&gt;Multi-Warehouse Federation&lt;/h3&gt;
&lt;p&gt;Your Redshift warehouse holds sales data, but your Snowflake instance has marketing data, your BigQuery project has Google Analytics data, and your PostgreSQL database has CRM data. Dremio federates across all four in a single query.&lt;/p&gt;
&lt;h3&gt;External Queries&lt;/h3&gt;
&lt;p&gt;Dremio supports external queries against Redshift, allowing you to run Redshift-native SQL (including Redshift-specific functions like &lt;code&gt;APPROXIMATE COUNT(DISTINCT)&lt;/code&gt;, window functions, and late binding views) through Dremio when needed.&lt;/p&gt;
&lt;h3&gt;AI Analytics&lt;/h3&gt;
&lt;p&gt;Dremio&apos;s semantic layer, AI Agent, MCP Server, and AI SQL Functions add natural language querying and AI enrichment to Redshift data without building a separate BI layer.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Redshift cluster endpoint&lt;/strong&gt; (hostname) : from the Redshift console&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port&lt;/strong&gt; : default &lt;code&gt;5439&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database name&lt;/strong&gt; : your Redshift database&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Username and password&lt;/strong&gt; : Redshift database user with SELECT permissions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; : Redshift cluster must be publicly accessible, or configure VPC peering with Dremio Cloud&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; : &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-amazon-redshift-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect Redshift to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;Click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; in the Dremio console and select &lt;strong&gt;Amazon Redshift&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection Details&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; Descriptive identifier (e.g., &lt;code&gt;redshift-warehouse&lt;/code&gt; or &lt;code&gt;sales-analytics&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; Your Redshift cluster endpoint (e.g., &lt;code&gt;mycluster.xxxx.us-east-1.redshift.amazonaws.com&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port:&lt;/strong&gt; Default &lt;code&gt;5439&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database:&lt;/strong&gt; Your Redshift database name.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Master Credentials (username/password) or Secret Resource URL (AWS Secrets Manager).&lt;/p&gt;
&lt;h3&gt;4. Configure Advanced Options&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Record fetch size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rows per batch from Redshift&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maximum Idle Connections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Connection pool management&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Idle Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds before idle connections close&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Encrypt Connection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enable SSL/TLS&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;5. Set Reflection and Metadata Refresh, then Save&lt;/h3&gt;
&lt;h2&gt;Query Redshift Data&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  date_trunc(&apos;month&apos;, sale_date) AS month,
  product_category,
  SUM(revenue) AS monthly_revenue,
  COUNT(DISTINCT customer_id) AS unique_customers,
  ROUND(SUM(revenue) / COUNT(DISTINCT customer_id), 2) AS revenue_per_customer
FROM &amp;quot;redshift-warehouse&amp;quot;.public.sales
WHERE sale_date &amp;gt;= &apos;2024-01-01&apos;
GROUP BY 1, 2
ORDER BY 1, monthly_revenue DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;External Queries&lt;/h2&gt;
&lt;p&gt;Run Redshift-native SQL through Dremio when you need Redshift-specific functions:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM TABLE(
  &amp;quot;redshift-warehouse&amp;quot;.EXTERNAL_QUERY(
    &apos;SELECT TOP 100 querytxt, elapsed, starttime FROM stl_query WHERE starttime &amp;gt; GETDATE() - 7 ORDER BY elapsed DESC&apos;
  )
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate Redshift with Other Sources&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join Redshift sales with PostgreSQL CRM and S3 marketing data
SELECT
  c.customer_name,
  c.segment,
  SUM(s.revenue) AS total_revenue,
  COUNT(s.sale_id) AS total_sales,
  m.campaign_name,
  m.attribution_channel,
  ROUND(SUM(s.revenue) / NULLIF(m.campaign_spend, 0), 2) AS roas
FROM &amp;quot;postgres-crm&amp;quot;.public.customers c
JOIN &amp;quot;redshift-warehouse&amp;quot;.public.sales s ON c.customer_id = s.customer_id
LEFT JOIN &amp;quot;s3-marketing&amp;quot;.attribution.customer_campaigns m ON c.customer_id = m.customer_id
WHERE s.sale_date &amp;gt;= &apos;2024-01-01&apos;
GROUP BY c.customer_name, c.segment, m.campaign_name, m.attribution_channel, m.campaign_spend
ORDER BY total_revenue DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.sales_performance AS
SELECT
  s.product_category,
  date_trunc(&apos;month&apos;, s.sale_date) AS month,
  SUM(s.revenue) AS revenue,
  COUNT(*) AS transactions,
  COUNT(DISTINCT s.customer_id) AS unique_buyers,
  ROUND(SUM(s.revenue) / COUNT(*), 2) AS avg_transaction_value,
  CASE
    WHEN SUM(s.revenue) &amp;gt; 500000 THEN &apos;Top Performer&apos;
    WHEN SUM(s.revenue) &amp;gt; 100000 THEN &apos;Solid&apos;
    ELSE &apos;Emerging&apos;
  END AS performance_tier
FROM &amp;quot;redshift-warehouse&amp;quot;.public.sales s
GROUP BY s.product_category, date_trunc(&apos;month&apos;, s.sale_date);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; → &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on Redshift Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent lets users ask &amp;quot;What were our top performing product categories last quarter?&amp;quot; and generates accurate SQL from your semantic layer. The wiki descriptions attached to views tell the Agent what &amp;quot;top performing&amp;quot; means in your data context.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;Connect &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Claude or ChatGPT&lt;/a&gt; to your Redshift data through Dremio:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth app in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client&lt;/li&gt;
&lt;li&gt;Connect via &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A VP of Sales asks Claude &amp;quot;Compare our Q1 revenue per customer across product categories using the Redshift data&amp;quot; and gets a governed, accurate benchmark without SQL.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Generate strategic recommendations from sales data
SELECT
  product_category,
  revenue,
  performance_tier,
  AI_GENERATE(
    &apos;Write a strategic recommendation for this product category&apos;,
    &apos;Category: &apos; || product_category || &apos;, Revenue: $&apos; || CAST(revenue AS VARCHAR) || &apos;, Tier: &apos; || performance_tier || &apos;, Avg Transaction: $&apos; || CAST(avg_transaction_value AS VARCHAR)
  ) AS strategic_recommendation
FROM analytics.gold.sales_performance
WHERE month = DATE_TRUNC(&apos;month&apos;, CURRENT_DATE - INTERVAL &apos;1&apos; MONTH);

-- Classify product categories for budget allocation
SELECT
  product_category,
  AI_CLASSIFY(
    &apos;Based on this sales performance, classify the marketing budget priority&apos;,
    &apos;Revenue: $&apos; || CAST(revenue AS VARCHAR) || &apos;, Customers: &apos; || CAST(unique_buyers AS VARCHAR) || &apos;, Avg Transaction: $&apos; || CAST(avg_transaction_value AS VARCHAR),
    ARRAY[&apos;Increase Investment&apos;, &apos;Maintain Investment&apos;, &apos;Optimize Spend&apos;, &apos;Reduce Budget&apos;]
  ) AS budget_recommendation
FROM analytics.gold.sales_performance
WHERE month &amp;gt;= DATE_TRUNC(&apos;month&apos;, CURRENT_DATE - INTERVAL &apos;3&apos; MONTH);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Accelerate with Reflections&lt;/h2&gt;
&lt;p&gt;Create Reflections on Redshift views for dashboard acceleration:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the view in the &lt;strong&gt;Catalog&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; (full dataset cache) or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt; (pre-computed SUM/COUNT/AVG)&lt;/li&gt;
&lt;li&gt;Select columns and aggregations&lt;/li&gt;
&lt;li&gt;Set the &lt;strong&gt;Refresh Interval&lt;/strong&gt; : balance freshness against Redshift cluster load&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;BI tools connected via Arrow Flight get sub-second responses from Reflections instead of waiting for Redshift cluster processing. A Tableau dashboard refreshing every 15 minutes generates zero Redshift cluster load after the Reflection is built.&lt;/p&gt;
&lt;h2&gt;Governance Across Redshift and Other Sources&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) adds governance capabilities that work uniformly across Redshift and every other connected source:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask sensitive revenue data or PII from specific roles. A marketing analyst sees conversion rates but not individual customer records.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Restrict data visibility based on user roles. Regional managers see only their region&apos;s sales data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; The same governance applies to Redshift, PostgreSQL, S3, and all other sources : no per-database security configuration.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across all access methods: SQL Runner, BI tools, AI Agent, MCP Server, and Arrow Flight clients.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC for BI tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Use the Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Use Dremio&apos;s ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; Use &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic data access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; Use &lt;code&gt;dbt-dremio&lt;/code&gt; for transformation workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer : whether the underlying data comes from Redshift or any other source.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration enables developers to query Redshift data from their IDE. Ask Copilot &amp;quot;Show me sales performance by category from Redshift&amp;quot; and it generates SQL using your semantic layer, eliminating context switching between tools.&lt;/p&gt;
&lt;h2&gt;When to Keep Data in Redshift vs. Migrate&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in Redshift:&lt;/strong&gt; Data actively used by Redshift-native tools and materializations, workloads with existing Redshift-based ETL pipelines, datasets managed by Redshift&apos;s automatic table optimization (sort keys, distribution styles).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical data and archives, datasets consumed primarily by non-Redshift tools, data where Redshift cluster costs exceed analytical value. Migrated Iceberg tables benefit from Dremio&apos;s automatic compaction, time travel, Autonomous Reflections, and zero per-query storage costs.&lt;/p&gt;
&lt;p&gt;For data that stays in Redshift, create manual Reflections to reduce cluster load. For migrated Iceberg data, Dremio handles optimization automatically.&lt;/p&gt;
&lt;h2&gt;Redshift Cost Optimization with Dremio&lt;/h2&gt;
&lt;h3&gt;Redshift Pricing Models&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;How It&apos;s Priced&lt;/th&gt;
&lt;th&gt;Dremio&apos;s Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RA3 Provisioned&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-node-hour + managed storage&lt;/td&gt;
&lt;td&gt;Reflections reduce node utilization, enabling cluster downsizing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DC2 Provisioned&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-node-hour, SSD storage included&lt;/td&gt;
&lt;td&gt;Same as RA3 : lower utilization means fewer nodes needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Serverless&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per RPU-hour (compute consumed)&lt;/td&gt;
&lt;td&gt;Reflections eliminate RPU consumption for cached queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Spectrum&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per TB scanned in S3&lt;/td&gt;
&lt;td&gt;Dremio queries S3 directly without per-TB charges&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Quantifying Savings&lt;/h3&gt;
&lt;p&gt;A typical dashboard workload might include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;20 production dashboards, each refreshing every 15 minutes&lt;/li&gt;
&lt;li&gt;50+ ad-hoc queries per day from analysts&lt;/li&gt;
&lt;li&gt;Weekly scheduled reports generating 100+ queries&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With Dremio Reflections, only the Reflection refresh queries hit Redshift. If Reflections refresh hourly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dashboard queries drop from 1,920/day to 24/day (hourly Reflection refresh × 24 hours) : a &lt;strong&gt;98.7% reduction&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Ad-hoc queries matching Reflection patterns are served from cache : zero Redshift load&lt;/li&gt;
&lt;li&gt;Scheduled reports matching Reflections run instantly&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Migration Strategy&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Assess:&lt;/strong&gt; Identify Redshift tables by query frequency and size. High-frequency, read-heavy tables are prime candidates for Reflections.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accelerate:&lt;/strong&gt; Create Reflections on the 10-20 most-queried views. Monitor Redshift cluster utilization.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Right-size:&lt;/strong&gt; As utilization drops, reduce Redshift node count or switch from Provisioned to Serverless.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Migrate:&lt;/strong&gt; Move historical and archival data from Redshift to Iceberg tables. Use &lt;code&gt;CREATE TABLE ... AS SELECT&lt;/code&gt; in Dremio.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optimize:&lt;/strong&gt; Continue moving more tables as Redshift costs decrease and Dremio handles more workloads.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Redshift users can extend their warehouse with federation, reduce cluster costs with Reflections, add AI analytics, and apply unified governance across their entire data estate. Whether you&apos;re running Redshift Provisioned, Serverless, or RA3, Dremio Reflections immediately reduce compute costs by caching repetitive queries. Start by connecting your cluster and creating Reflections on your most-queried views to see immediate results.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-amazon-redshift-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your Redshift cluster.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Azure Storage to Dremio Cloud: Query Your Microsoft Data Lake with SQL and AI</title><link>https://iceberglakehouse.com/posts/2026-03-connector-azure-storage/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-azure-storage/</guid><description>
Azure Storage is Microsoft&apos;s cloud storage platform, spanning Blob Storage, Azure Data Lake Storage Gen2 (ADLS Gen2), and Azure Files. If your organi...</description><pubDate>Sun, 01 Mar 2026 16:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Azure Storage is Microsoft&apos;s cloud storage platform, spanning Blob Storage, Azure Data Lake Storage Gen2 (ADLS Gen2), and Azure Files. If your organization uses Microsoft Azure, your data lake almost certainly lives in Azure Storage : Parquet files from Azure Data Factory pipelines, CSV exports from Azure SQL Database, JSON event streams from Azure Event Hubs, and raw data from Azure IoT Hub all land in Azure Storage containers.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects directly to Azure Storage and lets you query these files in place using standard SQL. You don&apos;t need Azure Synapse Analytics (DWU-based pricing), Azure Databricks (DBU costs), or HDInsight (cluster management) to run analytical queries against your data lake. Dremio reads the data, accelerates repeated queries with Reflections, and federates Azure Storage with every other source in your data ecosystem.&lt;/p&gt;
&lt;p&gt;Many Azure customers face a fragmented analytics experience: Synapse for warehouse workloads, Databricks for data engineering, Power BI for visualization, and Azure Data Explorer for log analytics : each with its own pricing model, access control, and query interface. Dremio consolidates the analytical layer by querying Azure Storage and other Azure (or non-Azure) services from a single SQL engine with unified governance and AI capabilities. Dremio reads Parquet, CSV, JSON, Delta Lake, and Apache Iceberg table formats from Azure Blob Storage and ADLS Gen2 containers. It pushes projection and filtering into its vectorized query engine and caches frequently accessed data on local NVMe drives (Columnar Cloud Cache, or C3) for near-instantaneous repeat queries.&lt;/p&gt;
&lt;h2&gt;Why Azure Storage Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;SQL Without Azure Synapse Costs&lt;/h3&gt;
&lt;p&gt;Azure Synapse serverless SQL charges per terabyte of data processed. For large datasets queried frequently :  dashboard refreshes, ad-hoc exploration, scheduled reports ,  costs accumulate quickly. Dremio&apos;s Reflections eliminate repeat scans by caching pre-computed results. C3 caching further reduces Azure Storage API calls for frequently accessed files. Your first query scans Azure Storage; subsequent matching queries hit Dremio&apos;s cache.&lt;/p&gt;
&lt;h3&gt;Federation Beyond Azure&lt;/h3&gt;
&lt;p&gt;Your Azure data lake holds event data and ETL outputs, but your operational database is in PostgreSQL on AWS, your marketing data is in Google BigQuery, and your CRM is in Salesforce (exported to S3). Dremio federates across all three cloud providers in a single SQL query : no ADF (Azure Data Factory) pipelines needed.&lt;/p&gt;
&lt;h3&gt;Apache Iceberg Table Management&lt;/h3&gt;
&lt;p&gt;Create Iceberg tables backed by Azure Storage (or Dremio-managed storage) with full DML support (INSERT, UPDATE, DELETE, MERGE). Dremio automatically handles compaction, manifest rewriting, clustering, and vacuuming. No manual &lt;code&gt;OPTIMIZE&lt;/code&gt; jobs, no maintenance scripts.&lt;/p&gt;
&lt;h3&gt;AI on Azure Data&lt;/h3&gt;
&lt;p&gt;Dremio&apos;s AI Agent, MCP Server, and AI SQL Functions make your Azure data queryable by non-technical users and external AI tools. Build a semantic layer over your Azure files, and let AI do the heavy lifting.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Azure Storage Account&lt;/strong&gt; with Blob Storage or ADLS Gen2 enabled&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authentication:&lt;/strong&gt; Azure Active Directory OAuth 2.0, Shared Access Key, or Shared Access Signature (SAS) Token&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Container names&lt;/strong&gt; you want to query&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; from Dremio Cloud to Azure Storage endpoints&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; : &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-azure-storage-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect Azure Storage to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;Click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; in the Dremio console and select &lt;strong&gt;Azure Storage&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection Details&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; Descriptive identifier (e.g., &lt;code&gt;azure-datalake&lt;/code&gt; or &lt;code&gt;adls-analytics&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Account:&lt;/strong&gt; Your Azure Storage account name.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Choose from:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Azure AD (OAuth 2.0):&lt;/strong&gt; Most secure, uses service principal or managed identity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shared Access Key:&lt;/strong&gt; Full access to the storage account. Simpler but less granular.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SAS Token:&lt;/strong&gt; Scoped, time-limited access.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Configure Advanced Options&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Root Path&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Starting container/path&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/&lt;/code&gt; (all containers)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CTAS Format&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Default CREATE TABLE format&lt;/td&gt;
&lt;td&gt;Iceberg recommended&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Encrypt Connection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enable HTTPS&lt;/td&gt;
&lt;td&gt;On&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enable partition column inference&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extract partition keys from folder structures&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enable file status check&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Verify file existence before reads&lt;/td&gt;
&lt;td&gt;On&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;5. Set Reflection and Metadata Refresh, then Save&lt;/h3&gt;
&lt;h2&gt;Query Azure Storage Data&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query Parquet files directly
SELECT transaction_id, customer_id, amount, transaction_date
FROM &amp;quot;azure-datalake&amp;quot;.sales.&amp;quot;transactions.parquet&amp;quot;
WHERE transaction_date &amp;gt;= &apos;2024-01-01&apos; AND amount &amp;gt; 100
ORDER BY amount DESC;

-- Query partitioned data (Hive-style partitions)
SELECT region, product_category, SUM(revenue) AS total_revenue
FROM &amp;quot;azure-datalake&amp;quot;.sales.transactions
WHERE year = &apos;2024&apos; AND quarter = &apos;Q1&apos;
GROUP BY region, product_category
ORDER BY total_revenue DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate with Other Clouds&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join Azure data with AWS and Google Cloud sources
SELECT
  c.customer_name,
  c.segment,
  SUM(a.amount) AS azure_revenue,
  COUNT(s.event_id) AS aws_events,
  bq.campaign_clicks
FROM &amp;quot;postgres-crm&amp;quot;.public.customers c
LEFT JOIN &amp;quot;azure-datalake&amp;quot;.sales.transactions a ON c.customer_id = a.customer_id
LEFT JOIN &amp;quot;s3-events&amp;quot;.analytics.user_events s ON c.customer_id = s.user_id
LEFT JOIN &amp;quot;bigquery-marketing&amp;quot;.analytics.customer_clicks bq ON c.customer_id = bq.user_id
GROUP BY c.customer_name, c.segment, bq.campaign_clicks
ORDER BY azure_revenue DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Four clouds, one query.&lt;/p&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.customer_transactions AS
SELECT
  a.customer_id,
  a.transaction_date,
  a.amount,
  CASE
    WHEN a.amount &amp;gt; 1000 THEN &apos;High Value&apos;
    WHEN a.amount &amp;gt; 100 THEN &apos;Standard&apos;
    ELSE &apos;Micro&apos;
  END AS transaction_tier,
  DATE_TRUNC(&apos;month&apos;, a.transaction_date) AS transaction_month
FROM &amp;quot;azure-datalake&amp;quot;.sales.transactions a
WHERE a.transaction_date &amp;gt;= &apos;2024-01-01&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; (pencil icon) → &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on Azure Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;Ask questions in plain English: &amp;quot;What&apos;s our total revenue from high-value transactions this quarter?&amp;quot; The AI Agent reads your wiki descriptions and generates accurate SQL against your Azure data.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;Connect &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Claude or ChatGPT&lt;/a&gt; to your Azure data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth app in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;An operations team member can ask Claude &amp;quot;Show me a summary of our Azure sales data by region this month&amp;quot; : no SQL required.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify transactions with AI
SELECT
  transaction_id,
  amount,
  AI_CLASSIFY(
    &apos;Based on this transaction, classify the likely purchase category&apos;,
    &apos;Amount: $&apos; || CAST(amount AS VARCHAR) || &apos;, Date: &apos; || CAST(transaction_date AS VARCHAR),
    ARRAY[&apos;Subscription&apos;, &apos;One-Time Purchase&apos;, &apos;Refund&apos;, &apos;Upgrade&apos;]
  ) AS inferred_category
FROM &amp;quot;azure-datalake&amp;quot;.sales.transactions
WHERE transaction_date = CURRENT_DATE;

-- Generate data quality summaries
SELECT
  transaction_month,
  COUNT(*) AS total_transactions,
  AI_GENERATE(
    &apos;Write a one-sentence summary of this month data quality&apos;,
    &apos;Transactions: &apos; || CAST(COUNT(*) AS VARCHAR) || &apos;, Avg Amount: $&apos; || CAST(ROUND(AVG(amount), 2) AS VARCHAR) || &apos;, Nulls: &apos; || CAST(SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS VARCHAR)
  ) AS quality_summary
FROM analytics.gold.customer_transactions
GROUP BY transaction_month;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Create Iceberg Tables from Azure Data&lt;/h2&gt;
&lt;p&gt;Promote raw Azure files into managed Iceberg tables with full ACID transaction support:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE analytics.bronze.azure_events AS
SELECT event_type, user_id, CAST(event_timestamp AS TIMESTAMP) AS event_time, payload
FROM &amp;quot;azure-datalake&amp;quot;.events.&amp;quot;raw_events.parquet&amp;quot;
WHERE event_type IS NOT NULL;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Iceberg tables benefit from automatic compaction, time travel, results caching, and Autonomous Reflections. You can also use time travel to query historical states:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query as table existed 7 days ago
SELECT * FROM analytics.bronze.azure_events
AT TIMESTAMP &apos;2024-06-01 00:00:00&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Governance on Azure Data&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) adds governance capabilities that Azure Storage doesn&apos;t provide natively:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask PII fields (email, IP address, user ID) from specific roles. Marketing analysts see aggregated metrics but not individual user data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Automatically filter data by the querying user&apos;s role. A regional manager sees only their region&apos;s Azure data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies whether data comes from Azure Storage, PostgreSQL, BigQuery, or any other connected source.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (via Arrow Flight/ODBC), AI Agent queries, and MCP Server interactions.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Arrow Flight connector provides 10-100x faster data transfer than JDBC/ODBC. After building views over your Azure data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Use Dremio&apos;s native connector or ODBC driver : ideal for Azure-centric organizations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Use the Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; Use &lt;code&gt;pyarrow.flight&lt;/code&gt; for high-speed data access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; Use &lt;code&gt;dbt-dremio&lt;/code&gt; adapter for SQL-based transformations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query Azure data directly from their IDE. Ask Copilot &amp;quot;Show me daily transaction trends from Azure storage&amp;quot; and it generates SQL using your semantic layer : without leaving your development environment.&lt;/p&gt;
&lt;h2&gt;Reflections and C3 Caching&lt;/h2&gt;
&lt;p&gt;For frequently queried Azure Storage data, create Reflections to pre-compute results:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the view in the Catalog&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab and create a Raw or Aggregation Reflection&lt;/li&gt;
&lt;li&gt;Select columns and set the refresh interval&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;C3 (Columnar Cloud Cache) automatically caches frequently accessed file data on local NVMe drives for sub-second access. You don&apos;t configure C3 manually : it works transparently.&lt;/p&gt;
&lt;h2&gt;When to Keep Data in Azure Storage vs. Migrate to Iceberg&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep as raw files:&lt;/strong&gt; Data landing zones for Azure Data Factory, files consumed by Azure-native services (Databricks, Synapse, Azure ML), raw data in formats required by other tools.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg tables:&lt;/strong&gt; Analytical datasets consumed by SQL queries, data that benefits from ACID transactions and time travel, historical data needing snapshot management, datasets consumed by BI tools and AI agents.&lt;/p&gt;
&lt;p&gt;For raw Azure files, query through the connector and create manual Reflections. For Iceberg tables (either in Dremio&apos;s Open Catalog or external catalogs), Dremio provides automated compaction, Autonomous Reflections, and zero-maintenance performance optimization.&lt;/p&gt;
&lt;h2&gt;Azure Storage Tiers and Dremio Performance&lt;/h2&gt;
&lt;p&gt;Azure Storage offers multiple access tiers that affect query performance:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Access Latency&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Dremio Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hot&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;Highest storage, lowest access&lt;/td&gt;
&lt;td&gt;Active analytics data : best performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cool&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;Lower storage, higher access&lt;/td&gt;
&lt;td&gt;Infrequent queries : still fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cold&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;Even lower storage, higher access&lt;/td&gt;
&lt;td&gt;Archival analytics : acceptable latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Archive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hours (rehydrate required)&lt;/td&gt;
&lt;td&gt;Lowest storage, highest access&lt;/td&gt;
&lt;td&gt;Not suitable for Dremio queries : rehydrate first&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For optimal Dremio performance, keep analytical data in Hot or Cool tiers. Use Azure lifecycle management policies to automatically transition data between tiers based on last access time.&lt;/p&gt;
&lt;h2&gt;ADLS Gen2 vs. Blob Storage&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Azure Storage connector supports both Azure Data Lake Storage Gen2 (ADLS Gen2) and Azure Blob Storage:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ADLS Gen2&lt;/strong&gt; is the recommended option for analytical workloads:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hierarchical namespace enables true directory operations (faster metadata operations)&lt;/li&gt;
&lt;li&gt;Fine-grained POSIX-like permissions for directory and file-level access&lt;/li&gt;
&lt;li&gt;Optimized for large-scale analytics workloads&lt;/li&gt;
&lt;li&gt;Required for Iceberg table creation and Azure Synapse integration&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Azure Blob Storage&lt;/strong&gt; works for read-only access to existing file datasets but lacks hierarchical namespace features.&lt;/p&gt;
&lt;p&gt;When creating your Azure Storage source in Dremio, specify the storage account and container. For ADLS Gen2 accounts, Dremio automatically uses the &lt;code&gt;abfss://&lt;/code&gt; protocol for optimized access.&lt;/p&gt;
&lt;h2&gt;Azure-Specific Integration Patterns&lt;/h2&gt;
&lt;h3&gt;Azure Data Factory + Dremio&lt;/h3&gt;
&lt;p&gt;Azure Data Factory (ADF) lands data into Azure Storage containers. Dremio queries this data in place:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;ADF pipelines extract from Azure SQL, Cosmos DB, or external APIs&lt;/li&gt;
&lt;li&gt;ADF writes Parquet files to ADLS Gen2 containers&lt;/li&gt;
&lt;li&gt;Dremio queries the Parquet files via the Azure Storage connector&lt;/li&gt;
&lt;li&gt;Dremio creates Iceberg tables from the Parquet data for optimized analytics&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Azure Synapse + Dremio + Azure Storage&lt;/h3&gt;
&lt;p&gt;Connect both Azure Synapse and Azure Storage to Dremio Cloud. Dremio federates data across both:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Synapse contains summarized, modeled data&lt;/li&gt;
&lt;li&gt;Azure Storage contains raw files and Iceberg tables&lt;/li&gt;
&lt;li&gt;Dremio joins both sources in a single query&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This eliminates the need to load all Azure Storage data into Synapse, reducing Synapse DWU consumption and costs.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Azure Storage users can query their cloud data lake with SQL, federate with other sources, build a semantic layer, and enable AI analytics : all without data movement or ETL pipelines.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-azure-storage-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your Azure Storage accounts alongside your other data sources.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Amazon S3 to Dremio Cloud: Query Your Data Lake with SQL, Federation, and AI</title><link>https://iceberglakehouse.com/posts/2026-03-connector-amazon-s3/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-amazon-s3/</guid><description>
Amazon S3 is the default landing zone for data in the cloud. Log files, Parquet datasets, CSV exports, JSON events, IoT telemetry, and raw data dumps...</description><pubDate>Sun, 01 Mar 2026 15:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Amazon S3 is the default landing zone for data in the cloud. Log files, Parquet datasets, CSV exports, JSON events, IoT telemetry, and raw data dumps : it all ends up in S3 buckets. But S3 is storage, not an analytics engine. You can&apos;t run SQL against S3 natively. To query it, you need Amazon Athena (per-TB pricing), AWS Glue ETL jobs (cluster management), or a data warehouse that imports the data. All add cost, complexity, and latency.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects directly to S3 and lets you query files in place using standard SQL. Dremio reads Parquet, CSV, JSON, Delta Lake, and Apache Iceberg table formats. It pushes projection and filter operations into its vectorized query engine and caches frequently accessed data on local NVMe drives (Columnar Cloud Cache, or C3) for near-instantaneous repeat queries.&lt;/p&gt;
&lt;p&gt;For organizations with hundreds or thousands of S3 buckets accumulated over years, data lake sprawl is a major challenge. Data lands in S3 from application logs, CDC pipelines, third-party integrations, and manual uploads : often without consistent naming conventions, schemas, or documentation. Dremio provides the organizational layer: connect S3 buckets, create views that standardize column names and types, build a semantic layer with wiki descriptions, and expose clean datasets to analysts and AI tools. This turns an unstructured &amp;quot;data swamp&amp;quot; into a governed, queryable data lake.&lt;/p&gt;
&lt;h2&gt;Why S3 Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;SQL on Your Data Lake Without Athena Costs&lt;/h3&gt;
&lt;p&gt;Athena charges per terabyte of data scanned. For large datasets queried frequently :  dashboards refreshing every 15 minutes, analysts exploring data, scheduled reports ,  costs grow unpredictably. Dremio&apos;s Reflections pre-compute results so repeated queries don&apos;t re-scan S3. C3 caching further reduces S3 GET requests. You pay for Dremio compute time, not per-TB scanned.&lt;/p&gt;
&lt;h3&gt;Format Flexibility&lt;/h3&gt;
&lt;p&gt;Dremio reads Parquet, CSV, JSON, Avro, Delta Lake, and Apache Iceberg from S3. You don&apos;t need to convert everything to one format before querying. Mixed-format data lakes work out of the box.&lt;/p&gt;
&lt;h3&gt;Federation with Databases and Warehouses&lt;/h3&gt;
&lt;p&gt;Your event data is in S3, but your customer data is in PostgreSQL, your financial data is in Snowflake, and your marketing data is in BigQuery. Dremio joins across all of them in a single SQL query without copying data between systems.&lt;/p&gt;
&lt;h3&gt;Apache Iceberg Table Management&lt;/h3&gt;
&lt;p&gt;Create Iceberg tables in Dremio&apos;s Open Catalog (backed by S3 or Dremio-managed storage) with full DML support. Dremio automatically handles compaction (merging small files), manifest rewriting, clustering, and vacuuming : no manual &lt;code&gt;OPTIMIZE&lt;/code&gt; jobs needed.&lt;/p&gt;
&lt;h3&gt;AI on S3 Data&lt;/h3&gt;
&lt;p&gt;Dremio&apos;s AI Agent, MCP Server, and AI SQL Functions make your raw S3 files queryable by business users and external AI tools : no data engineering required.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;AWS Account&lt;/strong&gt; with S3 access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;IAM Role or Access Key/Secret Key&lt;/strong&gt; with &lt;code&gt;s3:GetObject&lt;/code&gt;, &lt;code&gt;s3:ListBucket&lt;/code&gt;, and &lt;code&gt;s3:GetBucketLocation&lt;/code&gt; permissions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bucket names&lt;/strong&gt; or specific paths you want to query&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; : &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-amazon-s3-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect S3 to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;Click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; and select &lt;strong&gt;Amazon S3&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; Descriptive identifier (e.g., &lt;code&gt;s3-datalake&lt;/code&gt; or &lt;code&gt;event-logs&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authentication:&lt;/strong&gt; IAM Role ARN (recommended) or Access Key/Secret Key.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Configure Advanced Options&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;When to Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Root Path&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Starting path in the bucket&lt;/td&gt;
&lt;td&gt;Restrict to subfolder: &lt;code&gt;/data/analytics/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Allowlisted Buckets&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limit which buckets appear&lt;/td&gt;
&lt;td&gt;Multi-bucket accounts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enable partition column inference&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extract partition keys from folders&lt;/td&gt;
&lt;td&gt;Hive-style partitioned data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Default CTAS Format&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CREATE TABLE format&lt;/td&gt;
&lt;td&gt;Iceberg recommended&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Encrypt Connection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enable SSL&lt;/td&gt;
&lt;td&gt;Always recommended&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Requester Pays&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;For requester-pays buckets&lt;/td&gt;
&lt;td&gt;Cross-account access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enable compatibility mode&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;S3-compatible storage&lt;/td&gt;
&lt;td&gt;MinIO, R2, etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom settings&lt;/td&gt;
&lt;td&gt;&lt;code&gt;fs.s3a.endpoint&lt;/code&gt; for non-AWS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;4. Set Reflection and Metadata Refresh, then Save&lt;/h3&gt;
&lt;h2&gt;Query S3 Data&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query Parquet files
SELECT event_type, user_id, event_timestamp, page_url
FROM &amp;quot;s3-datalake&amp;quot;.events.&amp;quot;user_events.parquet&amp;quot;
WHERE event_type = &apos;purchase&apos; AND event_timestamp &amp;gt; &apos;2024-01-01&apos;;

-- Query partitioned data (e.g., year=2024/month=01/)
SELECT region, product_category, SUM(revenue) AS total_revenue
FROM &amp;quot;s3-datalake&amp;quot;.sales.transactions
GROUP BY region, product_category
ORDER BY total_revenue DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate S3 with Other Sources&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  c.customer_name,
  c.segment,
  COUNT(e.event_id) AS s3_events,
  SUM(CASE WHEN e.event_type = &apos;purchase&apos; THEN e.revenue ELSE 0 END) AS s3_revenue,
  pg.lifetime_value AS crm_lifetime_value
FROM &amp;quot;postgres-crm&amp;quot;.public.customers c
LEFT JOIN &amp;quot;s3-datalake&amp;quot;.events.user_events e ON c.customer_id = e.user_id
LEFT JOIN &amp;quot;postgres-crm&amp;quot;.public.customer_metrics pg ON c.customer_id = pg.customer_id
GROUP BY c.customer_name, c.segment, pg.lifetime_value
ORDER BY s3_revenue DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Create Iceberg Tables from S3 Data&lt;/h2&gt;
&lt;p&gt;Promote raw S3 files into managed Iceberg tables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE analytics.bronze.clean_events AS
SELECT event_type, user_id, CAST(event_timestamp AS TIMESTAMP) AS event_time, page_url, revenue
FROM &amp;quot;s3-datalake&amp;quot;.events.&amp;quot;user_events.parquet&amp;quot;
WHERE event_type IS NOT NULL;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The Iceberg table benefits from automatic compaction, time travel, results caching, and Autonomous Reflections.&lt;/p&gt;
&lt;h2&gt;S3-Compatible Storage&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s S3 connector works with S3-compatible storage:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;MinIO:&lt;/strong&gt; Enable compatibility mode, set &lt;code&gt;fs.s3a.endpoint&lt;/code&gt; to your MinIO endpoint.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cloudflare R2:&lt;/strong&gt; Same pattern, with R2&apos;s S3-compatible endpoint.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DigitalOcean Spaces:&lt;/strong&gt; Compatibility mode + custom endpoint.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Amazon FSx for NetApp ONTAP:&lt;/strong&gt; Set the S3 Access Point alias as the root path, ensure IAM permissions include FSx-specific actions.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.event_metrics AS
SELECT
  DATE_TRUNC(&apos;day&apos;, CAST(event_timestamp AS TIMESTAMP)) AS event_date,
  event_type,
  COUNT(*) AS event_count,
  COUNT(DISTINCT user_id) AS unique_users,
  SUM(revenue) AS daily_revenue,
  CASE
    WHEN COUNT(*) &amp;gt; 10000 THEN &apos;High Activity&apos;
    WHEN COUNT(*) &amp;gt; 1000 THEN &apos;Normal Activity&apos;
    ELSE &apos;Low Activity&apos;
  END AS activity_level
FROM &amp;quot;s3-datalake&amp;quot;.events.user_events
GROUP BY 1, 2;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; → &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on S3 Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;Ask &amp;quot;What&apos;s our daily purchase revenue trend this month?&amp;quot; and the AI Agent generates SQL from your semantic layer. The wiki descriptions guide the Agent&apos;s understanding of event types and metrics.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;Connect &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Claude or ChatGPT&lt;/a&gt; to your S3 data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth app in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs&lt;/li&gt;
&lt;li&gt;Connect via &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A product analyst asks Claude &amp;quot;Analyze user engagement patterns from S3 event data this week&amp;quot; and gets governed results.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify events with AI
SELECT
  event_type,
  event_count,
  AI_CLASSIFY(
    &apos;Based on this event pattern, classify the business impact&apos;,
    &apos;Event: &apos; || event_type || &apos;, Count: &apos; || CAST(event_count AS VARCHAR) || &apos;, Revenue: $&apos; || CAST(daily_revenue AS VARCHAR),
    ARRAY[&apos;Revenue Driver&apos;, &apos;Engagement Signal&apos;, &apos;Support Indicator&apos;, &apos;Churn Signal&apos;]
  ) AS business_impact
FROM analytics.gold.event_metrics
WHERE event_date = CURRENT_DATE - INTERVAL &apos;1&apos; DAY;

-- Process unstructured data from S3
SELECT
  file[&apos;path&apos;] AS file_path,
  AI_GENERATE(
    &apos;Extract key information from this document&apos;,
    (&apos;Summarize the main topics in this file&apos;, file)
    WITH SCHEMA ROW(summary VARCHAR, category VARCHAR)
  ) AS extracted_info
FROM TABLE(LIST_FILES(&apos;@&amp;quot;s3-datalake&amp;quot;/documents/&apos;))
WHERE file[&apos;path&apos;] LIKE &apos;%.pdf&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;AI_GENERATE&lt;/code&gt; with file references can process unstructured documents (PDFs, images) stored in S3 directly in SQL queries.&lt;/p&gt;
&lt;h2&gt;Reflections and C3 Caching&lt;/h2&gt;
&lt;p&gt;For frequently queried S3 data, Dremio provides two layers of acceleration:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reflections&lt;/strong&gt; pre-compute query results. Create them on your semantic layer views:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, navigate to the view&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose Raw or Aggregation Reflections&lt;/li&gt;
&lt;li&gt;Select columns and set the refresh interval&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;C3 (Columnar Cloud Cache)&lt;/strong&gt; automatically caches frequently accessed file data on local NVMe drives. C3 works transparently : no configuration needed. When Dremio reads S3 files, it caches the columnar data locally. Subsequent reads of the same files come from NVMe instead of S3, eliminating S3 GET request costs and latency.&lt;/p&gt;
&lt;p&gt;Together, Reflections and C3 mean that frequently executed queries against S3 data run in milliseconds, not seconds.&lt;/p&gt;
&lt;h2&gt;Governance on S3 Data&lt;/h2&gt;
&lt;p&gt;S3 has bucket-level IAM policies, but no column-level masking or row-level filtering. Dremio&apos;s Fine-Grained Access Control (FGAC) adds these capabilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask PII fields (email, IP, user ID) from specific roles. Data engineers see everything; marketing analysts see aggregated metrics only.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Restrict data by user role. Regional analysts see only their region&apos;s events.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies across S3, PostgreSQL, Snowflake, and all other sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, and MCP Server.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; adapter for transformation workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot lets developers query S3 data from their IDE. Ask Copilot &amp;quot;Show me purchase event trends from S3 data this week&amp;quot; and get SQL generated using your semantic layer.&lt;/p&gt;
&lt;h2&gt;S3 Data Organization Best Practices&lt;/h2&gt;
&lt;p&gt;How you organize data in S3 directly impacts Dremio&apos;s query performance:&lt;/p&gt;
&lt;h3&gt;Partition Strategy&lt;/h3&gt;
&lt;p&gt;Hive-style partitions (&lt;code&gt;year=2024/month=01/day=15/&lt;/code&gt;) enable Dremio to skip irrelevant partitions during query planning. The right partition key depends on your query patterns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Time-based queries:&lt;/strong&gt; Partition by &lt;code&gt;year/month/day&lt;/code&gt; or &lt;code&gt;year/month&lt;/code&gt;. Dremio reads only the partitions matching your &lt;code&gt;WHERE&lt;/code&gt; clause.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Regional queries:&lt;/strong&gt; Partition by &lt;code&gt;region/date&lt;/code&gt; for multi-region datasets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mixed access:&lt;/strong&gt; Partition by the most common filter column first (e.g., &lt;code&gt;region/year/month&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Avoid over-partitioning (too many small files per partition) or under-partitioning (too few partitions with huge files). Aim for partition sizes between 128 MB and 1 GB.&lt;/p&gt;
&lt;h3&gt;File Format Recommendations&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Dremio Support&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Parquet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Structured analytics data&lt;/td&gt;
&lt;td&gt;Full support, columnar optimization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Apache Iceberg&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ACID transactions, time travel&lt;/td&gt;
&lt;td&gt;Full read/write support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Delta Lake&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Databricks ecosystem compatibility&lt;/td&gt;
&lt;td&gt;Read support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Semi-structured event data&lt;/td&gt;
&lt;td&gt;Full support, schema inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CSV&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Legacy data imports&lt;/td&gt;
&lt;td&gt;Full support, limited performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Avro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Schema-evolved event streams&lt;/td&gt;
&lt;td&gt;Read support&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For analytical workloads, convert CSV and JSON files to Parquet or Iceberg for 10-50x better query performance. Dremio can perform this conversion:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE analytics.bronze.events_optimized AS
SELECT * FROM &amp;quot;s3-datalake&amp;quot;.raw.&amp;quot;events.csv&amp;quot;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates an Iceberg table from CSV data, giving you columnar storage, automatic compaction, and time travel.&lt;/p&gt;
&lt;h3&gt;Data Lake Layers&lt;/h3&gt;
&lt;p&gt;Organize your S3 bucket with a medallion architecture:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;raw/&lt;/code&gt;&lt;/strong&gt; : Landing zone for incoming data (CSV, JSON, Parquet from external sources)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;bronze/&lt;/code&gt;&lt;/strong&gt; : Cleaned, typed versions of raw data (Iceberg tables)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;silver/&lt;/code&gt;&lt;/strong&gt; : Joined, deduplicated, enriched datasets&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;gold/&lt;/code&gt;&lt;/strong&gt; : Business-ready views and aggregations for the semantic layer&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dremio&apos;s SQL engine handles the transformations between layers using &lt;code&gt;CREATE TABLE AS SELECT&lt;/code&gt; and &lt;code&gt;MERGE&lt;/code&gt; statements : no external ETL tools needed.&lt;/p&gt;
&lt;h2&gt;When to Use S3 vs. Other Storage&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Use S3 when:&lt;/strong&gt; Your data originates in AWS, you need cost-effective long-term storage, you want to use Apache Iceberg tables, your data is in file formats (Parquet, JSON, CSV).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use managed databases when:&lt;/strong&gt; Your data requires real-time OLTP operations, your applications need row-level transactions, your data model is heavily relational.&lt;/p&gt;
&lt;p&gt;Dremio federates across both : S3 for your data lake and databases for operational data, in a single query.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Amazon S3 is the most common data lake storage layer. Dremio Cloud turns it into a queryable, federated, AI-ready analytics platform without Athena costs or data warehouse ETL. Whether your S3 data is in Parquet, CSV, JSON, or Iceberg format, Dremio reads it directly and makes it available for SQL queries, cross-source joins, and AI-powered analytics.&lt;/p&gt;
&lt;p&gt;Start by connecting your primary S3 bucket to Dremio Cloud. Create views that standardize your data into business-friendly structures, add wiki descriptions for the AI Agent, and build Reflections on frequently accessed datasets. Within hours, your S3 data lake transforms from raw file storage into a governed, AI-ready analytical platform. No infrastructure to manage and no data to move.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-amazon-s3-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your S3 buckets.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect SAP HANA to Dremio Cloud: Unlock Analytics Beyond the SAP Ecosystem</title><link>https://iceberglakehouse.com/posts/2026-03-connector-sap-hana/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-sap-hana/</guid><description>
SAP HANA is the in-memory database platform that powers SAP S/4HANA, SAP BW/4HANA, and custom enterprise applications across finance, manufacturing, ...</description><pubDate>Sun, 01 Mar 2026 14:00:00 GMT</pubDate><content:encoded>&lt;p&gt;SAP HANA is the in-memory database platform that powers SAP S/4HANA, SAP BW/4HANA, and custom enterprise applications across finance, manufacturing, logistics, and supply chain. It&apos;s fast for SAP-native analytics : real-time financial reporting, material requirements planning, and production analytics run directly on HANA&apos;s in-memory columnar engine. But SAP HANA exists in a walled garden.&lt;/p&gt;
&lt;p&gt;Connecting HANA data to non-SAP tools requires SAP Data Intelligence, SAP Business Technology Platform (BTP), or custom ABAP extractors : all of which add significant cost and complexity. Sharing HANA data with teams that don&apos;t use SAP tools (marketing running Tableau, data science using Python, operations using Power BI) means building export pipelines that duplicate data, add latency, and create governance gaps.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects directly to SAP HANA and queries it alongside your other data sources with standard SQL. No SAP-specific middleware. No data extraction. Your HANA data stays in place and joins with S3, PostgreSQL, BigQuery, Snowflake, or any other connected source in a single SQL query.&lt;/p&gt;
&lt;h2&gt;Why SAP HANA Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Break Out of the SAP Ecosystem&lt;/h3&gt;
&lt;p&gt;SAP Analytics Cloud and SAP BusinessObjects work well with HANA, but connecting HANA data to Tableau, Power BI, Looker, or Python-based analytics requires additional middleware, gateway servers, or data export. Dremio provides a vendor-neutral SQL layer that connects HANA to any BI tool via Arrow Flight (high-performance columnar data transfer) or standard ODBC/JDBC.&lt;/p&gt;
&lt;h3&gt;Cross-Platform Analytics&lt;/h3&gt;
&lt;p&gt;Your SAP data covers finance (GL accounts, AP/AR, cost centers) and supply chain (material masters, purchase orders, production orders). But your CRM data is in Salesforce (exported to S3), your support ticket data is in PostgreSQL, and your marketing attribution data is in Google BigQuery. Without a federation layer, combining these with SAP data requires building custom pipelines for each source. Dremio federates across all sources in a single query.&lt;/p&gt;
&lt;h3&gt;Reduce HANA Memory Pressure&lt;/h3&gt;
&lt;p&gt;SAP HANA licenses are tied to memory allocation : the more memory provisioned, the higher the license cost. Running analytical workloads in HANA consumes memory resources that compete with transactional OLTP operations. Dremio&apos;s Reflections offload repeated analytical queries from HANA&apos;s engine, reducing memory pressure and potentially allowing you to right-size your HANA memory allocation.&lt;/p&gt;
&lt;h3&gt;AI Analytics on SAP Data&lt;/h3&gt;
&lt;p&gt;SAP&apos;s AI capabilities (SAP Joule, embedded analytics) are tightly coupled to SAP applications. Dremio&apos;s AI Agent, MCP Server, and AI SQL Functions provide AI analytics that span SAP and non-SAP data sources , enabling cross-functional insights that SAP&apos;s tools can&apos;t deliver alone.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SAP HANA hostname or IP address&lt;/strong&gt; : the HANA server&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port number&lt;/strong&gt; : typically &lt;code&gt;30015&lt;/code&gt; for single-tenant, or &lt;code&gt;3XX15&lt;/code&gt; for multi-tenant (XX = instance number)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database name&lt;/strong&gt; (required for multi-tenant HANA systems)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Username and password&lt;/strong&gt; : HANA user with &lt;code&gt;SELECT&lt;/code&gt; privileges on the schemas and tables you want to query&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; : HANA port must be reachable from Dremio Cloud&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; : &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-sap-hana-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect SAP HANA to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;Click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; in the Dremio console and select &lt;strong&gt;SAP HANA&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection Details&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; Descriptive identifier (e.g., &lt;code&gt;sap-hana&lt;/code&gt; or &lt;code&gt;erp-analytics&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; HANA server hostname or IP.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port:&lt;/strong&gt; Default &lt;code&gt;30015&lt;/code&gt; for single-tenant.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database:&lt;/strong&gt; Required for multi-tenant HANA systems.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Master Credentials (username/password) or Secret Resource URL (AWS Secrets Manager).&lt;/p&gt;
&lt;h3&gt;4. Configure Advanced Options&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Record fetch size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rows per batch from HANA&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maximum Idle Connections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Connection pool management&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Idle Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds before idle connections close&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enable SSL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Encrypt the connection&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom JDBC parameters&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;5. Set Reflection and Metadata Refresh, then Save&lt;/h3&gt;
&lt;h2&gt;Query SAP HANA Data&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query material inventory data
SELECT material_id, material_desc, plant, stock_quantity, unit_of_measure
FROM &amp;quot;sap-hana&amp;quot;.SAPABAP1.MARD
WHERE plant = &apos;1000&apos; AND stock_quantity &amp;gt; 100
ORDER BY stock_quantity DESC;

-- Financial reporting: GL Account balances
SELECT
  gl_account,
  company_code,
  fiscal_year,
  SUM(debit_amount) AS total_debits,
  SUM(credit_amount) AS total_credits,
  SUM(debit_amount) - SUM(credit_amount) AS net_balance
FROM &amp;quot;sap-hana&amp;quot;.SAPABAP1.BSEG
WHERE fiscal_year = &apos;2024&apos;
GROUP BY gl_account, company_code, fiscal_year
ORDER BY net_balance DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate SAP with Non-SAP Sources&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join SAP material data with external supplier and demand data
SELECT
  m.material_desc,
  m.stock_quantity,
  m.plant,
  s.supplier_name,
  s.lead_time_days,
  s.unit_cost,
  d.forecasted_demand_30d,
  CASE
    WHEN m.stock_quantity &amp;lt; d.forecasted_demand_30d * 0.5 THEN &apos;Critical - Reorder Now&apos;
    WHEN m.stock_quantity &amp;lt; d.forecasted_demand_30d THEN &apos;Watch - Order Soon&apos;
    ELSE &apos;Adequate&apos;
  END AS inventory_status
FROM &amp;quot;sap-hana&amp;quot;.SAPABAP1.MARD m
JOIN &amp;quot;postgres-procurement&amp;quot;.public.suppliers s ON m.material_id = s.material_id
LEFT JOIN &amp;quot;s3-forecasting&amp;quot;.demand.material_forecasts d ON m.material_id = d.material_id AND m.plant = d.plant
WHERE s.lead_time_days &amp;lt; 14
ORDER BY s.unit_cost ASC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;SAP handles material masters, PostgreSQL has supplier details, S3 has demand forecasts : Dremio joins them all.&lt;/p&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.inventory_health AS
SELECT
  m.material_id,
  m.material_desc,
  m.plant,
  m.stock_quantity,
  CASE
    WHEN m.stock_quantity = 0 THEN &apos;Out of Stock&apos;
    WHEN m.stock_quantity &amp;lt; 50 THEN &apos;Low Stock&apos;
    WHEN m.stock_quantity &amp;lt; 200 THEN &apos;Adequate&apos;
    ELSE &apos;Overstocked&apos;
  END AS stock_status,
  ROUND(m.stock_quantity * s.unit_cost, 2) AS inventory_value_usd
FROM &amp;quot;sap-hana&amp;quot;.SAPABAP1.MARD m
LEFT JOIN &amp;quot;postgres-procurement&amp;quot;.public.suppliers s ON m.material_id = s.material_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; → &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;. Create descriptions like &amp;quot;inventory_health: One row per material-plant combination showing current stock levels, status classification, and estimated inventory value in USD.&amp;quot;&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on SAP Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent lets users ask questions about SAP data in plain English: &amp;quot;Which materials are low stock at plant 1000?&amp;quot; or &amp;quot;What&apos;s the total inventory value across all plants?&amp;quot; The Agent reads your wiki descriptions, understands SAP terminology through the semantic layer, and generates accurate SQL.&lt;/p&gt;
&lt;p&gt;This is transformative for SAP environments where only specialists know the table structures (MARD, BSEG, VBRK) and field names. The semantic layer translates SAP&apos;s technical schema into business language.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; connects Claude and ChatGPT to your SAP data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth app in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs (e.g., &lt;code&gt;https://claude.ai/api/mcp/auth_callback&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A supply chain manager asks Claude &amp;quot;Show me all critical reorder items combining SAP inventory with supplier lead times&amp;quot; and gets actionable results without knowing SAP table names.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify inventory risk with AI
SELECT
  material_desc,
  stock_quantity,
  stock_status,
  AI_CLASSIFY(
    &apos;Based on inventory levels and value, recommend a procurement action&apos;,
    &apos;Material: &apos; || material_desc || &apos;, Stock: &apos; || CAST(stock_quantity AS VARCHAR) || &apos;, Status: &apos; || stock_status || &apos;, Value: $&apos; || CAST(inventory_value_usd AS VARCHAR),
    ARRAY[&apos;Rush Order&apos;, &apos;Standard Reorder&apos;, &apos;Monitor&apos;, &apos;Liquidate Excess&apos;]
  ) AS procurement_action
FROM analytics.gold.inventory_health
WHERE stock_status IN (&apos;Out of Stock&apos;, &apos;Low Stock&apos;, &apos;Overstocked&apos;);

-- Generate supplier evaluation summaries
SELECT
  s.supplier_name,
  AI_GENERATE(
    &apos;Write a one-sentence supplier performance summary&apos;,
    &apos;Supplier: &apos; || s.supplier_name || &apos;, Lead Time: &apos; || CAST(s.lead_time_days AS VARCHAR) || &apos; days, Unit Cost: $&apos; || CAST(s.unit_cost AS VARCHAR) || &apos;, Materials Supplied: &apos; || CAST(COUNT(m.material_id) AS VARCHAR)
  ) AS performance_summary
FROM &amp;quot;postgres-procurement&amp;quot;.public.suppliers s
JOIN &amp;quot;sap-hana&amp;quot;.SAPABAP1.MARD m ON s.material_id = m.material_id
GROUP BY s.supplier_name, s.lead_time_days, s.unit_cost;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Reflections for SAP Analytics&lt;/h2&gt;
&lt;p&gt;SAP HANA is expensive to query for analytical workloads. Create Reflections on your semantic layer views to cache results:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the view in the &lt;strong&gt;Catalog&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; (full cache) or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt; (pre-computed metrics)&lt;/li&gt;
&lt;li&gt;Select columns and aggregations&lt;/li&gt;
&lt;li&gt;Set the &lt;strong&gt;Refresh Interval&lt;/strong&gt; : for SAP data that changes throughout the day, hourly is typical; for period-end data, daily or weekly&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dashboard queries from Tableau or Power BI hit the Reflection instead of HANA, reducing memory consumption and license pressure. A financial reporting dashboard that queries HANA 96 times per day (15-minute refresh) with a Reflection refreshing every 2 hours consumes HANA resources only 12 times per day : an 87.5% reduction.&lt;/p&gt;
&lt;h2&gt;Governance on SAP Data&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) adds governance that SAP&apos;s built-in security doesn&apos;t extend to non-SAP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask salary data, cost center details, or GL account balances from specific roles. A supply chain analyst sees material inventory but not financial data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Restrict data by company code, plant, or region based on the querying user&apos;s role. A plant manager sees only their plant&apos;s data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies across SAP HANA, PostgreSQL, S3, BigQuery, and all other sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, and MCP Server , ensuring consistent access control regardless of how data is queried.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Arrow Flight connector provides 10-100x faster data transfer than JDBC/ODBC. For SAP data, this eliminates the need for SAP BusinessObjects or SAP Analytics Cloud:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector : replaces SAP-specific Tableau drivers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC or native connector : no SAP Gateway needed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for data science on SAP data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; for SQL-based transformations on SAP data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query SAP data from their IDE. Ask Copilot &amp;quot;Show me low stock materials at plant 1000 from SAP&amp;quot; and get SQL generated using your semantic layer : without knowing SAP table names like MARD or BSEG.&lt;/p&gt;
&lt;h2&gt;When to Keep Data in HANA vs. Migrate&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in HANA:&lt;/strong&gt; Transactional data actively used by SAP applications (OLTP), data with SAP-specific processing (ABAP reports, CDS views, BW extractors), master data referenced by SAP transactions, data subject to SAP transport management.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical analytical data (closed fiscal periods, prior-year orders), datasets consumed by non-SAP tools, data where HANA memory cost exceeds analytical value, data needed for AI/ML workloads outside of SAP, archived transaction data that rarely changes.&lt;/p&gt;
&lt;p&gt;For data staying in HANA, create manual Reflections to offload analytical queries. For migrated Iceberg data, Dremio provides automatic compaction, time travel, Autonomous Reflections, and zero per-query license costs.&lt;/p&gt;
&lt;h2&gt;SAP Landscape Integration&lt;/h2&gt;
&lt;p&gt;SAP HANA rarely exists in isolation. Dremio helps connect the SAP landscape with non-SAP analytics:&lt;/p&gt;
&lt;h3&gt;SAP S/4HANA Integration&lt;/h3&gt;
&lt;p&gt;S/4HANA stores business-critical data in HANA tables. Dremio connects to the underlying HANA database and reads these tables directly, bypassing the need for SAP BTP, SAP Analytics Cloud, or custom OData/RFC extractors. This gives analysts SQL access to S/4HANA data :  sales orders, material documents, financial postings ,  alongside non-SAP sources.&lt;/p&gt;
&lt;h3&gt;SAP BW/4HANA Bridge&lt;/h3&gt;
&lt;p&gt;SAP BW/4HANA creates InfoProviders and ADSO tables in HANA. Dremio can query these underlying HANA tables, exposing BW-managed data to non-SAP BI tools. This is valuable for organizations consolidating from SAP Analytics Cloud and BW to a unified BI strategy.&lt;/p&gt;
&lt;h3&gt;Common SAP + Non-SAP Analytics Patterns&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SAP Data (HANA)&lt;/th&gt;
&lt;th&gt;Non-SAP Data&lt;/th&gt;
&lt;th&gt;Analytics Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sales orders (VBAK/VBAP)&lt;/td&gt;
&lt;td&gt;CRM opportunities (PostgreSQL)&lt;/td&gt;
&lt;td&gt;Pipeline-to-revenue tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Material documents (MSEG)&lt;/td&gt;
&lt;td&gt;IoT sensor data (S3)&lt;/td&gt;
&lt;td&gt;Predictive maintenance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Financial postings (BSEG)&lt;/td&gt;
&lt;td&gt;External market data (BigQuery)&lt;/td&gt;
&lt;td&gt;Financial benchmarking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Employee master (PA0001)&lt;/td&gt;
&lt;td&gt;Recruitment data (MongoDB)&lt;/td&gt;
&lt;td&gt;Workforce analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Dremio&apos;s federation engine joins SAP tables with non-SAP sources without extracting SAP data to external systems , maintaining SAP as the system of record.&lt;/p&gt;
&lt;h3&gt;SAP HANA Licensing Considerations&lt;/h3&gt;
&lt;p&gt;SAP HANA licensing is based on memory allocation (RAM). Every analytical query consumes HANA memory resources. Dremio&apos;s Reflections offload analytical workloads from HANA, potentially allowing organizations to reduce HANA memory allocations and associated licensing costs.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;SAP HANA users can extend their SAP analytics beyond the SAP ecosystem : connect HANA, join it with every other source, and enable AI-driven analytics without SAP-specific middleware or additional SAP licenses. Start with Reflections to offload analytical queries from HANA&apos;s in-memory engine, then build a semantic layer for AI Agent access.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-sap-hana-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your SAP HANA databases.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect IBM Db2 to Dremio Cloud: Modernize Mainframe Analytics with Federation and AI</title><link>https://iceberglakehouse.com/posts/2026-03-connector-ibm-db2/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-ibm-db2/</guid><description>
IBM Db2 is the relational database that powers critical applications across banking, insurance, government, healthcare, and manufacturing. For organi...</description><pubDate>Sun, 01 Mar 2026 13:00:00 GMT</pubDate><content:encoded>&lt;p&gt;IBM Db2 is the relational database that powers critical applications across banking, insurance, government, healthcare, and manufacturing. For organizations running Db2 :  particularly on IBM Z (mainframes) or IBM i ,  the database holds decades of transactional data: account balances, policy records, claim histories, manufacturing workflows, and government records. This data is enormously valuable for analytics but notoriously difficult to access outside the Db2/IBM ecosystem.&lt;/p&gt;
&lt;p&gt;Traditional approaches to Db2 analytics involve CDC tools (IBM InfoSphere DataStage, Attunity), batch exports, or data replication to a separate analytics warehouse. These approaches are expensive, complex, and create stale copies of data that diverge from the source of truth.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects directly to Db2 (Linux, UNIX, and Windows editions) and queries it alongside modern cloud sources in real time. No CDC infrastructure. No batch exports. Your Db2 data stays in place and joins with S3, PostgreSQL, Snowflake, and any other connected source in a single SQL query.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Dremio&apos;s Db2 connector supports Db2 for LUW (Linux, UNIX, and Windows). Db2 for z/OS and Db2 for i are not directly supported. If your Db2 instance runs on z/OS or IBM i, you may need to set up a Db2 Connect gateway or replicate to a Db2 LUW instance.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Why Db2 Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Access Db2 Data Without IBM Middleware&lt;/h3&gt;
&lt;p&gt;Accessing Db2 analytically typically requires IBM DataStage, IBM Cognos, or custom JDBC applications. These tools are expensive, require specialized skills, and create vendor lock-in. Dremio provides a vendor-neutral SQL layer that connects Db2 to any BI tool (Tableau, Power BI, Looker) via Arrow Flight or ODBC : no IBM middleware needed.&lt;/p&gt;
&lt;h3&gt;Federate Mainframe Data with Cloud Sources&lt;/h3&gt;
&lt;p&gt;Your core banking transactions are in Db2, but your digital banking data is in PostgreSQL on AWS, your customer support data is in MongoDB, and your regulatory data is in S3. Without a federation layer, building a 360-degree customer view requires extracting data from each source into a common warehouse. Dremio queries each in place and joins them at query time.&lt;/p&gt;
&lt;h3&gt;Incremental Modernization&lt;/h3&gt;
&lt;p&gt;Migrating off Db2 is a multi-year, high-risk project that many organizations cannot undertake. Dremio lets you modernize incrementally: start by querying Db2 through Dremio alongside cloud sources, then gradually migrate specific datasets to Iceberg tables. The migration happens over time, with Db2 continuing to serve critical transactional workloads throughout.&lt;/p&gt;
&lt;h3&gt;Cost Reduction&lt;/h3&gt;
&lt;p&gt;IBM mainframe MIPS pricing means every Db2 query consumes expensive compute capacity. Dremio&apos;s Reflections cache analytical results so repeated queries don&apos;t consume Db2 MIPS. This can meaningfully reduce mainframe compute costs for organizations with heavy analytical workloads against Db2.&lt;/p&gt;
&lt;h3&gt;AI on Legacy Data&lt;/h3&gt;
&lt;p&gt;Db2 holds decades of institutional data : customer histories, transaction patterns, risk assessments. Dremio&apos;s AI capabilities make this data accessible to non-technical users and external AI tools, unlocking insights trapped in mainframe systems.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Db2 LUW hostname or IP address&lt;/strong&gt; : the Db2 server&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port&lt;/strong&gt; : default &lt;code&gt;50000&lt;/code&gt; for Db2 LUW&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database name&lt;/strong&gt; : the Db2 database you want to connect&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Username and password&lt;/strong&gt; : Db2 user with SELECT privileges on the schemas/tables to query&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; : port 50000 reachable from Dremio Cloud&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; : &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-ibm-db2-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect Db2 to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;Click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; and select &lt;strong&gt;IBM Db2&lt;/strong&gt; from the database source types.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; Descriptive identifier (e.g., &lt;code&gt;db2-banking&lt;/code&gt; or &lt;code&gt;mainframe-core&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; Db2 server hostname or IP address.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port:&lt;/strong&gt; Default &lt;code&gt;50000&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database:&lt;/strong&gt; The Db2 database name.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Master Credentials (username/password) or Secret Resource URL (AWS Secrets Manager).&lt;/p&gt;
&lt;h3&gt;4. Configure Advanced Options&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Record fetch size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rows per batch from Db2&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maximum Idle Connections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Connection pool management&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Idle Time (s)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds before idle connections close&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Encrypt Connection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enable SSL/TLS&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom JDBC parameters&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;5. Configure Reflection Refresh and Metadata, Save&lt;/h3&gt;
&lt;h2&gt;Query Db2 Data&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query core banking accounts
SELECT
  account_id,
  customer_id,
  account_type,
  current_balance,
  last_transaction_date
FROM &amp;quot;db2-banking&amp;quot;.BANK.ACCOUNTS
WHERE account_type = &apos;SAVINGS&apos; AND current_balance &amp;gt; 10000
ORDER BY current_balance DESC;

-- Transaction analysis
SELECT
  account_type,
  DATE_TRUNC(&apos;month&apos;, transaction_date) AS month,
  COUNT(*) AS transaction_count,
  SUM(transaction_amount) AS total_amount,
  AVG(transaction_amount) AS avg_amount
FROM &amp;quot;db2-banking&amp;quot;.BANK.TRANSACTIONS
WHERE transaction_date &amp;gt;= &apos;2024-01-01&apos;
GROUP BY account_type, DATE_TRUNC(&apos;month&apos;, transaction_date)
ORDER BY 1, 2;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate Db2 with Cloud Sources&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join Db2 core banking with PostgreSQL digital banking and S3 support data
SELECT
  a.account_id,
  a.current_balance,
  pg.last_login_date,
  pg.mobile_transactions_30d,
  s3.support_tickets_open,
  CASE
    WHEN a.current_balance &amp;gt; 100000 AND pg.mobile_transactions_30d &amp;gt; 10 THEN &apos;High Value - Digitally Active&apos;
    WHEN a.current_balance &amp;gt; 100000 THEN &apos;High Value - Branch Preferred&apos;
    WHEN pg.mobile_transactions_30d &amp;gt; 20 THEN &apos;Digital Native&apos;
    ELSE &apos;Standard&apos;
  END AS customer_segment
FROM &amp;quot;db2-banking&amp;quot;.BANK.ACCOUNTS a
LEFT JOIN &amp;quot;postgres-digital&amp;quot;.public.customer_activity pg ON a.customer_id = pg.customer_id
LEFT JOIN &amp;quot;s3-support&amp;quot;.tickets.customer_tickets s3 ON a.customer_id = s3.customer_id
ORDER BY a.current_balance DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Mainframe banking data joins with cloud application data in a single query : no CDC, no data extraction.&lt;/p&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.customer_banking360 AS
SELECT
  a.customer_id,
  a.account_type,
  a.current_balance,
  pg.customer_name,
  pg.email,
  CASE
    WHEN a.current_balance &amp;gt; 250000 THEN &apos;Private Banking&apos;
    WHEN a.current_balance &amp;gt; 50000 THEN &apos;Premium&apos;
    WHEN a.current_balance &amp;gt; 10000 THEN &apos;Standard&apos;
    ELSE &apos;Basic&apos;
  END AS service_tier,
  DATEDIFF(DAY, a.last_transaction_date, CURRENT_DATE) AS days_since_last_transaction
FROM &amp;quot;db2-banking&amp;quot;.BANK.ACCOUNTS a
LEFT JOIN &amp;quot;postgres-digital&amp;quot;.public.customers pg ON a.customer_id = pg.customer_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; → &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;. Create descriptions like: &amp;quot;customer_banking360: Combines mainframe core banking data with digital channel activity to provide a complete customer view for relationship management.&amp;quot;&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on Db2 Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent transforms access to mainframe data. Instead of needing a Db2 DBA to write queries against complex schemas, a relationship manager asks &amp;quot;Show me all Private Banking customers who haven&apos;t transacted in 30 days&amp;quot; and gets accurate results from the semantic layer. The Agent reads your wiki descriptions to understand what &amp;quot;Private Banking&amp;quot; (balance &amp;gt; $250K) and &amp;quot;days_since_last_transaction&amp;quot; mean.&lt;/p&gt;
&lt;p&gt;This democratizes access to decades of mainframe data that was previously accessible only through COBOL reports or specialized IBM tools.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;Connect &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Claude or ChatGPT&lt;/a&gt; to your Db2 data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth app in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client&lt;/li&gt;
&lt;li&gt;Connect via &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A compliance officer asks Claude &amp;quot;Show me all accounts with balances over $100K and no transactions in 60 days for our dormancy review&amp;quot; and gets a governed, accurate report from Db2 : without knowing Db2 table structures.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify account risk with AI
SELECT
  customer_id,
  service_tier,
  current_balance,
  days_since_last_transaction,
  AI_CLASSIFY(
    &apos;Based on these banking patterns, classify the account dormancy risk&apos;,
    &apos;Tier: &apos; || service_tier || &apos;, Balance: $&apos; || CAST(current_balance AS VARCHAR) || &apos;, Days Inactive: &apos; || CAST(days_since_last_transaction AS VARCHAR),
    ARRAY[&apos;Active&apos;, &apos;At Risk&apos;, &apos;Potentially Dormant&apos;, &apos;Dormant&apos;]
  ) AS dormancy_risk
FROM analytics.gold.customer_banking360
WHERE days_since_last_transaction &amp;gt; 30;

-- Generate relationship manager talking points
SELECT
  customer_name,
  service_tier,
  AI_GENERATE(
    &apos;Write a one-sentence talking point for a relationship manager reaching out to this customer&apos;,
    &apos;Customer: &apos; || customer_name || &apos;, Tier: &apos; || service_tier || &apos;, Balance: $&apos; || CAST(current_balance AS VARCHAR) || &apos;, Inactive Days: &apos; || CAST(days_since_last_transaction AS VARCHAR)
  ) AS outreach_talking_point
FROM analytics.gold.customer_banking360
WHERE service_tier = &apos;Private Banking&apos; AND days_since_last_transaction &amp;gt; 14;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Reflections for Mainframe Cost Reduction&lt;/h2&gt;
&lt;p&gt;Every query against Db2 on a mainframe consumes MIPS. Create Reflections to cache frequently accessed analytics:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the view in the &lt;strong&gt;Catalog&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Select columns and set the &lt;strong&gt;Refresh Interval&lt;/strong&gt; : hourly for active accounts, daily for historical analysis&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dashboard and reporting queries hit Reflections instead of Db2, significantly reducing mainframe compute consumption. A compliance dashboard that refreshes every 15 minutes generates zero Db2 MIPS after the Reflection is built.&lt;/p&gt;
&lt;h2&gt;Governance on Db2 Data&lt;/h2&gt;
&lt;p&gt;Banking, insurance, and government organizations have strict data governance requirements. Dremio&apos;s Fine-Grained Access Control (FGAC) extends Db2 security to every connected source:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask account balances, SSNs, and transaction amounts from specific roles. A marketing analyst sees customer segments but not financial data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Branch-level access control : a branch manager sees only their branch&apos;s accounts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies across Db2, PostgreSQL, S3, and all other connected sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, and MCP Server : meeting regulatory requirements for data access control.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector for direct Arrow Flight access to mainframe data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC driver or native connector : no IBM middleware&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic access to Db2 data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; for SQL-based transformations on Db2 data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query Db2 data from their IDE. Ask Copilot &amp;quot;Show me dormant high-value accounts from Db2&amp;quot; and get SQL generated using your semantic layer : without knowing Db2 table schemas or COBOL naming conventions.&lt;/p&gt;
&lt;h2&gt;When to Keep Data in Db2 vs. Migrate&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in Db2:&lt;/strong&gt; Active transactional data for applications, data with COBOL program dependencies, regulatory data that must maintain system of record status, data subject to mainframe-specific compliance requirements.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical transaction archives (closed accounts, prior fiscal years), data consumed by non-mainframe tools, datasets where mainframe MIPS cost exceeds analytical value, archived data for long-term retention.&lt;/p&gt;
&lt;p&gt;For data staying in Db2, create manual Reflections to reduce MIPS consumption. For migrated Iceberg data, Dremio provides automatic compaction, time travel, Autonomous Reflections, and dramatically lower storage costs.&lt;/p&gt;
&lt;h2&gt;Db2 Character Encoding and Data Types&lt;/h2&gt;
&lt;p&gt;Db2 uses EBCDIC encoding on mainframes and ASCII/UTF-8 on LUW platforms. When connecting through Dremio:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;EBCDIC to UTF-8:&lt;/strong&gt; Db2 for LUW handles character conversion automatically : Dremio receives standard Unicode data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GRAPHIC/VARGRAPHIC:&lt;/strong&gt; Double-byte character columns map to VARCHAR in Dremio&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DECIMAL/NUMERIC:&lt;/strong&gt; Db2&apos;s fixed-point types map to Dremio&apos;s DECIMAL with matching precision/scale&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DATE/TIME/TIMESTAMP:&lt;/strong&gt; Standard mapping : Db2 timestamps map to Dremio TIMESTAMP&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Regulatory Compliance Patterns&lt;/h2&gt;
&lt;p&gt;Banking, insurance, and government organizations have strict data retention and access requirements. Dremio addresses these:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;th&gt;Dremio Feature&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data residency&lt;/td&gt;
&lt;td&gt;Query data in place : no cross-border data movement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Access auditing&lt;/td&gt;
&lt;td&gt;Query logs track who queried what data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Column-level security&lt;/td&gt;
&lt;td&gt;FGAC column masking hides sensitive fields&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Row-level security&lt;/td&gt;
&lt;td&gt;FGAC row filtering restricts data by user role&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data retention&lt;/td&gt;
&lt;td&gt;Time travel on Iceberg tables provides point-in-time access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Mainframe Modernization Roadmap&lt;/h2&gt;
&lt;p&gt;Use Dremio as the bridge in a multi-year mainframe modernization:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Phase 1 (Months 1-3):&lt;/strong&gt; Connect Db2 to Dremio Cloud. Create Reflections to offload analytical queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 2 (Months 4-6):&lt;/strong&gt; Build a semantic layer over Db2 data. Enable AI Agent and MCP Server for business users.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 3 (Months 7-12):&lt;/strong&gt; Identify high-value datasets for migration to Iceberg. Use &lt;code&gt;CREATE TABLE AS SELECT&lt;/code&gt; to migrate.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 4 (Year 2+):&lt;/strong&gt; Gradually migrate remaining datasets as mainframe contracts renew. Db2 focus narrows to core OLTP.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Throughout the process, users experience no disruption : they continue using the same semantic layer views. Only the underlying data sources change.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Db2 users can modernize analytics without migrating off the mainframe : federate, govern, accelerate, and AI-enable decades of institutional data through Dremio Cloud. Start with Reflections to offload analytical queries from Db2, then progressively build a semantic layer that makes legacy data accessible to modern AI tools and business users.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-ibm-db2-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your Db2 databases.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Microsoft SQL Server to Dremio Cloud: Federate Enterprise Data Without ETL</title><link>https://iceberglakehouse.com/posts/2026-03-connector-microsoft-sql-server/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-microsoft-sql-server/</guid><description>
Microsoft SQL Server is one of the most widely deployed enterprise databases in the world. ERP systems, CRM platforms, financial applications, and cu...</description><pubDate>Sun, 01 Mar 2026 12:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Microsoft SQL Server is one of the most widely deployed enterprise databases in the world. ERP systems, CRM platforms, financial applications, and custom business applications run on SQL Server across on-premises data centers and Azure cloud deployments. But connecting SQL Server data to a modern analytics platform typically requires building ETL pipelines, managing SSIS packages, or purchasing additional SQL Server Enterprise licenses for analytics workloads.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects directly to SQL Server and queries it alongside S3, PostgreSQL, Snowflake, BigQuery, MongoDB, and every other connected source in a single SQL query. You don&apos;t need to extract data from SQL Server, build staging tables, or manage nightly ETL jobs. Dremio reads SQL Server in place, applies governance, and accelerates repeated queries with Reflections.&lt;/p&gt;
&lt;p&gt;SQL Server licensing is notoriously expensive : Enterprise edition costs tens of thousands of dollars per core. Running analytical queries directly against production SQL Server instances consumes CPU capacity that&apos;s licensed for transactional workloads. Dremio&apos;s Reflections cache analytical results, offloading read-heavy queries from SQL Server and potentially allowing organizations to reduce their SQL Server core count or downgrade from Enterprise to Standard edition.&lt;/p&gt;
&lt;h2&gt;Why SQL Server Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Escape Linked Server Limitations&lt;/h3&gt;
&lt;p&gt;SQL Server&apos;s linked servers provide basic federation, but they&apos;re limited: poor cross-platform support (try linking to MongoDB or BigQuery), no query optimization across links, no governance layer, and performance degrades with large result sets. Dremio&apos;s federation engine is purpose-built for cross-source queries : it pushes predicates to each source, optimizes join strategies, and handles large-scale data movement efficiently.&lt;/p&gt;
&lt;h3&gt;Reduce SQL Server License Costs&lt;/h3&gt;
&lt;p&gt;SQL Server Enterprise licensing is expensive : especially when analytical workloads compete with transactional OLTP operations for CPU and memory. Dremio&apos;s Reflections offload repeated analytical queries from SQL Server: dashboard refreshes, scheduled reports, and ad-hoc exploration hit cached Reflections instead of SQL Server. This can reduce the SQL Server resources dedicated to analytics, potentially allowing you to downgrade from Enterprise to Standard edition or reduce core counts.&lt;/p&gt;
&lt;h3&gt;Multi-Cloud, Multi-Database Analytics&lt;/h3&gt;
&lt;p&gt;Your SQL Server holds ERP data, but your data lake is on S3, your marketing data is in Google BigQuery, and your modern applications use PostgreSQL. Without Dremio, combining these requires SSIS packages, Azure Data Factory, or custom ETL for each source. Dremio queries all of them in a single SQL statement.&lt;/p&gt;
&lt;h3&gt;Unified Governance Beyond Windows&lt;/h3&gt;
&lt;p&gt;SQL Server has Windows Authentication and SQL Logins, but these don&apos;t apply to your S3 data lake, BigQuery, or PostgreSQL. Dremio&apos;s Fine-Grained Access Control applies column masking and row-level filtering consistently across SQL Server and every other connected source.&lt;/p&gt;
&lt;h3&gt;AI Analytics on Enterprise Data&lt;/h3&gt;
&lt;p&gt;SQL Server stores decades of business data : financial records, customer histories, inventory movements. Dremio&apos;s AI Agent, MCP Server, and AI SQL Functions make that historical data queryable by natural language and enrichable by AI, unlocking insights that would otherwise require a data analyst with deep institutional knowledge.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SQL Server hostname or IP address&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port&lt;/strong&gt; : default &lt;code&gt;1433&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database name&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Username and password&lt;/strong&gt; (SQL Authentication) : user needs SELECT permissions on target schemas and tables&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; : port 1433 must be reachable from Dremio Cloud. For on-premises SQL Server, configure VPN or firewall rules&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; : &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-sqlserver-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect SQL Server to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;Click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; in the Dremio console and select &lt;strong&gt;Microsoft SQL Server&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; Descriptive identifier (e.g., &lt;code&gt;sqlserver-erp&lt;/code&gt; or &lt;code&gt;production-db&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; SQL Server hostname or IP address.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port:&lt;/strong&gt; Default &lt;code&gt;1433&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database:&lt;/strong&gt; The database name to connect to.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Enter SQL Authentication credentials (username/password) or use Secret Resource URL for centralized credential management via AWS Secrets Manager.&lt;/p&gt;
&lt;h3&gt;4. Configure Advanced Options&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Record fetch size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rows per batch from SQL Server&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maximum Idle Connections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Connection pool management&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Idle Time (s)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds before idle connections close&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Encrypt Connection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enable SSL/TLS&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SSL Verification&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Verify SSL server certificate&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hostname in Certificate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Expected hostname in SSL certificate&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom JDBC parameters&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;5. Configure Reflection Refresh and Metadata, Save&lt;/h3&gt;
&lt;h2&gt;Query SQL Server Data&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query ERP inventory data
SELECT
  product_id,
  product_name,
  warehouse_location,
  quantity_on_hand,
  reorder_point
FROM &amp;quot;sqlserver-erp&amp;quot;.dbo.products
WHERE quantity_on_hand &amp;lt; reorder_point
ORDER BY quantity_on_hand ASC;

-- Financial reporting
SELECT
  department_code,
  account_category,
  fiscal_quarter,
  SUM(actual_amount) AS actual_spend,
  SUM(budget_amount) AS budgeted,
  ROUND((SUM(actual_amount) - SUM(budget_amount)) / NULLIF(SUM(budget_amount), 0) * 100, 1) AS variance_pct
FROM &amp;quot;sqlserver-erp&amp;quot;.finance.budget_actuals
WHERE fiscal_year = 2024
GROUP BY department_code, account_category, fiscal_quarter
ORDER BY ABS(SUM(actual_amount) - SUM(budget_amount)) DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate SQL Server with Other Sources&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join SQL Server ERP with PostgreSQL CRM and S3 marketing data
SELECT
  ss.product_name,
  ss.quantity_on_hand,
  pg.total_orders,
  pg.avg_order_value,
  s3.click_through_rate,
  CASE
    WHEN pg.total_orders &amp;gt; 100 AND ss.quantity_on_hand &amp;lt; 50 THEN &apos;Reorder - High Demand&apos;
    WHEN pg.total_orders &amp;lt; 10 AND ss.quantity_on_hand &amp;gt; 500 THEN &apos;Overstock - Reduce&apos;
    ELSE &apos;Normal&apos;
  END AS inventory_action
FROM &amp;quot;sqlserver-erp&amp;quot;.dbo.products ss
LEFT JOIN (
  SELECT product_id, COUNT(*) AS total_orders, AVG(order_value) AS avg_order_value
  FROM &amp;quot;postgres-crm&amp;quot;.public.orders
  WHERE order_date &amp;gt;= &apos;2024-01-01&apos;
  GROUP BY product_id
) pg ON ss.product_id = pg.product_id
LEFT JOIN &amp;quot;s3-marketing&amp;quot;.analytics.product_clicks s3 ON ss.product_id = s3.product_id
WHERE ss.quantity_on_hand &amp;lt; ss.reorder_point OR pg.total_orders &amp;gt; 100
ORDER BY pg.total_orders DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.inventory_management AS
SELECT
  p.product_id,
  p.product_name,
  p.warehouse_location,
  p.quantity_on_hand,
  p.reorder_point,
  CASE
    WHEN p.quantity_on_hand = 0 THEN &apos;Out of Stock&apos;
    WHEN p.quantity_on_hand &amp;lt; p.reorder_point * 0.5 THEN &apos;Critical&apos;
    WHEN p.quantity_on_hand &amp;lt; p.reorder_point THEN &apos;Low&apos;
    ELSE &apos;Adequate&apos;
  END AS stock_status,
  ROUND(p.quantity_on_hand * p.unit_cost, 2) AS inventory_value
FROM &amp;quot;sqlserver-erp&amp;quot;.dbo.products p;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; → &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;. Create descriptions like: &amp;quot;inventory_management: One row per product showing current stock levels, stock status classification, and estimated inventory value. Use this view to monitor reorder needs.&amp;quot;&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on SQL Server Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent lets operations managers ask &amp;quot;Which products are critically low at the Chicago warehouse?&amp;quot; without writing SQL. The Agent reads your wiki descriptions, understands &amp;quot;Critical&amp;quot; means stock below 50% of reorder point, and generates accurate queries. This is transformative for SQL Server environments where tribal knowledge about table schemas and column meanings lives in senior employees&apos; heads.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;Connect &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Claude or ChatGPT&lt;/a&gt; to your SQL Server data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth app in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client&lt;/li&gt;
&lt;li&gt;Connect via &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A warehouse manager asks Claude &amp;quot;Show me all products that need reordering, sorted by how critical the shortage is&amp;quot; and gets actionable results from the semantic layer over SQL Server : no SQL, no SSMS.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Generate reorder recommendations with AI
SELECT
  product_name,
  stock_status,
  quantity_on_hand,
  reorder_point,
  AI_GENERATE(
    &apos;Write a one-sentence reorder recommendation based on inventory status&apos;,
    &apos;Product: &apos; || product_name || &apos;, Stock: &apos; || CAST(quantity_on_hand AS VARCHAR) || &apos;, Reorder Point: &apos; || CAST(reorder_point AS VARCHAR) || &apos;, Status: &apos; || stock_status
  ) AS reorder_recommendation
FROM analytics.gold.inventory_management
WHERE stock_status IN (&apos;Critical&apos;, &apos;Out of Stock&apos;);

-- Classify financial variances
SELECT
  department_code,
  variance_pct,
  AI_CLASSIFY(
    &apos;Based on the budget variance, classify the financial risk level&apos;,
    &apos;Department: &apos; || department_code || &apos;, Variance: &apos; || CAST(variance_pct AS VARCHAR) || &apos;%&apos;,
    ARRAY[&apos;On Track&apos;, &apos;Minor Variance&apos;, &apos;Significant Overspend&apos;, &apos;Critical Overspend&apos;]
  ) AS financial_risk
FROM (
  SELECT department_code, ROUND((SUM(actual_amount) - SUM(budget_amount)) / NULLIF(SUM(budget_amount), 0) * 100, 1) AS variance_pct
  FROM &amp;quot;sqlserver-erp&amp;quot;.finance.budget_actuals
  WHERE fiscal_year = 2024
  GROUP BY department_code
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Reflections for Performance&lt;/h2&gt;
&lt;p&gt;SQL Server Enterprise charges per-core licensing. Offloading analytical queries to Reflections reduces compute pressure on SQL Server cores:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the view in the &lt;strong&gt;Catalog&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Select columns and set the &lt;strong&gt;Refresh Interval&lt;/strong&gt; : for ERP data updated throughout the day, hourly; for financial data, match to reporting cycles&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;BI tools get sub-second response times from Reflections. SQL Server focuses on transactional OLTP workloads. A financial dashboard refreshing every 15 minutes generates zero SQL Server load after the Reflection is built.&lt;/p&gt;
&lt;h2&gt;Governance Across SQL Server and Other Sources&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) extends SQL Server security to every connected source:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask financial data, salary details, or PII from specific roles. A warehouse manager sees inventory levels but not cost data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Regional managers see only their region&apos;s data. Department heads see only their department.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies across SQL Server, PostgreSQL, S3, and all other sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, and MCP Server.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC for SQL Server data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio native connector : ideal for Microsoft-centric organizations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic data access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; adapter for transformation workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query SQL Server data from their IDE. Ask Copilot &amp;quot;Show me products below reorder point at the Chicago warehouse&amp;quot; and get SQL generated from your semantic layer.&lt;/p&gt;
&lt;h2&gt;When to Keep Data in SQL Server vs. Migrate&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in SQL Server:&lt;/strong&gt; Transactional data for active applications, data with stored procedures and triggers, operational systems that depend on SQL Server features (SSRS, SSIS, linked servers).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical records and archives, reporting data, data consumed by non-SQL-Server tools, datasets where SQL Server per-core licensing cost exceeds analytical value. Migrated Iceberg tables get Dremio&apos;s automatic compaction, time travel, and Autonomous Reflections.&lt;/p&gt;
&lt;p&gt;For data staying in SQL Server, create manual Reflections. For migrated Iceberg data, Dremio handles optimization automatically.&lt;/p&gt;
&lt;h2&gt;Query Pushdown to SQL Server&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s federation engine optimizes cross-source queries by pushing operations to SQL Server whenever possible:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Filter pushdown:&lt;/strong&gt; &lt;code&gt;WHERE&lt;/code&gt; clauses are pushed to SQL Server, so only matching rows are transferred&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Projection pushdown:&lt;/strong&gt; Only the columns referenced in your query are requested from SQL Server&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aggregate pushdown:&lt;/strong&gt; &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt; operations can be executed on SQL Server when the full query allows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This minimizes data transfer between SQL Server and Dremio, reducing network traffic and improving query performance.&lt;/p&gt;
&lt;h2&gt;ERP Integration Patterns&lt;/h2&gt;
&lt;p&gt;SQL Server frequently powers ERP systems (Microsoft Dynamics, custom internal ERPs). Dremio enables analytics that combine ERP data with external sources:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL Server (ERP)&lt;/th&gt;
&lt;th&gt;External Source&lt;/th&gt;
&lt;th&gt;Analytics Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Inventory levels&lt;/td&gt;
&lt;td&gt;S3 demand forecasts&lt;/td&gt;
&lt;td&gt;Automated reorder predictions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Purchase orders&lt;/td&gt;
&lt;td&gt;PostgreSQL supplier data&lt;/td&gt;
&lt;td&gt;Supplier performance scoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Financial actuals&lt;/td&gt;
&lt;td&gt;BigQuery market data&lt;/td&gt;
&lt;td&gt;Revenue benchmarking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Customer accounts&lt;/td&gt;
&lt;td&gt;MongoDB support tickets&lt;/td&gt;
&lt;td&gt;Churn risk assessment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;These cross-source analytics are impossible with SQL Server alone and traditionally require SQL Server Integration Services (SSIS) to build ETL pipelines. Dremio eliminates this requirement entirely.&lt;/p&gt;
&lt;h2&gt;SQL Server Always Encrypted and SSL&lt;/h2&gt;
&lt;p&gt;Dremio supports SSL/TLS connections to SQL Server. For databases using Always Encrypted columns, be aware that Dremio reads the encrypted values : decryption requires the Column Master Key, which is managed by the application. For analytical workloads, consider creating views on the SQL Server side that expose non-encrypted analytical summaries.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;SQL Server users can federate enterprise data, reduce license costs, deploy AI analytics, and apply unified governance across their entire data estate. Start by connecting your primary SQL Server instance to Dremio Cloud. Create Reflections on your most-queried reporting tables to offload analytical queries from SQL Server immediately, reducing CPU load and freeing licensed cores for transactional workloads.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-sqlserver-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your SQL Server instances.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Extract Structured Data from Text with Dremio&apos;s AI_GENERATE Function</title><link>https://iceberglakehouse.com/posts/2026-03-ai-ai-generate/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-ai-ai-generate/</guid><description>
Unstructured text is the most underused data in most organizations. Customer emails sit in inboxes. Contract notes live in text fields. Meeting summa...</description><pubDate>Sun, 01 Mar 2026 11:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Unstructured text is the most underused data in most organizations. Customer emails sit in inboxes. Contract notes live in text fields. Meeting summaries exist as free-text columns in CRM systems. The information is there, but it&apos;s locked inside prose that SQL can&apos;t filter, join, or aggregate.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s &lt;code&gt;AI_GENERATE&lt;/code&gt; function breaks that lock. It sends unstructured text to an LLM and returns structured rows with typed columns. You define the output schema directly in SQL, and the LLM extracts the fields you specify. An email becomes a row with &lt;code&gt;sender&lt;/code&gt;, &lt;code&gt;subject&lt;/code&gt;, &lt;code&gt;priority&lt;/code&gt;, and &lt;code&gt;action_items&lt;/code&gt; columns. A contract note becomes a row with &lt;code&gt;party_name&lt;/code&gt;, &lt;code&gt;contract_value&lt;/code&gt;, &lt;code&gt;start_date&lt;/code&gt;, and &lt;code&gt;terms&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This tutorial builds a complete document processing pipeline in a fresh Dremio Cloud account. You&apos;ll create sample email and contract data, build a medallion architecture, and use &lt;code&gt;AI_GENERATE&lt;/code&gt; to extract structured fields from free text. A separate section covers using &lt;code&gt;AI_GENERATE&lt;/code&gt; with &lt;code&gt;LIST_FILES&lt;/code&gt; to process unstructured files (PDFs, text files) stored in object storage.&lt;/p&gt;
&lt;h2&gt;What You&apos;ll Build&lt;/h2&gt;
&lt;p&gt;By the end of this tutorial, you&apos;ll have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A dataset with 50+ raw emails and 25+ contract notes containing free-text descriptions&lt;/li&gt;
&lt;li&gt;Bronze views that standardize raw data&lt;/li&gt;
&lt;li&gt;Silver views that join emails with contract information&lt;/li&gt;
&lt;li&gt;Gold views that use &lt;code&gt;AI_GENERATE&lt;/code&gt; with &lt;code&gt;WITH SCHEMA&lt;/code&gt; to extract structured fields from text&lt;/li&gt;
&lt;li&gt;Materialized Iceberg tables that persist extracted data for downstream analytics&lt;/li&gt;
&lt;li&gt;An understanding of how to combine &lt;code&gt;AI_GENERATE&lt;/code&gt; with &lt;code&gt;LIST_FILES&lt;/code&gt; for file-based extraction&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; : &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=ai-generate-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI enabled&lt;/strong&gt; : go to Admin → Project Settings → Preferences → AI section and enable AI features&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model Provider configured&lt;/strong&gt; : Dremio provides a hosted LLM by default, or connect your own (OpenAI, Anthropic, Google Gemini, AWS Bedrock, Azure OpenAI)&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Tables in the built-in Open Catalog use &lt;code&gt;folder.subfolder.table_name&lt;/code&gt; without a catalog prefix. External sources use &lt;code&gt;source_name.schema.table_name&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Understanding AI_GENERATE&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;AI_GENERATE&lt;/code&gt; is the most powerful of Dremio&apos;s AI SQL functions because it returns structured data from unstructured input. The function signature:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;AI_GENERATE(
  [model_name VARCHAR,]
  prompt VARCHAR,
  target_data VARCHAR
  [WITH SCHEMA (field_name DATA_TYPE, ...)]
) → ROW | VARCHAR
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Parameters:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;model_name&lt;/strong&gt; (optional) : specify a model like &lt;code&gt;&apos;openai.gpt-4o&apos;&lt;/code&gt;. Format is &lt;code&gt;modelProvider.modelName&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;prompt&lt;/strong&gt; : the extraction instruction telling the LLM what fields to find in the target data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;target_data&lt;/strong&gt; : the unstructured text column to process. This is usually a column from your table containing emails, notes, descriptions, or document content.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;WITH SCHEMA&lt;/strong&gt; (optional but recommended) : defines the output structure as a ROW type with named, typed columns. Without it, &lt;code&gt;AI_GENERATE&lt;/code&gt; returns a &lt;code&gt;VARCHAR&lt;/code&gt; (plain text). With it, you get a &lt;code&gt;ROW&lt;/code&gt; that you can expand using dot notation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;code&gt;WITH SCHEMA&lt;/code&gt; clause is what makes &lt;code&gt;AI_GENERATE&lt;/code&gt; different from &lt;code&gt;AI_COMPLETE&lt;/code&gt;. Instead of getting free-form text back, you get a typed row where each field is a column you defined, ready for filtering, joining, and aggregating.&lt;/p&gt;
&lt;h3&gt;ROW Type Output&lt;/h3&gt;
&lt;p&gt;When you use &lt;code&gt;WITH SCHEMA&lt;/code&gt;, the result is a &lt;code&gt;ROW&lt;/code&gt; type. Access individual fields with dot notation:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  result.sender,
  result.priority,
  result.action_items
FROM (
  SELECT AI_GENERATE(
    &apos;Extract key information from this email&apos;,
    email_body
    WITH SCHEMA (sender VARCHAR, priority VARCHAR, action_items VARCHAR)
  ) AS result
  FROM emails
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 1: Create Your Folder Structure&lt;/h2&gt;
&lt;p&gt;Open the &lt;strong&gt;SQL Runner&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE FOLDER IF NOT EXISTS aigenerateexp;
CREATE FOLDER IF NOT EXISTS aigenerateexp.document_data;
CREATE FOLDER IF NOT EXISTS aigenerateexp.bronze;
CREATE FOLDER IF NOT EXISTS aigenerateexp.silver;
CREATE FOLDER IF NOT EXISTS aigenerateexp.gold;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 2: Seed Your Sample Data&lt;/h2&gt;
&lt;h3&gt;Raw Emails Table&lt;/h3&gt;
&lt;p&gt;This table simulates emails stored in a CRM system. Each email has a free-text body that contains multiple pieces of information: who sent it, what they&apos;re asking about, how urgent it is, and what action is needed. Extracting these fields manually would require a human to read each email. &lt;code&gt;AI_GENERATE&lt;/code&gt; automates this.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE aigenerateexp.document_data.raw_emails (
  email_id INT,
  received_date DATE,
  email_body VARCHAR
);

INSERT INTO aigenerateexp.document_data.raw_emails VALUES
(1, &apos;2025-09-01&apos;, &apos;Hi team, this is Sarah Chen from Acme Corp. We need to urgently discuss the renewal of our enterprise license. Our current contract expires on October 15th and we want to add 200 additional seats. Can someone from your licensing team contact me by end of week? My direct line is 555-0142. Thanks, Sarah&apos;),
(2, &apos;2025-09-02&apos;, &apos;To whom it may concern, I am writing to report a critical production outage affecting our CloudSync deployment. All file synchronization stopped at 3:47 AM EST this morning. Over 500 users are impacted. We need immediate escalation to your Level 3 support team. This is a P1 issue per our SLA terms. Regards, James Rodriguez, VP of IT, Global Industries&apos;),
(3, &apos;2025-09-03&apos;, &apos;Hello, my name is Emily Watson and I am the procurement manager at TechStart Inc. We are evaluating DataVault Enterprise for our compliance requirements. Could you send me pricing information for a 3-year commitment with 150 users? Also interested in the SOC 2 audit documentation. Our budget review is scheduled for next month so no rush. Best, Emily&apos;),
(4, &apos;2025-09-05&apos;, &apos;URGENT: Our QuickReport installation has been down for 6 hours. Dashboard presentations to the board of directors are in 2 hours. We need the reporting engine restored immediately. Client: MegaCorp Financial. Contact: David Kim, CFO. Phone: 555-0198. This is affecting our quarterly earnings presentation.&apos;),
(5, &apos;2025-09-06&apos;, &apos;Hi there, I wanted to share some positive feedback. Your DevPipeline product has reduced our deployment time from 45 minutes to under 3 minutes. Our engineering team of 80 developers is very happy with the migration. We are considering expanding to our European offices next quarter. Great product! - Michael Brown, CTO, CloudNine Software&apos;),
(6, &apos;2025-09-08&apos;, &apos;Dear Support, we recently purchased MailForge for our marketing team but are having trouble with the SMTP relay configuration. Emails are being flagged as spam by Gmail and Outlook recipients. Our deliverability rate dropped from 98% to 62% after switching to MailForge. This is not urgent but needs resolution by end of month before our holiday campaign launches. Sincerely, Lisa Park, Marketing Director, RetailPlus&apos;),
(7, &apos;2025-09-10&apos;, &apos;To the sales team: We are a healthcare organization looking for a HIPAA-compliant backup solution. We evaluated CloudBackup but have concerns about the BAA terms in section 4.2. Can your legal team review our proposed amendments? We handle PHI for approximately 50000 patients. Timeline: need decision by November 1st. Contact: Dr. Anna Kowalski, Chief Medical Information Officer, Metro Health System&apos;),
(8, &apos;2025-09-11&apos;, &apos;I am writing to formally request cancellation of our HelpDesk360 subscription effective immediately. The product has not met our expectations. Response routing is inaccurate, the knowledge base search returns irrelevant results, and we have experienced 3 unplanned outages in the past month. Please process our refund for the remaining 8 months on our annual contract. Robert Taylor, Operations Director, ServiceFirst Ltd&apos;),
(9, &apos;2025-09-12&apos;, &apos;Quick question: does FormBuilder support WCAG 2.1 AA compliance for government forms? We are a state agency and this is a hard requirement for procurement. If yes, can you point me to the VPAT documentation? Thanks, Maria Garcia, Accessibility Coordinator, State of California Department of Technology&apos;),
(10, &apos;2025-09-14&apos;, &apos;Hi, our team has been using TeamBoard for 6 months and we love it. However we really need a way to export Gantt charts to PDF while preserving the formatting. The current export flattens all the dependency lines and makes the chart unreadable. Is this on your roadmap? Our PMO presents these charts to clients weekly. Tom Williams, PMO Lead, ConsultCo&apos;),
(11, &apos;2025-09-15&apos;, &apos;INCIDENT REPORT: At approximately 14:22 UTC our SecureSign production environment began experiencing signature verification failures. Approximately 340 pending documents across 12 customer accounts are affected. Root cause appears to be an expired intermediate SSL certificate in your signing chain. We need immediate remediation. Kevin Thompson, Security Engineer, LegalTech Partners&apos;),
(12, &apos;2025-09-17&apos;, &apos;Dear team, we operate DataStream to process 2TB of Kafka events daily. Starting last week we noticed exactly-once processing guarantees are failing intermittently. Approximately 0.3% of events are being duplicated in our downstream Postgres sink. This is causing financial reconciliation errors in our billing system. Medium priority but needs attention within 2 weeks. Jennifer Lee, Senior Data Engineer, FinServ Analytics&apos;),
(13, &apos;2025-09-18&apos;, &apos;I would like to schedule a product demo of AdOptimizer for our digital marketing agency. We manage ad spend for 45 clients across Google Ads Facebook and LinkedIn totaling approximately 2.5M monthly. Currently using a competitor but unhappy with the attribution modeling accuracy. When is your team available next week? Chris Martinez, Founder, DigitalEdge Agency&apos;),
(14, &apos;2025-09-20&apos;, &apos;Hi, we just completed our evaluation of ContractManager and would like to proceed with a purchase for 75 seats. We need the Salesforce integration enabled from day one. Our legal team processes roughly 200 contracts per month and we are currently tracking everything in spreadsheets. What is the implementation timeline? Rachel Adams, General Counsel, NovaTech Industries&apos;),
(15, &apos;2025-09-22&apos;, &apos;Attention: We detected unauthorized API access attempts against our LogInsight deployment between 2AM and 4AM EST today. The requests originated from IP addresses in a known threat intelligence database. While our firewall blocked the attempts, we want to understand if LogInsight has additional rate limiting or IP blocking capabilities we should enable. Mark Allen, CISO, DataShield Corp&apos;),
(16, &apos;2025-09-24&apos;, &apos;To billing department: Our organization PayFlow account 8847291 shows a currency conversion fee of 2.8% on GBP transactions. Our contract specifies a 1.5% rate for all EUR and GBP conversions. Please correct this billing discrepancy retroactively for September transactions totaling approximately 45000 GBP. Amanda Clark, Treasury Manager, EuroCommerce BV&apos;),
(17, &apos;2025-09-25&apos;, &apos;Hello, we have been running ChatAssist for our e-commerce customer support and the intent classification accuracy is excellent at around 94%. However we need to add support for Portuguese and Thai languages. Our customer base expanded to Brazil and Thailand this quarter. Is the multi-language add-on available for our current plan tier? Steven Moore, VP Customer Experience, GlobalShop&apos;),
(18, &apos;2025-09-27&apos;, &apos;I am the HR director at a 2000-employee manufacturing company. We need SchedulePro to handle complex shift patterns including rotating shifts split shifts and on-call schedules. Our current system cannot handle the overtime calculations required by state-specific labor laws in California New York and Texas. Can SchedulePro handle multi-state labor law compliance? Catherine Hall, HR Director, PrecisionMfg Inc&apos;),
(19, &apos;2025-09-28&apos;, &apos;Feature request: DesignHub needs better support for design tokens and component variables. When we update a color in our design system it should propagate to all linked components across all projects automatically. Currently we have to manually update 200+ components which defeats the purpose of a design system. Otherwise great product. Brian Harris, Design Systems Lead, PixelPerfect Studio&apos;),
(20, &apos;2025-09-30&apos;, &apos;Dear sales, I am reaching out on behalf of a consortium of 12 regional banks looking for a unified API management solution. We collectively process 4.2M API requests daily and need a solution that supports PSD2 compliance including strong customer authentication and secure communication. Can we arrange a meeting with your banking vertical team? Daniel Wilson, Technology Director, Regional Banking Alliance&apos;),
(21, &apos;2025-10-01&apos;, &apos;Hi, quick update on our InventoryTrack implementation. The barcode scanning module is working perfectly in our main warehouse but the multi-warehouse sync is showing a 15-minute delay between facilities. For perishable goods this delay causes stock discrepancies. Can we reduce the sync interval to real-time? Sophia Nguyen, Warehouse Operations Manager, FreshFoods Distribution&apos;),
(22, &apos;2025-10-03&apos;, &apos;To the product team at CloudSync: I have been a loyal customer for 3 years and want to share feedback. The recent UI redesign is excellent but the new settings menu is confusing. I cannot find the bandwidth throttling option which I use daily. Please make frequently used settings more accessible. Otherwise love the product and have recommended it to 5 colleagues. Laura Jackson, IT Consultant&apos;),
(23, &apos;2025-10-05&apos;, &apos;CRITICAL: Our DataVault encryption at rest failed an internal penetration test. The AES-256 implementation is using ECB mode instead of CBC or GCM for blocks larger than 16 bytes. This is a known vulnerability pattern. We need confirmation that this will be patched before our next compliance audit on November 15th. Michelle Lopez, Information Security Analyst, SecureBank NA&apos;),
(24, &apos;2025-10-06&apos;, &apos;Hello, I am a professor at MIT and we use QuickReport for our research data visualization. We are interested in an academic licensing program. Our department has 35 researchers and 120 graduate students who would benefit from the tool. Is there an education discount available? Dr. Jessica Young, Department of Data Science, MIT&apos;),
(25, &apos;2025-10-08&apos;, &apos;Support ticket follow-up: Our MailForge DKIM configuration issue ticket 4421 was marked resolved but we are still failing DMARC checks from Yahoo and AOL. The DKIM record appears correctly in DNS but the selector value does not match what MailForge sends in the email headers. Need this escalated back to engineering. Andrew White, Email Administrator, NewsMedia Group&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Contract Notes Table&lt;/h3&gt;
&lt;p&gt;This table simulates free-text contract summaries written by account managers. Each note contains key contract details buried in natural language that &lt;code&gt;AI_GENERATE&lt;/code&gt; will extract into structured columns.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE aigenerateexp.document_data.contract_notes (
  note_id INT,
  account_manager VARCHAR,
  note_date DATE,
  note_text VARCHAR
);

INSERT INTO aigenerateexp.document_data.contract_notes VALUES
(1, &apos;Patricia Moore&apos;, &apos;2025-09-01&apos;, &apos;Closed deal with Acme Corp for CloudSync Pro enterprise license. 500 seats at $22/seat/month for 3-year term. Total contract value $396,000. Includes premium support and 99.99% SLA. Renewal auto-triggers 90 days before expiration. Key contact: Sarah Chen, VP Engineering.&apos;),
(2, &apos;Marcus Johnson&apos;, &apos;2025-09-04&apos;, &apos;MegaCorp Financial signed for QuickReport Premium. 200 users, 2-year commitment at $42/user/month. TCV $201,600. Custom integration with their Bloomberg terminal data feed required. Implementation starts Oct 1st. Executive sponsor: David Kim, CFO.&apos;),
(3, &apos;Patricia Moore&apos;, &apos;2025-09-08&apos;, &apos;Renewal discussion with Global Industries for DataVault Enterprise. Current contract of 350 seats expires Dec 31. They want to expand to 600 seats and add the healthcare compliance module. Proposed pricing: $75/seat/month for 600 seats, 3-year term. TCV $1,620,000. Pending legal review of updated BAA.&apos;),
(4, &apos;Sandra Lee&apos;, &apos;2025-09-12&apos;, &apos;New customer TechStart Inc closed for DataVault Enterprise. 150 seats, 3-year term at $82/seat/month. TCV $443,880. SOC 2 documentation provided. Implementation timeline: 6 weeks starting Oct 15. Procurement contact: Emily Watson.&apos;),
(5, &apos;Marcus Johnson&apos;, &apos;2025-09-15&apos;, &apos;DigitalEdge Agency signed AdOptimizer Enterprise with custom attribution modeling. 10 managed accounts, $1,499/month flat rate, 1-year term with option to renew. TCV $17,988. Agency plans to expand to 45 accounts in Q2 2026. Founder Chris Martinez is very enthusiastic about the attribution improvements.&apos;),
(6, &apos;Sandra Lee&apos;, &apos;2025-09-18&apos;, &apos;NovaTech Industries purchased ContractManager Professional. 75 seats at $62/seat/month, 2-year term. TCV $111,600. Salesforce integration required for day-one launch. Processing 200+ contracts monthly currently using spreadsheets. General Counsel Rachel Adams leading internal rollout.&apos;),
(7, &apos;Patricia Moore&apos;, &apos;2025-09-20&apos;, &apos;Lost deal: ServiceFirst Ltd cancelling HelpDesk360 subscription. 8 months remaining on annual contract at $54/seat for 100 seats. Refund request of $43,200 pending finance approval. Customer cited routing accuracy issues, knowledge base relevance problems, and 3 outages. Risk of negative public review.&apos;),
(8, &apos;  Marcus Johnson&apos;, &apos;2025-09-22&apos;, &apos;Expansion deal with CloudNine Software for DevPipeline. Adding 80 European developer seats to existing 80 US seats. European deployment at $72/seat/month, 2-year aligned with US contract end. Additional TCV $138,240. CTO Michael Brown driving the expansion after successful US rollout.&apos;),
(9, &apos;Sandra Lee&apos;, &apos;2025-09-25&apos;, &apos;State of California DPT evaluating FormBuilder for government forms. WCAG 2.1 AA compliance confirmed. Potential 500-seat deployment at government rate of $15/seat/month, 5-year term. TCV $450,000. Requires VPAT documentation submission to procurement. Long sales cycle expected, 6-9 months.&apos;),
(10, &apos;Patricia Moore&apos;, &apos;2025-09-28&apos;, &apos;EuroCommerce BV billing dispute on PayFlow. Customer contract guarantees 1.5% FX rate on EUR/GBP but system charged 2.8% for September. Estimated overcharge: approximately $900 on 45K GBP volume. Finance investigating root cause. Treasury Manager Amanda Clark expects retroactive correction.&apos;),
(11, &apos;Marcus Johnson&apos;, &apos;2025-10-01&apos;, &apos;Regional Banking Alliance consortium deal for APIGateway Pro. 12 banks, centralized deployment, 4.2M daily API calls. PSD2 compliance required. Proposed tiered pricing based on volume: $15,000/month for the consortium. 3-year term. TCV $540,000. Technology Director Daniel Wilson coordinating across all 12 institutions.&apos;),
(12, &apos;Sandra Lee&apos;, &apos;2025-10-03&apos;, &apos;FreshFoods Distribution requesting InventoryTrack real-time sync upgrade. Current standard sync has 15-min delay between 4 warehouses causing perishable goods discrepancies. Upgrade to real-time tier: additional $20/warehouse/month. Annual incremental revenue: $960. Operations Manager Sophia Nguyen is the champion.&apos;),
(13, &apos;Patricia Moore&apos;, &apos;2025-10-05&apos;, &apos;GlobalShop expansion for ChatAssist multi-language support. Adding Portuguese and Thai to existing English and Spanish deployment. Current contract: 300 seats at $68/seat/month. Multi-language add-on: additional $12/seat/month. Added TCV for remaining 18 months: $64,800. VP Customer Experience Steven Moore confirmed budget approval.&apos;),
(14, &apos;Marcus Johnson&apos;, &apos;2025-10-06&apos;, &apos;PrecisionMfg Inc evaluating SchedulePro for 2000 employees across 3 US states. Complex requirements: rotating shifts, split shifts, on-call, multi-state overtime compliance for CA, NY, TX. Enterprise tier at $12/employee/month, 2-year term. TCV $576,000. HR Director Catherine Hall leading evaluation. POC planned for November.&apos;),
(15, &apos;Sandra Lee&apos;, &apos;2025-10-08&apos;, &apos;MIT academic licensing request for QuickReport. 35 researchers plus 120 graduate students. Academic program pricing: 70% discount, $14.99/seat/month for 155 seats. 1-year renewable. TCV $27,881. Dr. Jessica Young in Department of Data Science. Low revenue but high brand visibility in academic publications.&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 3: Build Bronze Views&lt;/h2&gt;
&lt;p&gt;Bronze views cast dates to timestamps and standardize column names.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aigenerateexp.bronze.v_emails AS
SELECT
  email_id,
  CAST(received_date AS TIMESTAMP) AS received_timestamp,
  email_body
FROM aigenerateexp.document_data.raw_emails;

CREATE OR REPLACE VIEW aigenerateexp.bronze.v_contracts AS
SELECT
  note_id,
  account_manager,
  CAST(note_date AS TIMESTAMP) AS note_timestamp,
  note_text
FROM aigenerateexp.document_data.contract_notes;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 4: Build Silver Views&lt;/h2&gt;
&lt;p&gt;This Silver view provides the unified email data that Gold views will process. At this stage, we simply promote the Bronze view for downstream extraction.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aigenerateexp.silver.v_email_pipeline AS
SELECT
  email_id,
  received_timestamp,
  email_body
FROM aigenerateexp.bronze.v_emails;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 5: Build Gold Views with AI_GENERATE&lt;/h2&gt;
&lt;h3&gt;Gold View 1: Email Information Extraction&lt;/h3&gt;
&lt;p&gt;This is the core use case for &lt;code&gt;AI_GENERATE&lt;/code&gt;. Each email contains a sender, their company, the topic, the urgency level, and an action item, but all of this is embedded in free-text prose. The &lt;code&gt;WITH SCHEMA&lt;/code&gt; clause tells the LLM exactly what fields to extract and what types to return.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aigenerateexp.gold.v_email_extracted AS
SELECT
  email_id,
  received_timestamp,
  email_body,
  extracted.sender_name,
  extracted.company,
  extracted.topic,
  extracted.urgency,
  extracted.action_required
FROM (
  SELECT
    email_id,
    received_timestamp,
    email_body,
    AI_GENERATE(
      &apos;Extract the following information from this email. If a field is not present, return N/A.&apos;,
      email_body
      WITH SCHEMA (
        sender_name VARCHAR,
        company VARCHAR,
        topic VARCHAR,
        urgency VARCHAR,
        action_required VARCHAR
      )
    ) AS extracted
  FROM aigenerateexp.silver.v_email_pipeline
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The subquery calls &lt;code&gt;AI_GENERATE&lt;/code&gt; and aliases the result as &lt;code&gt;extracted&lt;/code&gt;. The outer query then expands the ROW using dot notation (&lt;code&gt;extracted.sender_name&lt;/code&gt;, &lt;code&gt;extracted.company&lt;/code&gt;, etc.). Each field becomes a regular column you can filter, group, or join on.&lt;/p&gt;
&lt;h3&gt;Gold View 2: Contract Detail Extraction&lt;/h3&gt;
&lt;p&gt;Contract notes contain structured deal information in natural language. &lt;code&gt;AI_GENERATE&lt;/code&gt; extracts the client name, product, seat count, contract value, term, and key contact into individual columns.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aigenerateexp.gold.v_contract_details AS
SELECT
  note_id,
  account_manager,
  note_timestamp,
  note_text,
  details.client_name,
  details.product,
  details.seat_count,
  details.monthly_rate,
  details.contract_term_years,
  details.total_contract_value,
  details.key_contact,
  details.deal_status
FROM (
  SELECT
    note_id,
    account_manager,
    note_timestamp,
    note_text,
    AI_GENERATE(
      &apos;Extract deal information from this contract note. For total_contract_value use only the numeric amount. For deal_status classify as Won, Lost, Pending, or Expansion.&apos;,
      note_text
      WITH SCHEMA (
        client_name VARCHAR,
        product VARCHAR,
        seat_count INT,
        monthly_rate DECIMAL(10,2),
        contract_term_years INT,
        total_contract_value DECIMAL(12,2),
        key_contact VARCHAR,
        deal_status VARCHAR
      )
    ) AS details
  FROM aigenerateexp.bronze.v_contracts
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice the &lt;code&gt;WITH SCHEMA&lt;/code&gt; uses &lt;code&gt;INT&lt;/code&gt; for seat count, &lt;code&gt;DECIMAL&lt;/code&gt; for monetary values, and &lt;code&gt;VARCHAR&lt;/code&gt; for text fields. The LLM converts the free-text values to the types you specify. If a contract note says &amp;quot;75 seats,&amp;quot; the &lt;code&gt;seat_count&lt;/code&gt; column returns the integer &lt;code&gt;75&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;How WITH SCHEMA Changes the Output&lt;/h3&gt;
&lt;p&gt;Without &lt;code&gt;WITH SCHEMA&lt;/code&gt;, &lt;code&gt;AI_GENERATE&lt;/code&gt; returns a &lt;code&gt;VARCHAR&lt;/code&gt; with the LLM&apos;s freeform response. This is harder to work with downstream:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Without WITH SCHEMA: returns plain text
SELECT AI_GENERATE(
  &apos;Extract the sender name and company from this email&apos;,
  email_body
) AS raw_text
FROM aigenerateexp.bronze.v_emails
LIMIT 3;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The raw text might look like: &amp;quot;Sender: Sarah Chen, Company: Acme Corp&amp;quot; but there&apos;s no guarantee of consistent formatting across rows. With &lt;code&gt;WITH SCHEMA&lt;/code&gt;, every row returns the same column structure, making the output predictable and queryable.&lt;/p&gt;
&lt;h2&gt;Persisting Results with CTAS&lt;/h2&gt;
&lt;p&gt;Materialize your extracted data into Iceberg tables to avoid repeated LLM calls:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE aigenerateexp.gold.emails_extracted AS
SELECT * FROM aigenerateexp.gold.v_email_extracted;

CREATE TABLE aigenerateexp.gold.contracts_extracted AS
SELECT * FROM aigenerateexp.gold.v_contract_details;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once materialized, you can run standard SQL analytics on the extracted fields without incurring LLM token costs. Refresh the tables when new emails or contracts arrive.&lt;/p&gt;
&lt;h2&gt;Step 6: Enable AI-Generated Wikis and Tags&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;Admin&lt;/strong&gt; in the left sidebar, then go to &lt;strong&gt;Project Settings&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Select the &lt;strong&gt;Preferences&lt;/strong&gt; tab.&lt;/li&gt;
&lt;li&gt;Scroll to the &lt;strong&gt;AI&lt;/strong&gt; section and enable &lt;strong&gt;Generate Wikis and Labels&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Go to the &lt;strong&gt;Catalog&lt;/strong&gt; and navigate to your Gold views under &lt;code&gt;aigenerateexp.gold&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Edit&lt;/strong&gt; button (pencil icon) next to each view.&lt;/li&gt;
&lt;li&gt;In the &lt;strong&gt;Details&lt;/strong&gt; tab, click &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Repeat for all Gold views.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Enhance the generated wikis with context like &amp;quot;sender_name and company are LLM-extracted from raw email text. Urgency is classified by the LLM based on language cues like &apos;urgent&apos;, &apos;critical&apos;, and &apos;immediate&apos;.&amp;quot;&lt;/p&gt;
&lt;h2&gt;Step 7: Ask Questions with the AI Agent&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;Which companies sent urgent emails?&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent queries &lt;code&gt;v_email_extracted&lt;/code&gt;, filters by &lt;code&gt;urgency&lt;/code&gt; containing &apos;urgent&apos; or &apos;critical&apos;, and returns the company names and topics.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;Show me a chart of email topics by urgency level&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent groups by &lt;code&gt;topic&lt;/code&gt; and &lt;code&gt;urgency&lt;/code&gt; in &lt;code&gt;v_email_extracted&lt;/code&gt; and creates a visualization showing which topics generate the most urgent communications.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;List all won deals over $100,000 with their key contacts&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent filters &lt;code&gt;v_contract_details&lt;/code&gt; for &lt;code&gt;deal_status = &apos;Won&apos;&lt;/code&gt; and &lt;code&gt;total_contract_value &amp;gt; 100000&lt;/code&gt;, returning client names, products, values, and key contacts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;Create a chart showing total contract value by account manager and deal status&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent creates a stacked bar chart from &lt;code&gt;v_contract_details&lt;/code&gt; comparing each account manager&apos;s total pipeline across Won, Lost, Pending, and Expansion statuses.&lt;/p&gt;
&lt;h2&gt;Processing Unstructured Files with AI_GENERATE and LIST_FILES&lt;/h2&gt;
&lt;p&gt;The examples above process text that&apos;s already stored in table columns. But many organizations have unstructured files, such as PDFs, text documents, images, and scanned invoices, sitting in object storage (S3, Azure Blob, GCS) that have never been queryable through SQL.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s &lt;code&gt;LIST_FILES&lt;/code&gt; table function bridges this gap. It recursively lists files from a connected source and returns metadata about each file. Combined with &lt;code&gt;AI_GENERATE&lt;/code&gt;, you can read file content and extract structured data from documents that were previously invisible to your analytics platform.&lt;/p&gt;
&lt;h3&gt;How LIST_FILES Works&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;LIST_FILES&lt;/code&gt; is a table function that returns metadata for files in a connected storage source:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT *
FROM TABLE(
  LIST_FILES(
    path =&amp;gt; &apos;your_s3_source.folder_name&apos;,
    recursive =&amp;gt; true
  )
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The function returns columns including the file source, path, size, and last modification time. This metadata feeds into &lt;code&gt;AI_GENERATE&lt;/code&gt; as file references.&lt;/p&gt;
&lt;h3&gt;Hypothetical Example: Invoice Processing&lt;/h3&gt;
&lt;p&gt;Suppose you have an S3 bucket connected to Dremio as a source called &lt;code&gt;company_s3&lt;/code&gt;, with a folder &lt;code&gt;/invoices/2025/&lt;/code&gt; containing PDF invoices from vendors. Here&apos;s how you&apos;d extract structured data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Step 1: List all invoice files
SELECT *
FROM TABLE(
  LIST_FILES(
    path =&amp;gt; &apos;company_s3.invoices.2025&apos;,
    recursive =&amp;gt; true
  )
);

-- Step 2: Extract structured data from each invoice
SELECT
  invoice_data.vendor_name,
  invoice_data.invoice_number,
  invoice_data.invoice_date,
  invoice_data.total_amount,
  invoice_data.currency,
  invoice_data.line_items
FROM (
  SELECT AI_GENERATE(
    &apos;Extract the vendor name, invoice number, date, total amount, currency, and a summary of line items from this invoice.&apos;,
    file_content
    WITH SCHEMA (
      vendor_name VARCHAR,
      invoice_number VARCHAR,
      invoice_date VARCHAR,
      total_amount DECIMAL(12,2),
      currency VARCHAR,
      line_items VARCHAR
    )
  ) AS invoice_data
  FROM TABLE(
    LIST_FILES(
      path =&amp;gt; &apos;company_s3.invoices.2025&apos;,
      recursive =&amp;gt; true
    )
  )
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Hypothetical Example: Resume Screening&lt;/h3&gt;
&lt;p&gt;An HR team stores candidate resumes as PDFs in an S3 bucket. &lt;code&gt;AI_GENERATE&lt;/code&gt; extracts candidate information for structured analysis:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  candidate.full_name,
  candidate.email,
  candidate.years_experience,
  candidate.primary_skill,
  candidate.education_level,
  candidate.current_company
FROM (
  SELECT AI_GENERATE(
    &apos;Extract candidate information from this resume. For years_experience provide a numeric estimate.&apos;,
    file_content
    WITH SCHEMA (
      full_name VARCHAR,
      email VARCHAR,
      years_experience INT,
      primary_skill VARCHAR,
      education_level VARCHAR,
      current_company VARCHAR
    )
  ) AS candidate
  FROM TABLE(
    LIST_FILES(
      path =&amp;gt; &apos;hr_s3.resumes.2025_q4&apos;,
      recursive =&amp;gt; true
    )
  )
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Hypothetical Example: Quarterly Report Analysis&lt;/h3&gt;
&lt;p&gt;Finance stores quarterly PDF reports from subsidiaries. Extract key financial metrics without manual reading:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  metrics.subsidiary_name,
  metrics.quarter,
  metrics.total_revenue,
  metrics.net_income,
  metrics.headcount,
  metrics.key_risks
FROM (
  SELECT AI_GENERATE(
    &apos;Extract financial summary data from this quarterly report.&apos;,
    file_content
    WITH SCHEMA (
      subsidiary_name VARCHAR,
      quarter VARCHAR,
      total_revenue DECIMAL(15,2),
      net_income DECIMAL(15,2),
      headcount INT,
      key_risks VARCHAR
    )
  ) AS metrics
  FROM TABLE(
    LIST_FILES(
      path =&amp;gt; &apos;finance_s3.quarterly_reports.2025&apos;,
      recursive =&amp;gt; true
    )
  )
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Materializing File Extraction Results&lt;/h3&gt;
&lt;p&gt;Once you&apos;ve extracted structured data from files, persist it as an Iceberg table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE aigenerateexp.gold.invoices_extracted AS
SELECT
  invoice_data.vendor_name,
  invoice_data.invoice_number,
  invoice_data.total_amount,
  invoice_data.currency
FROM (
  SELECT AI_GENERATE(
    &apos;Extract invoice details&apos;,
    file_content
    WITH SCHEMA (vendor_name VARCHAR, invoice_number VARCHAR, total_amount DECIMAL(12,2), currency VARCHAR)
  ) AS invoice_data
  FROM TABLE(LIST_FILES(path =&amp;gt; &apos;company_s3.invoices.2025&apos;, recursive =&amp;gt; true))
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates a governed, queryable Iceberg table from raw PDF invoices. The table supports time travel, schema evolution, and ACID transactions. Build Reflections on it for dashboard acceleration.&lt;/p&gt;
&lt;h3&gt;Key Considerations for LIST_FILES + AI_GENERATE&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Source connectivity:&lt;/strong&gt; &lt;code&gt;LIST_FILES&lt;/code&gt; requires a connected storage source (S3, Azure Storage, GCS) in your Dremio project. The source must be configured with appropriate read permissions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;File format support:&lt;/strong&gt; Dremio&apos;s AI functions can process text-based content including PDFs, text files, and document formats. The LLM interprets the file content and extracts fields per your schema definition.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Token costs:&lt;/strong&gt; Processing files through the LLM consumes tokens proportional to file size. Filter your &lt;code&gt;LIST_FILES&lt;/code&gt; results before passing them to &lt;code&gt;AI_GENERATE&lt;/code&gt; to avoid processing unnecessary files:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Filter to only recent files before AI processing
SELECT AI_GENERATE(...)
FROM TABLE(LIST_FILES(path =&amp;gt; &apos;company_s3.invoices.2025&apos;, recursive =&amp;gt; true))
WHERE modification_time &amp;gt; TIMESTAMP &apos;2025-09-01 00:00:00&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Engine routing:&lt;/strong&gt; Use &lt;code&gt;query_calls_ai_functions()&lt;/code&gt; to route file processing queries to a dedicated engine, isolating heavy batch extraction from your regular analytical workloads.&lt;/p&gt;
&lt;h2&gt;Why Apache Iceberg Matters&lt;/h2&gt;
&lt;p&gt;Extracted data stored as Iceberg tables benefits from automated performance management. As your extraction pipeline grows from hundreds to thousands of documents, Iceberg&apos;s compaction, manifest optimization, and clustering keep query performance consistent without manual tuning.&lt;/p&gt;
&lt;h3&gt;Iceberg vs. Federated for AI_GENERATE Workloads&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Use CTAS materialization when:&lt;/strong&gt; You&apos;re extracting from historical documents (past invoices, old contracts, archived emails). Run the extraction once, query the results forever.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use live views when:&lt;/strong&gt; You need real-time extraction from a continuously updating text column in a federated database. Pair with manual Reflections to cache results at a controlled refresh interval, balancing extraction cost against data freshness.&lt;/p&gt;
&lt;h2&gt;Next Steps&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Connect your real data sources&lt;/strong&gt; : replace simulated tables with federated connections to your email system, CRM, and document storage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Connect an S3 or Azure source&lt;/strong&gt; : enable &lt;code&gt;LIST_FILES&lt;/code&gt; processing on your actual unstructured file repositories&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Add FGAC&lt;/strong&gt; : mask extracted PII fields (emails, phone numbers, names) for downstream consumers who shouldn&apos;t see personal data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Build Reflections&lt;/strong&gt; : create Reflections on CTAS-materialized extraction tables for fast dashboard queries at zero LLM cost&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If your organization has unstructured text trapped in database columns or files sitting unanalyzed in object storage, &lt;code&gt;AI_GENERATE&lt;/code&gt; turns that text into structured, queryable, governed data. Define a schema, write a prompt, and run a query. The extraction happens inside your lakehouse with the same access controls and governance that apply to all your other data.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=ai-generate-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and start extracting structured data from your unstructured text.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Oracle Database to Dremio Cloud: Enterprise Analytics Without Data Movement</title><link>https://iceberglakehouse.com/posts/2026-03-connector-oracle/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-oracle/</guid><description>
Oracle Database runs the most critical enterprise applications in the world : ERP systems, financial ledgers, supply chain management, and HR platfor...</description><pubDate>Sun, 01 Mar 2026 11:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Oracle Database runs the most critical enterprise applications in the world : ERP systems, financial ledgers, supply chain management, and HR platforms. These systems generate massive volumes of data that business teams want to analyze, but running analytical queries directly against Oracle is expensive (license costs scale with CPU usage), complex (Oracle-specific SQL dialects and tooling), and risky (heavy queries can impact transactional performance).&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects to Oracle Database and queries it in place using standard SQL. You don&apos;t need to license additional Oracle tools, build ETL pipelines, or export data to a separate warehouse. Dremio pushes filters and aggregations to Oracle, fetches only the results, and lets you join Oracle data with every other source in your organization in a single query.&lt;/p&gt;
&lt;p&gt;This guide walks through the complete setup, including Oracle-specific features like native encryption, user impersonation, service name configuration, and the extensive predicate pushdown support.&lt;/p&gt;
&lt;h2&gt;Why Oracle Users Need Dremio&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Oracle licensing costs make analytics expensive.&lt;/strong&gt; Oracle licenses are typically tied to CPU cores. Running analytical workloads on your production Oracle instance consumes CPU, which means higher licensing costs. Dremio&apos;s Reflections create pre-computed copies of frequently queried Oracle data. After the initial query, subsequent analytics hit the Reflection :  not Oracle ,  reducing CPU consumption and license exposure.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cross-system analytics require ETL.&lt;/strong&gt; Your financial data is in Oracle, your CRM data is in PostgreSQL, and your marketing data is in S3. Without a federation layer, joining these requires building ETL pipelines that extract data from each source, transform it, and load it into a central warehouse. That&apos;s months of engineering work. Dremio federates across all three sources with a single SQL query.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Oracle&apos;s analytical tooling is Oracle-specific.&lt;/strong&gt; Oracle Analytics Cloud, Oracle BI, and Oracle Data Integrator work well within the Oracle ecosystem but don&apos;t extend to non-Oracle data. Dremio provides a vendor-neutral SQL layer that works with any BI tool (Tableau, Power BI, Looker) via Arrow Flight or ODBC, covering Oracle and every other connected source.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No semantic layer for AI.&lt;/strong&gt; Oracle tables use technical names and lack the business context that AI agents need to generate accurate SQL. Dremio&apos;s semantic layer lets you create views with business logic, attach wiki descriptions, and enable the AI Agent to answer questions like &amp;quot;What&apos;s our quarterly revenue by product line?&amp;quot; by understanding what &amp;quot;quarterly revenue&amp;quot; means from your metadata.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;p&gt;Before connecting Oracle to Dremio Cloud, confirm you have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Oracle hostname or IP address&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port number&lt;/strong&gt; : Oracle defaults to &lt;code&gt;1521&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Service name&lt;/strong&gt; : the Oracle service name (not the SID) for your database&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Username and password&lt;/strong&gt; : an Oracle user with &lt;code&gt;SELECT&lt;/code&gt; privileges on the relevant schemas&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; : port 1521 must be reachable from Dremio Cloud&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; : &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-oracle-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect Oracle to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Oracle Source&lt;/h3&gt;
&lt;p&gt;In the Dremio console, click the &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; in the left sidebar and select &lt;strong&gt;Oracle&lt;/strong&gt; from the database source types.&lt;/p&gt;
&lt;h3&gt;2. Configure General Settings&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; A descriptive identifier (e.g., &lt;code&gt;erp-oracle&lt;/code&gt; or &lt;code&gt;finance-oracle&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; The Oracle server hostname.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port:&lt;/strong&gt; Default &lt;code&gt;1521&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Service Name:&lt;/strong&gt; The Oracle service name for your database.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enable TLS encryption:&lt;/strong&gt; Toggle this on for encrypted connections over TLS.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Oracle Native Encryption:&lt;/strong&gt; If you don&apos;t use TLS, Oracle supports its own encryption protocol. Options are:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Accepted (default):&lt;/strong&gt; Allows both encrypted and unencrypted connections.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Requested:&lt;/strong&gt; Prefers encryption but accepts unencrypted if not available.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Required:&lt;/strong&gt; Only encrypted connections allowed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rejected:&lt;/strong&gt; No encryption.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can use either TLS or Oracle Native Encryption, but not both on the same source.&lt;/p&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Three options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Master Authentication:&lt;/strong&gt; Username and password entered directly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secret Resource URL:&lt;/strong&gt; Password stored in AWS Secrets Manager, referenced by ARN.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kerberos:&lt;/strong&gt; For environments where Oracle is configured with Kerberos authentication.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Configure Advanced Options&lt;/h3&gt;
&lt;p&gt;Oracle has several unique advanced settings:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Use timezone as connection region&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Uses the timezone to set the connection region&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Include synonyms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Makes Oracle synonyms visible as datasets in Dremio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Map Oracle DATE to TIMESTAMP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Oracle&apos;s &lt;code&gt;DATE&lt;/code&gt; type includes time components. Enable this to expose them as &lt;code&gt;TIMESTAMP&lt;/code&gt; in Dremio instead of truncating to &lt;code&gt;DATE&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Record fetch size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rows per batch (default 200, set 0 for automatic)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Use LDAP Naming Services&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Authenticate via LDAP rather than Oracle&apos;s local user database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;User Impersonation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Run queries under each Dremio user&apos;s own Oracle credentials (see below)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom JDBC parameters&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;5. User Impersonation (Optional but Valuable)&lt;/h3&gt;
&lt;p&gt;Oracle supports user impersonation through proxy authentication. This means each Dremio user runs queries under their own Oracle username, with their own Oracle permissions, rather than sharing a single service account.&lt;/p&gt;
&lt;p&gt;To set this up:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Ensure each Dremio user has a matching username in Oracle.&lt;/li&gt;
&lt;li&gt;In Oracle, grant proxy authentication: &lt;code&gt;ALTER USER analyst_user GRANT CONNECT THROUGH dremio_service_user;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;In Dremio&apos;s source settings, enable &lt;strong&gt;User Impersonation&lt;/strong&gt; under Advanced Options.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is particularly valuable in regulated industries where audit trails need to track which individual accessed which data.&lt;/p&gt;
&lt;h3&gt;6. Save the Connection&lt;/h3&gt;
&lt;p&gt;Configure Reflection Refresh, Metadata Refresh, and Privileges as needed, then click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Query Oracle Data from Dremio&lt;/h2&gt;
&lt;p&gt;Browse your Oracle schemas and tables, then run standard SQL:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT department_id, department_name, manager_id, location_id
FROM &amp;quot;erp-oracle&amp;quot;.HR.DEPARTMENTS
WHERE location_id = 1700;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Dremio pushes the &lt;code&gt;WHERE&lt;/code&gt; clause to Oracle and transfers only the matching rows.&lt;/p&gt;
&lt;h2&gt;Federate Oracle with Other Sources&lt;/h2&gt;
&lt;p&gt;Combine Oracle ERP data with S3 data and PostgreSQL data in one query:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  d.department_name,
  COUNT(e.employee_id) AS headcount,
  AVG(e.salary) AS avg_salary,
  SUM(b.budget_amount) AS total_budget
FROM &amp;quot;erp-oracle&amp;quot;.HR.DEPARTMENTS d
JOIN &amp;quot;erp-oracle&amp;quot;.HR.EMPLOYEES e ON d.department_id = e.department_id
LEFT JOIN &amp;quot;finance-postgres&amp;quot;.budgets.dept_budgets b ON d.department_id = b.dept_id
GROUP BY d.department_name
ORDER BY total_budget DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Oracle handles the department-employee join (predicate pushdown), and Dremio handles the cross-source join with PostgreSQL budget data.&lt;/p&gt;
&lt;h2&gt;Predicate Pushdown Support&lt;/h2&gt;
&lt;p&gt;Oracle has one of the most comprehensive pushdown profiles in Dremio. The engine offloads:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;All standard comparisons and logical operators&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aggregations:&lt;/strong&gt; SUM, AVG, COUNT, MIN, MAX, STDDEV, MEDIAN, VAR_POP, VAR_SAMP, COVAR_POP, COVAR_SAMP, PERCENTILE_CONT, PERCENTILE_DISC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Math functions:&lt;/strong&gt; ABS, CEIL, FLOOR, ROUND, MOD, SQRT, POWER, LOG, EXP, SIGN, trigonometric functions (SIN, COS, TAN, ASIN, ACOS, ATAN, SINH, COSH, TANH)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;String functions:&lt;/strong&gt; CONCAT, SUBSTR, LENGTH, LOWER, UPPER, TRIM, REPLACE, REVERSE, LPAD, RPAD&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Date functions:&lt;/strong&gt; DATE_ADD, DATE_SUB, DATE_TRUNC, EXTRACT, ADD_MONTHS, LAST_DAY, TO_CHAR, TO_DATE, TRUNC&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This extensive pushdown support means Oracle does most of the heavy lifting for filtering and aggregation, and Dremio only transfers the summarized results across the network.&lt;/p&gt;
&lt;h2&gt;Data Type Mapping&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Oracle&lt;/th&gt;
&lt;th&gt;Dremio&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;NUMBER&lt;/td&gt;
&lt;td&gt;DECIMAL&lt;/td&gt;
&lt;td&gt;Preserves precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VARCHAR2 / NVARCHAR2 / CHAR / NCHAR&lt;/td&gt;
&lt;td&gt;VARCHAR&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DATE&lt;/td&gt;
&lt;td&gt;DATE or TIMESTAMP&lt;/td&gt;
&lt;td&gt;Use advanced option to map to TIMESTAMP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TIMESTAMP&lt;/td&gt;
&lt;td&gt;TIMESTAMP&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BINARY_FLOAT&lt;/td&gt;
&lt;td&gt;FLOAT&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BINARY_DOUBLE&lt;/td&gt;
&lt;td&gt;DOUBLE&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FLOAT&lt;/td&gt;
&lt;td&gt;DOUBLE&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BLOB / RAW / LONG RAW&lt;/td&gt;
&lt;td&gt;VARBINARY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LONG&lt;/td&gt;
&lt;td&gt;VARCHAR&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INTERVALDS&lt;/td&gt;
&lt;td&gt;INTERVAL (day to seconds)&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INTERVALYM&lt;/td&gt;
&lt;td&gt;INTERVAL (years to months)&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;code&gt;CLOB&lt;/code&gt;, &lt;code&gt;XMLTYPE&lt;/code&gt;, and Oracle spatial types are not supported through the connector.&lt;/p&gt;
&lt;h2&gt;Build a Semantic Layer Over Oracle&lt;/h2&gt;
&lt;p&gt;Create views that translate Oracle&apos;s technical schema into business-friendly analytics:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.department_performance AS
SELECT
  d.department_name,
  COUNT(e.employee_id) AS employee_count,
  ROUND(AVG(e.salary), 2) AS avg_salary,
  MAX(e.hire_date) AS most_recent_hire,
  CASE
    WHEN COUNT(e.employee_id) &amp;gt; 50 THEN &apos;Large&apos;
    WHEN COUNT(e.employee_id) &amp;gt; 20 THEN &apos;Medium&apos;
    ELSE &apos;Small&apos;
  END AS department_size
FROM &amp;quot;erp-oracle&amp;quot;.HR.DEPARTMENTS d
LEFT JOIN &amp;quot;erp-oracle&amp;quot;.HR.EMPLOYEES e ON d.department_id = e.department_id
GROUP BY d.department_name;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Attach wiki context via the Catalog (edit pencil icon → Details → Generate Wiki/Tags) so the AI Agent can answer questions like &amp;quot;Which large departments have the highest average salary?&amp;quot;&lt;/p&gt;
&lt;h2&gt;When to Keep Data in Oracle vs. Migrate to Iceberg&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in Oracle:&lt;/strong&gt; Actively transactional data (current orders, inventory, ledger entries), data that applications read and write frequently, data subject to Oracle-specific constraints and triggers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical archives (closed fiscal quarters, past-year orders), aggregated reporting tables, datasets queried heavily for analytics but rarely written.&lt;/p&gt;
&lt;p&gt;For data that stays in Oracle, create manual Reflections with a refresh schedule that balances data freshness against Oracle CPU usage. For migrated data, Dremio&apos;s Open Catalog provides automated compaction, time travel, and Autonomous Reflections.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on Oracle Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent lets business users ask questions about Oracle data in plain English. An HR director asks &amp;quot;Which large departments have the highest average salary?&amp;quot; and the Agent generates accurate SQL by reading the wiki descriptions on your &lt;code&gt;department_performance&lt;/code&gt; view. The Agent understands what &amp;quot;large&amp;quot; means (employee_count &amp;gt; 50) because you&apos;ve defined it in the semantic layer.&lt;/p&gt;
&lt;p&gt;This is particularly valuable for Oracle environments where decades of institutional knowledge about schema structures, table naming conventions (like &lt;code&gt;HR.DEPARTMENTS&lt;/code&gt;), and column semantics lives in senior DBAs&apos; heads.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; connects Claude, ChatGPT, and other AI clients to your Oracle data through Dremio:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth application in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs (e.g., &lt;code&gt;https://claude.ai/api/mcp/auth_callback&lt;/code&gt; for Claude, &lt;code&gt;https://chatgpt.com/connector_platform_oauth_redirect&lt;/code&gt; for ChatGPT)&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt; (US) or &lt;code&gt;mcp.eu.dremio.cloud/mcp/{project_id}&lt;/code&gt; (EU)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A CFO asks Claude &amp;quot;Compare department headcount and budget utilization across our Oracle ERP&amp;quot; and gets governed, accurate results from Oracle data : without knowing SQL or Oracle table structures.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;p&gt;Use AI directly in queries against Oracle data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify departments by operational health
SELECT
  department_name,
  employee_count,
  avg_salary,
  department_size,
  AI_CLASSIFY(
    &apos;Based on these HR metrics, classify the department health&apos;,
    &apos;Department: &apos; || department_name || &apos;, Employees: &apos; || CAST(employee_count AS VARCHAR) || &apos;, Avg Salary: $&apos; || CAST(avg_salary AS VARCHAR) || &apos;, Size: &apos; || department_size,
    ARRAY[&apos;Thriving&apos;, &apos;Stable&apos;, &apos;Understaffed&apos;, &apos;Needs Attention&apos;]
  ) AS department_health
FROM analytics.gold.department_performance;

-- Generate executive briefings from Oracle data
SELECT
  department_name,
  AI_GENERATE(
    &apos;Write a one-sentence executive summary for this department&apos;,
    &apos;Department: &apos; || department_name || &apos;, Headcount: &apos; || CAST(employee_count AS VARCHAR) || &apos;, Avg Salary: $&apos; || CAST(avg_salary AS VARCHAR) || &apos;, Most Recent Hire: &apos; || CAST(most_recent_hire AS VARCHAR)
  ) AS executive_summary
FROM analytics.gold.department_performance
WHERE department_size = &apos;Large&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;AI_CLASSIFY&lt;/code&gt; runs LLM inference to categorize departments. &lt;code&gt;AI_GENERATE&lt;/code&gt; creates narrative summaries. Both run inline in your SQL queries, enriching Oracle data with AI.&lt;/p&gt;
&lt;h2&gt;Reflections for Performance&lt;/h2&gt;
&lt;p&gt;Oracle Database licensing is expensive : especially Enterprise Edition with Analytics and Diagnostics Packs. Reflections offload analytical queries:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the view in the &lt;strong&gt;Catalog&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Select columns and set the &lt;strong&gt;Refresh Interval&lt;/strong&gt; : for HR data, daily; for financial data, match to reporting cycles&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;BI tools get sub-second responses from Reflections. Oracle focuses on transactional workloads. A department performance dashboard with hourly refreshes generates zero Oracle CPU consumption after the Reflection is built.&lt;/p&gt;
&lt;h2&gt;Governance on Oracle Data&lt;/h2&gt;
&lt;p&gt;Oracle has its own security model (Oracle Database Vault, VPD), but it doesn&apos;t extend to non-Oracle sources. Dremio&apos;s Fine-Grained Access Control (FGAC) provides unified governance:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask salary data, employee SSNs, and performance ratings from specific roles. An HR generalist sees headcount but not compensation details.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Department-level access : a department manager sees only their department. Regional HR sees only their region.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies across Oracle, PostgreSQL, S3, and all other sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, and MCP Server.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC driver or native connector : no Oracle client needed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic data access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; for SQL-based transformations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query Oracle data from their IDE. Ask Copilot &amp;quot;Show me understaffed departments from Oracle HR&amp;quot; and get SQL generated from your semantic layer : without knowing Oracle schema conventions.&lt;/p&gt;
&lt;h2&gt;When to Keep Data in Oracle vs. Migrate&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in Oracle:&lt;/strong&gt; Transactional data for active applications, data with PL/SQL dependencies (stored procedures, triggers, packages), data subject to Oracle RAC clustering, data managed by Oracle GoldenGate replication.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical HR data and archives, closed fiscal year financials, data consumed by non-Oracle tools, datasets where Oracle per-core licensing exceeds analytical value. Migrated Iceberg tables get automatic compaction, time travel, and Autonomous Reflections.&lt;/p&gt;
&lt;p&gt;For data staying in Oracle, create manual Reflections to reduce Oracle CPU load. For migrated Iceberg data, Dremio handles optimization automatically.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Oracle Database users pay a premium for their database&apos;s reliability and enterprise features. Dremio Cloud lets you extract analytical value from that data without additional Oracle licensing, ETL pipelines, or vendor-specific BI tools.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-oracle-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your Oracle databases alongside your other data sources.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Generate Summaries and Insights with Dremio&apos;s AI_COMPLETE Function</title><link>https://iceberglakehouse.com/posts/2026-03-ai-ai-complete/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-ai-ai-complete/</guid><description>
Every data team has a version of this problem: a table full of raw data that needs human-readable summaries, translations, or narrative descriptions....</description><pubDate>Sun, 01 Mar 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Every data team has a version of this problem: a table full of raw data that needs human-readable summaries, translations, or narrative descriptions. Product descriptions that need rewriting for a new market. Customer records that need one-sentence executive summaries. Support interactions that need post-call notes.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;AI_COMPLETE&lt;/code&gt; brings an LLM directly into your SQL query to produce that text. You write a prompt, pass in your data columns, and get generated text back as a &lt;code&gt;VARCHAR&lt;/code&gt;. No Python notebooks, no external APIs, no data exports.&lt;/p&gt;
&lt;p&gt;This tutorial builds a complete product analytics pipeline in a fresh Dremio Cloud account. You&apos;ll create sample product and sales data, build a medallion architecture, and use &lt;code&gt;AI_COMPLETE&lt;/code&gt; to generate product summaries, executive briefings, marketing copy, and translations, all inside SQL.&lt;/p&gt;
&lt;h2&gt;What You&apos;ll Build&lt;/h2&gt;
&lt;p&gt;By the end of this tutorial, you&apos;ll have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A product catalog with 50+ products and 50+ sales records&lt;/li&gt;
&lt;li&gt;Bronze views that standardize raw data&lt;/li&gt;
&lt;li&gt;Silver views that compute sales metrics per product&lt;/li&gt;
&lt;li&gt;Gold views that use &lt;code&gt;AI_COMPLETE&lt;/code&gt; to generate summaries, marketing descriptions, and translated content&lt;/li&gt;
&lt;li&gt;Materialized Iceberg tables that persist generated text for dashboards&lt;/li&gt;
&lt;li&gt;Wiki metadata that enables the AI Agent to answer natural language questions about your enriched catalog&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; : &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=ai-complete-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI enabled&lt;/strong&gt; : go to Admin → Project Settings → Preferences → AI section and enable AI features&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model Provider configured&lt;/strong&gt; : Dremio provides a hosted LLM by default, or connect your own (OpenAI, Anthropic, Google Gemini, AWS Bedrock, Azure OpenAI)&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Tables in the built-in Open Catalog use &lt;code&gt;folder.subfolder.table_name&lt;/code&gt; without a catalog prefix. External sources use &lt;code&gt;source_name.schema.table_name&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Understanding AI_COMPLETE&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;AI_COMPLETE&lt;/code&gt; sends a prompt to your configured LLM and returns the generated text as a &lt;code&gt;VARCHAR&lt;/code&gt;. The function signature:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;AI_COMPLETE(
  [model_name VARCHAR,]
  prompt VARCHAR
) → VARCHAR
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Parameters:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;model_name&lt;/strong&gt; (optional) : specify a model like &lt;code&gt;&apos;openai.gpt-4o&apos;&lt;/code&gt;. Format is &lt;code&gt;modelProvider.modelName&lt;/code&gt;. If omitted, Dremio uses your default model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;prompt&lt;/strong&gt; : the text instruction for the LLM. Typically you concatenate a task description with column values to give the model both the instruction and the data context.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key difference from &lt;code&gt;AI_CLASSIFY&lt;/code&gt; is that &lt;code&gt;AI_COMPLETE&lt;/code&gt; returns free-text output. There&apos;s no array of allowed values. The LLM generates whatever text the prompt asks for: a summary, a translation, a paragraph, a sentence, or a structured response.&lt;/p&gt;
&lt;p&gt;This flexibility is both the strength and the risk. A well-crafted prompt produces consistent, useful output. A vague prompt produces inconsistent results. Prompt engineering matters here more than with classification.&lt;/p&gt;
&lt;h2&gt;Step 1: Create Your Folder Structure&lt;/h2&gt;
&lt;p&gt;Open the &lt;strong&gt;SQL Runner&lt;/strong&gt; from the left sidebar in Dremio Cloud:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE FOLDER IF NOT EXISTS aicompleteexp;
CREATE FOLDER IF NOT EXISTS aicompleteexp.catalog_data;
CREATE FOLDER IF NOT EXISTS aicompleteexp.bronze;
CREATE FOLDER IF NOT EXISTS aicompleteexp.silver;
CREATE FOLDER IF NOT EXISTS aicompleteexp.gold;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 2: Seed Your Sample Data&lt;/h2&gt;
&lt;h3&gt;Products Table&lt;/h3&gt;
&lt;p&gt;This table simulates a SaaS product catalog with technical descriptions, pricing tiers, and categories. These descriptions are the raw material that &lt;code&gt;AI_COMPLETE&lt;/code&gt; will use to generate marketing copy and summaries.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE aicompleteexp.catalog_data.products (
  product_id INT,
  product_name VARCHAR,
  category VARCHAR,
  description VARCHAR,
  price_monthly DECIMAL(10,2),
  launch_date DATE,
  target_audience VARCHAR
);

INSERT INTO aicompleteexp.catalog_data.products VALUES
(1, &apos;CloudSync Pro&apos;, &apos;Storage&apos;, &apos;Enterprise file synchronization platform supporting real-time sync across Windows Mac and Linux with conflict resolution selective sync and 256-bit AES encryption at rest and in transit&apos;, 29.99, &apos;2023-03-15&apos;, &apos;IT Teams&apos;),
(2, &apos;DataVault Enterprise&apos;, &apos;Security&apos;, &apos;Zero-knowledge encrypted cloud storage with SOC 2 Type II certification automated backup deduplication granular access controls and 99.99% uptime SLA for regulated industries&apos;, 89.99, &apos;2022-11-01&apos;, &apos;Compliance Officers&apos;),
(3, &apos;QuickReport&apos;, &apos;Analytics&apos;, &apos;Self-service business intelligence tool with drag-and-drop report builder 50+ chart types scheduled email delivery PDF export and REST API for automated report generation&apos;, 49.99, &apos;2024-01-20&apos;, &apos;Business Analysts&apos;),
(4, &apos;DevPipeline&apos;, &apos;DevOps&apos;, &apos;CI/CD platform with parallel build execution Docker and Kubernetes native deployment auto-scaling runners built-in secret management and integration with GitHub GitLab and Bitbucket&apos;, 79.99, &apos;2023-07-10&apos;, &apos;Engineering Teams&apos;),
(5, &apos;MailForge&apos;, &apos;Marketing&apos;, &apos;Email marketing automation platform with AI-powered subject line optimization A/B testing dynamic content personalization and real-time deliverability monitoring across 50+ ISPs&apos;, 39.99, &apos;2024-05-01&apos;, &apos;Marketing Teams&apos;),
(6, &apos;HelpDesk360&apos;, &apos;Support&apos;, &apos;Omnichannel customer support platform supporting email chat phone and social media with SLA tracking auto-routing knowledge base integration and customer satisfaction scoring&apos;, 59.99, &apos;2023-09-15&apos;, &apos;Support Managers&apos;),
(7, &apos;FormBuilder&apos;, &apos;Productivity&apos;, &apos;No-code form and survey creation tool with conditional logic payment collection 200+ templates analytics dashboard and WCAG 2.1 AA accessibility compliance&apos;, 19.99, &apos;2024-02-28&apos;, &apos;Operations Teams&apos;),
(8, &apos;APIGateway Pro&apos;, &apos;Infrastructure&apos;, &apos;API management platform with rate limiting OAuth 2.0 authentication request transformation caching analytics dashboard and support for REST GraphQL and gRPC protocols&apos;, 99.99, &apos;2023-01-05&apos;, &apos;Platform Engineers&apos;),
(9, &apos;InventoryTrack&apos;, &apos;Commerce&apos;, &apos;Multi-warehouse inventory management system with barcode scanning lot tracking reorder alerts multi-currency support and integration with Shopify WooCommerce and Amazon&apos;, 44.99, &apos;2024-04-10&apos;, &apos;E-commerce Managers&apos;),
(10, &apos;TeamBoard&apos;, &apos;Collaboration&apos;, &apos;Visual project management platform with Kanban Gantt and timeline views time tracking resource allocation dependencies and Slack Microsoft Teams integration&apos;, 24.99, &apos;2023-06-20&apos;, &apos;Project Managers&apos;),
(11, &apos;SecureSign&apos;, &apos;Legal&apos;, &apos;Electronic signature platform with legally binding signatures audit trails multi-party signing workflows template library and compliance with eIDAS UETA and ESIGN regulations&apos;, 34.99, &apos;2024-03-01&apos;, &apos;Legal Teams&apos;),
(12, &apos;DataStream&apos;, &apos;Data&apos;, &apos;Real-time data pipeline platform supporting Kafka Pulsar and Kinesis with schema registry exactly-once processing dead letter queues and built-in data quality checks&apos;, 149.99, &apos;2023-04-18&apos;, &apos;Data Engineers&apos;),
(13, &apos;AdOptimizer&apos;, &apos;Marketing&apos;, &apos;Cross-channel advertising platform with automated bid management audience segmentation attribution modeling creative testing and budget pacing across Google Facebook and LinkedIn&apos;, 199.99, &apos;2024-06-15&apos;, &apos;Performance Marketers&apos;),
(14, &apos;ContractManager&apos;, &apos;Legal&apos;, &apos;Contract lifecycle management platform with AI-assisted clause extraction version tracking approval workflows obligation monitoring and integration with Salesforce and HubSpot&apos;, 69.99, &apos;2023-08-22&apos;, &apos;Legal Operations&apos;),
(15, &apos;LogInsight&apos;, &apos;Infrastructure&apos;, &apos;Log aggregation and analysis platform with full-text search pattern detection anomaly alerts custom dashboards and retention policies supporting up to 10TB daily ingestion&apos;, 119.99, &apos;2023-02-14&apos;, &apos;SRE Teams&apos;),
(16, &apos;PayFlow&apos;, &apos;Finance&apos;, &apos;Payment processing platform with support for 135 currencies PCI DSS Level 1 compliance recurring billing invoice generation and fraud detection using ML models&apos;, 0.00, &apos;2024-01-10&apos;, &apos;Finance Teams&apos;),
(17, &apos;ChatAssist&apos;, &apos;Support&apos;, &apos;AI-powered chatbot platform with natural language understanding intent classification handoff to human agents conversation analytics and multi-language support for 40+ languages&apos;, 74.99, &apos;2024-07-01&apos;, &apos;Customer Experience&apos;),
(18, &apos;SchedulePro&apos;, &apos;HR&apos;, &apos;Employee scheduling platform with shift management availability tracking overtime calculation labor cost forecasting and integration with ADP Workday and BambooHR payroll systems&apos;, 14.99, &apos;2023-11-05&apos;, &apos;HR Managers&apos;),
(19, &apos;CloudBackup&apos;, &apos;Storage&apos;, &apos;Automated cloud backup solution with incremental backups point-in-time recovery cross-region replication ransomware protection and support for AWS Azure and GCP workloads&apos;, 54.99, &apos;2023-05-30&apos;, &apos;IT Administrators&apos;),
(20, &apos;DesignHub&apos;, &apos;Productivity&apos;, &apos;Collaborative design platform with real-time co-editing version history component libraries handoff-to-dev specs and integration with Figma Sketch and Adobe XD import&apos;, 29.99, &apos;2024-08-10&apos;, &apos;Design Teams&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Sales Data Table&lt;/h3&gt;
&lt;p&gt;This table tracks monthly sales performance for each product, giving us the raw numbers that &lt;code&gt;AI_COMPLETE&lt;/code&gt; will summarize into narrative insights.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE aicompleteexp.catalog_data.sales_data (
  sale_id INT,
  product_id INT,
  month_year VARCHAR,
  units_sold INT,
  revenue DECIMAL(12,2),
  new_customers INT,
  churned_customers INT,
  region VARCHAR
);

INSERT INTO aicompleteexp.catalog_data.sales_data VALUES
(1, 1, &apos;2025-07&apos;, 342, 10256.58, 89, 12, &apos;North America&apos;),
(2, 1, &apos;2025-08&apos;, 389, 11666.11, 102, 8, &apos;North America&apos;),
(3, 1, &apos;2025-09&apos;, 415, 12446.85, 95, 15, &apos;Europe&apos;),
(4, 2, &apos;2025-07&apos;, 156, 14037.44, 34, 5, &apos;North America&apos;),
(5, 2, &apos;2025-08&apos;, 178, 16017.22, 41, 3, &apos;Europe&apos;),
(6, 2, &apos;2025-09&apos;, 201, 18087.99, 52, 7, &apos;North America&apos;),
(7, 3, &apos;2025-07&apos;, 267, 13346.33, 73, 18, &apos;North America&apos;),
(8, 3, &apos;2025-08&apos;, 234, 11697.66, 58, 22, &apos;Europe&apos;),
(9, 3, &apos;2025-09&apos;, 298, 14894.02, 81, 14, &apos;Asia Pacific&apos;),
(10, 4, &apos;2025-07&apos;, 123, 9837.77, 28, 4, &apos;North America&apos;),
(11, 4, &apos;2025-08&apos;, 145, 11598.55, 35, 6, &apos;Europe&apos;),
(12, 4, &apos;2025-09&apos;, 167, 13358.33, 42, 3, &apos;North America&apos;),
(13, 5, &apos;2025-07&apos;, 445, 17795.55, 112, 25, &apos;North America&apos;),
(14, 5, &apos;2025-08&apos;, 478, 19115.22, 98, 30, &apos;Europe&apos;),
(15, 5, &apos;2025-09&apos;, 512, 20475.88, 134, 19, &apos;Asia Pacific&apos;),
(16, 6, &apos;2025-07&apos;, 198, 11877.02, 45, 11, &apos;North America&apos;),
(17, 6, &apos;2025-08&apos;, 212, 12717.88, 52, 8, &apos;Europe&apos;),
(18, 6, &apos;2025-09&apos;, 189, 11331.11, 38, 16, &apos;North America&apos;),
(19, 7, &apos;2025-07&apos;, 567, 11334.33, 145, 32, &apos;North America&apos;),
(20, 7, &apos;2025-08&apos;, 612, 12234.88, 160, 28, &apos;Europe&apos;),
(21, 7, &apos;2025-09&apos;, 589, 11774.11, 138, 35, &apos;Asia Pacific&apos;),
(22, 8, &apos;2025-07&apos;, 89, 8899.11, 15, 2, &apos;North America&apos;),
(23, 8, &apos;2025-08&apos;, 95, 9499.05, 18, 1, &apos;Europe&apos;),
(24, 8, &apos;2025-09&apos;, 102, 10198.98, 22, 3, &apos;North America&apos;),
(25, 9, &apos;2025-07&apos;, 312, 14035.88, 78, 14, &apos;North America&apos;),
(26, 9, &apos;2025-08&apos;, 287, 12911.13, 65, 19, &apos;Europe&apos;),
(27, 9, &apos;2025-09&apos;, 345, 15520.55, 92, 11, &apos;Asia Pacific&apos;),
(28, 10, &apos;2025-07&apos;, 234, 5847.66, 67, 20, &apos;North America&apos;),
(29, 10, &apos;2025-08&apos;, 256, 6397.44, 72, 15, &apos;Europe&apos;),
(30, 10, &apos;2025-09&apos;, 278, 6946.22, 80, 18, &apos;Asia Pacific&apos;),
(31, 11, &apos;2025-07&apos;, 189, 6613.11, 48, 9, &apos;North America&apos;),
(32, 11, &apos;2025-08&apos;, 201, 7032.99, 55, 7, &apos;Europe&apos;),
(33, 11, &apos;2025-09&apos;, 223, 7802.77, 62, 5, &apos;North America&apos;),
(34, 12, &apos;2025-07&apos;, 67, 10049.33, 12, 1, &apos;North America&apos;),
(35, 12, &apos;2025-08&apos;, 78, 11699.22, 16, 2, &apos;Europe&apos;),
(36, 12, &apos;2025-09&apos;, 82, 12299.18, 19, 1, &apos;North America&apos;),
(37, 13, &apos;2025-07&apos;, 134, 26793.66, 28, 6, &apos;North America&apos;),
(38, 13, &apos;2025-08&apos;, 145, 28993.55, 32, 4, &apos;Europe&apos;),
(39, 13, &apos;2025-09&apos;, 167, 33393.33, 41, 8, &apos;North America&apos;),
(40, 14, &apos;2025-07&apos;, 112, 7838.88, 25, 5, &apos;North America&apos;),
(41, 14, &apos;2025-08&apos;, 128, 8959.72, 30, 3, &apos;Europe&apos;),
(42, 14, &apos;2025-09&apos;, 145, 10149.55, 38, 4, &apos;North America&apos;),
(43, 15, &apos;2025-07&apos;, 56, 6719.44, 10, 2, &apos;North America&apos;),
(44, 15, &apos;2025-08&apos;, 62, 7439.38, 13, 1, &apos;Europe&apos;),
(45, 15, &apos;2025-09&apos;, 71, 8519.29, 17, 2, &apos;North America&apos;),
(46, 16, &apos;2025-07&apos;, 890, 0.00, 234, 45, &apos;North America&apos;),
(47, 16, &apos;2025-08&apos;, 1023, 0.00, 267, 38, &apos;Europe&apos;),
(48, 16, &apos;2025-09&apos;, 1156, 0.00, 301, 52, &apos;Asia Pacific&apos;),
(49, 17, &apos;2025-07&apos;, 145, 10873.55, 38, 8, &apos;North America&apos;),
(50, 17, &apos;2025-08&apos;, 167, 12522.33, 45, 6, &apos;Europe&apos;),
(51, 17, &apos;2025-09&apos;, 189, 14173.11, 52, 10, &apos;Asia Pacific&apos;),
(52, 18, &apos;2025-07&apos;, 423, 6341.77, 110, 28, &apos;North America&apos;),
(53, 18, &apos;2025-08&apos;, 456, 6836.44, 118, 22, &apos;Europe&apos;),
(54, 18, &apos;2025-09&apos;, 489, 7331.11, 125, 30, &apos;Asia Pacific&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 3: Build Bronze Views&lt;/h2&gt;
&lt;p&gt;Bronze views standardize column names and cast dates to timestamps. No business logic at this layer.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aicompleteexp.bronze.v_products AS
SELECT
  product_id,
  product_name,
  category,
  description,
  price_monthly,
  CAST(launch_date AS TIMESTAMP) AS launch_timestamp,
  target_audience
FROM aicompleteexp.catalog_data.products;

CREATE OR REPLACE VIEW aicompleteexp.bronze.v_sales AS
SELECT
  sale_id,
  product_id,
  month_year,
  units_sold,
  revenue,
  new_customers,
  churned_customers,
  region
FROM aicompleteexp.catalog_data.sales_data;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 4: Build Silver Views&lt;/h2&gt;
&lt;p&gt;This Silver view aggregates sales performance per product across all months, giving us total revenue, total units, average deal size, net customer growth, and growth rate. The &lt;code&gt;AI_COMPLETE&lt;/code&gt; function will use these metrics to generate narrative summaries.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aicompleteexp.silver.v_product_performance AS
SELECT
  p.product_id,
  p.product_name,
  p.category,
  p.description,
  p.price_monthly,
  p.target_audience,
  COALESCE(SUM(s.units_sold), 0) AS total_units,
  COALESCE(SUM(s.revenue), 0) AS total_revenue,
  COALESCE(SUM(s.new_customers), 0) AS total_new_customers,
  COALESCE(SUM(s.churned_customers), 0) AS total_churned,
  COALESCE(SUM(s.new_customers), 0) - COALESCE(SUM(s.churned_customers), 0) AS net_customer_growth,
  CASE
    WHEN COALESCE(SUM(s.units_sold), 0) &amp;gt; 0
    THEN ROUND(COALESCE(SUM(s.revenue), 0) / SUM(s.units_sold), 2)
    ELSE 0
  END AS avg_revenue_per_unit,
  COUNT(DISTINCT s.region) AS regions_active
FROM aicompleteexp.bronze.v_products p
LEFT JOIN aicompleteexp.bronze.v_sales s ON p.product_id = s.product_id
GROUP BY p.product_id, p.product_name, p.category, p.description, p.price_monthly, p.target_audience;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 5: Build Gold Views with AI_COMPLETE&lt;/h2&gt;
&lt;h3&gt;Gold View 1: Executive Product Summaries&lt;/h3&gt;
&lt;p&gt;This view generates a one-sentence executive summary for each product based on its sales performance. Product managers use these summaries in weekly reports without manually writing them.&lt;/p&gt;
&lt;p&gt;The prompt includes specific data points (revenue, units, customer growth) so the LLM produces factual summaries rather than generic descriptions.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aicompleteexp.gold.v_product_summaries AS
SELECT
  product_id,
  product_name,
  category,
  total_revenue,
  total_units,
  net_customer_growth,
  AI_COMPLETE(
    &apos;Write a single-sentence executive summary for this product. Be specific with numbers. Product: &apos;
    || product_name
    || &apos;. Category: &apos; || category
    || &apos;. Total revenue: $&apos; || CAST(total_revenue AS VARCHAR)
    || &apos;. Units sold: &apos; || CAST(total_units AS VARCHAR)
    || &apos;. Net customer growth: &apos; || CAST(net_customer_growth AS VARCHAR)
    || &apos;. Target audience: &apos; || target_audience
  ) AS executive_summary
FROM aicompleteexp.silver.v_product_performance;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Gold View 2: Marketing Description Generator&lt;/h3&gt;
&lt;p&gt;This view transforms technical product descriptions into customer-facing marketing copy. The LLM rewrites the description in a style that emphasizes benefits rather than features, suitable for a product landing page.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aicompleteexp.gold.v_marketing_copy AS
SELECT
  product_id,
  product_name,
  category,
  description AS technical_description,
  price_monthly,
  target_audience,
  AI_COMPLETE(
    &apos;Rewrite this technical product description as a compelling 2-3 sentence marketing paragraph for a product landing page. Focus on benefits not features. Avoid buzzwords like transformative or revolutionary. Product: &apos;
    || product_name
    || &apos;. Technical description: &apos; || description
    || &apos;. Price: $&apos; || CAST(price_monthly AS VARCHAR) || &apos;/month&apos;
    || &apos;. Target audience: &apos; || target_audience
  ) AS marketing_description
FROM aicompleteexp.bronze.v_products;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Gold View 3: Translated Descriptions&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;AI_COMPLETE&lt;/code&gt; handles translation by including the target language in the prompt. This view generates Spanish translations of product descriptions for a localization initiative.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aicompleteexp.gold.v_spanish_catalog AS
SELECT
  product_id,
  product_name,
  description AS english_description,
  AI_COMPLETE(
    &apos;Translate this product description to Spanish. Return only the Spanish text, no explanations: &apos; || description
  ) AS spanish_description
FROM aicompleteexp.bronze.v_products;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Choosing the Right Model&lt;/h3&gt;
&lt;p&gt;For summarization tasks, you can specify a model optimized for speed or quality:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Use a specific high-quality model for executive summaries
SELECT
  product_name,
  AI_COMPLETE(
    &apos;openai.gpt-4o&apos;,
    &apos;Write a brief executive summary: Product &apos; || product_name
    || &apos; generated $&apos; || CAST(total_revenue AS VARCHAR) || &apos; in revenue&apos;
  ) AS summary
FROM aicompleteexp.silver.v_product_performance;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Prompt Engineering Patterns&lt;/h3&gt;
&lt;p&gt;The quality of &lt;code&gt;AI_COMPLETE&lt;/code&gt; output depends heavily on prompt structure. Here are patterns that produce consistent results:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Be specific about format:&lt;/strong&gt; &amp;quot;Write a single-sentence summary&amp;quot; produces better output than &amp;quot;Summarize this.&amp;quot; Specify the expected length and format.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Include constraints:&lt;/strong&gt; &amp;quot;Avoid buzzwords like transformative or revolutionary&amp;quot; steers the LLM away from generic marketing language. &amp;quot;Return only the Spanish text, no explanations&amp;quot; prevents the model from adding unwanted context.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Provide data context:&lt;/strong&gt; Concatenate actual numbers into the prompt. &amp;quot;Total revenue: $34,000&amp;quot; gives the LLM facts to work with, reducing hallucination. Never ask the LLM to calculate; provide pre-computed metrics and ask it to narrate them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Test with LIMIT:&lt;/strong&gt; Before running &lt;code&gt;AI_COMPLETE&lt;/code&gt; on your full dataset, test with &lt;code&gt;LIMIT 5&lt;/code&gt; to check output quality and token costs:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT product_name, AI_COMPLETE(&apos;Summarize in one sentence: &apos; || description)
FROM aicompleteexp.bronze.v_products
LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Persisting Results with CTAS&lt;/h2&gt;
&lt;p&gt;LLM calls cost tokens on every execution. For dashboards or reports that display generated summaries, materialize the results:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE aicompleteexp.gold.product_summaries_materialized AS
SELECT * FROM aicompleteexp.gold.v_product_summaries;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Refresh this table on a schedule (weekly, after each sales data update) to keep summaries current without running LLM calls on every dashboard load.&lt;/p&gt;
&lt;h2&gt;Step 6: Enable AI-Generated Wikis and Tags&lt;/h2&gt;
&lt;p&gt;Add metadata context to your Gold views so the AI Agent can answer questions about your enriched catalog:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;Admin&lt;/strong&gt; in the left sidebar, then go to &lt;strong&gt;Project Settings&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Select the &lt;strong&gt;Preferences&lt;/strong&gt; tab.&lt;/li&gt;
&lt;li&gt;Scroll to the &lt;strong&gt;AI&lt;/strong&gt; section and enable &lt;strong&gt;Generate Wikis and Labels&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Go to the &lt;strong&gt;Catalog&lt;/strong&gt; and navigate to your Gold views under &lt;code&gt;aicompleteexp.gold&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Edit&lt;/strong&gt; button (pencil icon) next to each view.&lt;/li&gt;
&lt;li&gt;In the &lt;strong&gt;Details&lt;/strong&gt; tab, click &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Repeat for all Gold views.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To enhance the generated wiki, copy it into the AI Agent and ask for improvements. For example: &amp;quot;Add context explaining that the executive_summary column is generated by an LLM using actual revenue and customer data, and that summaries are refreshed weekly after the sales data pipeline runs.&amp;quot;&lt;/p&gt;
&lt;h2&gt;Step 7: Ask Questions with the AI Agent&lt;/h2&gt;
&lt;p&gt;Navigate to the AI Agent and try these prompts:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;Which products have the highest net customer growth?&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent queries &lt;code&gt;v_product_summaries&lt;/code&gt;, sorts by &lt;code&gt;net_customer_growth&lt;/code&gt;, and returns the top products with their AI-generated summaries.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;Show me a chart of total revenue by product category&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent groups products by &lt;code&gt;category&lt;/code&gt; in &lt;code&gt;v_product_performance&lt;/code&gt;, sums revenue, and generates a bar chart showing which categories drive the most revenue.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;List all products in the Security category with their marketing descriptions&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent filters &lt;code&gt;v_marketing_copy&lt;/code&gt; for &lt;code&gt;category = &apos;Security&apos;&lt;/code&gt; and returns product names alongside the LLM-generated marketing paragraphs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;Create a chart comparing new customers vs churned customers by product&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent queries &lt;code&gt;v_product_performance&lt;/code&gt; and creates a grouped bar chart showing customer acquisition and churn side by side, making it easy to spot products with healthy vs. concerning net growth.&lt;/p&gt;
&lt;h2&gt;Why Apache Iceberg Matters&lt;/h2&gt;
&lt;p&gt;Your materialized summary tables are Apache Iceberg tables in the built-in Open Catalog. This means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Time travel:&lt;/strong&gt; Compare this week&apos;s AI-generated summaries with last week&apos;s to see how the narrative changed as sales data evolved&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema evolution:&lt;/strong&gt; Add new generated columns (like translations to additional languages) without rewriting existing data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ACID transactions:&lt;/strong&gt; CTAS jobs write atomically; dashboards never see partial results&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Iceberg vs. Federated for AI_COMPLETE Workloads&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Keep data federated when:&lt;/strong&gt; Your source data updates frequently and you want the latest products or sales figures in real-time queries. Use manual Reflections to cache results.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg when:&lt;/strong&gt; You&apos;re generating summaries on historical data or building a curated catalog of marketing copy. CTAS materializes the generated text once, and Iceberg&apos;s automated performance management (compaction, manifest optimization) keeps the table fast as it grows.&lt;/p&gt;
&lt;h2&gt;Next Steps&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Connect your real product database&lt;/strong&gt; : replace simulated tables with federated connections to your actual catalog and CRM&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generate descriptions in multiple languages&lt;/strong&gt; : create additional Gold views with French, German, or Japanese translations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Add FGAC&lt;/strong&gt; : mask revenue numbers in generated summaries for roles that shouldn&apos;t see financial data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Build Reflections&lt;/strong&gt; : create Reflections on materialized tables to accelerate dashboard queries at zero LLM cost&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If your team spends time manually writing product summaries, translating content, or creating executive briefings from raw data, &lt;code&gt;AI_COMPLETE&lt;/code&gt; automates that work inside the same SQL engine where your data already lives. Write a prompt, run a query, and get your generated text in the same governed platform where everything else runs.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=ai-complete-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and start generating insights with SQL.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect MySQL to Dremio Cloud: Federated Analytics Without ETL</title><link>https://iceberglakehouse.com/posts/2026-03-connector-mysql/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-mysql/</guid><description>
MySQL runs more web applications, SaaS platforms, and e-commerce backends than any other database. It&apos;s fast for transactional reads and writes, but ...</description><pubDate>Sun, 01 Mar 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;MySQL runs more web applications, SaaS platforms, and e-commerce backends than any other database. It&apos;s fast for transactional reads and writes, but it becomes a bottleneck when your data team needs to run analytical queries, join MySQL data with other sources, or build dashboards that don&apos;t compete with application traffic.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects directly to MySQL and queries it in place. Your data stays where it is. Dremio pushes filters (called predicate pushdowns) to MySQL when possible, joins MySQL data with any other connected source, and accelerates repeated queries with pre-computed Reflections so your production database isn&apos;t hit by every dashboard refresh.&lt;/p&gt;
&lt;p&gt;This guide covers everything from prerequisites to federated queries across MySQL and your other data sources.&lt;/p&gt;
&lt;h2&gt;Why MySQL Users Need Dremio&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Analytics compete with application traffic.&lt;/strong&gt; MySQL was built for OLTP (Online Transaction Processing) : fast inserts, updates, and single-row lookups. Analytical queries that scan millions of rows, compute aggregations, or join large tables create lock contention and slow down application responses. Dremio&apos;s Reflections solve this: after the first query, analytical workloads hit Dremio&apos;s pre-computed cache instead of MySQL.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data is siloed.&lt;/strong&gt; Your order data is in MySQL, customer engagement data is in MongoDB, and marketing attribution data is in S3. Joining these requires building ETL pipelines that extract, transform, and load data into a central warehouse. Dremio eliminates this by querying each source in place and joining the results in its query engine. One SQL query, multiple sources.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Read replicas are expensive and complex.&lt;/strong&gt; The common workaround for MySQL analytics is creating a read replica. This adds infrastructure cost, replication lag, and operational complexity. Dremio&apos;s Reflections provide the same benefit (offloading analytical reads) without a separate database instance. The query optimizer transparently serves results from Reflections when they match.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No built-in semantic layer.&lt;/strong&gt; MySQL tables have raw column names and no business context. Dremio lets you create views with business logic (like defining what &amp;quot;active customer&amp;quot; means), attach wiki descriptions and labels to those views, and then let the AI Agent answer questions in plain English based on that context.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;p&gt;Before connecting MySQL to Dremio Cloud, confirm you have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;MySQL hostname or IP address&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port number&lt;/strong&gt; : MySQL defaults to &lt;code&gt;3306&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Username and password&lt;/strong&gt; : a MySQL user with &lt;code&gt;SELECT&lt;/code&gt; privileges on the tables you want to query&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; : port &lt;code&gt;3306&lt;/code&gt; must be reachable from Dremio Cloud. Open the port in your AWS Security Group, Azure NSG, or firewall configuration&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; : &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-mysql-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect MySQL to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the MySQL Source&lt;/h3&gt;
&lt;p&gt;In the Dremio console, click the &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; button in the left sidebar, then select &lt;strong&gt;MySQL&lt;/strong&gt; under database sources. Alternatively, go to &lt;strong&gt;Databases&lt;/strong&gt; and click &lt;strong&gt;Add database&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure General Settings&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; A descriptive identifier (e.g., &lt;code&gt;ecommerce-mysql&lt;/code&gt;). This name appears in SQL queries as the source prefix. Cannot include &lt;code&gt;/&lt;/code&gt;, &lt;code&gt;:&lt;/code&gt;, &lt;code&gt;[&lt;/code&gt;, or &lt;code&gt;]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; Your MySQL server&apos;s hostname (e.g., &lt;code&gt;my-rds-instance.abc123.us-east-1.rds.amazonaws.com&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port:&lt;/strong&gt; Default &lt;code&gt;3306&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database (optional):&lt;/strong&gt; Specify a single database to connect to, or leave blank to access all databases the user can see.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Two options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No Authentication:&lt;/strong&gt; For development instances with no password requirement.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Master Credentials:&lt;/strong&gt; Enter the MySQL username and password with &lt;code&gt;SELECT&lt;/code&gt; permissions on your tables.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Configure Advanced Options&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Net write timeout (in seconds)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How long to wait for data from MySQL before dropping the connection.&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Record fetch size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rows per batch. Set to 0 for automatic.&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maximum Idle Connections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Idle connection pool size.&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection idle time (s)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds before idle connections close.&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query timeout (s)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Maximum query execution time before cancellation.&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Properties&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom JDBC connection key-value pairs.&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;5. Set Reflection and Metadata Refresh&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reflection Refresh:&lt;/strong&gt; Controls how often Dremio re-queries MySQL to update pre-computed Reflections. For frequently changing data, set to 1-4 hours. For stable data, daily or weekly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metadata Refresh:&lt;/strong&gt; Controls how often Dremio checks for new tables or schema changes. Default is 1 hour for both discovery and details.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;6. Set Privileges and Save&lt;/h3&gt;
&lt;p&gt;Optionally restrict which Dremio users can access this source. Click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Query MySQL Data in Dremio&lt;/h2&gt;
&lt;p&gt;Once connected, browse your MySQL schemas and tables in the SQL Runner:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT order_id, customer_id, total_amount, order_date, status
FROM &amp;quot;ecommerce-mysql&amp;quot;.shop.orders
WHERE order_date &amp;gt;= &apos;2024-06-01&apos;
  AND status = &apos;completed&apos;
ORDER BY total_amount DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Dremio pushes the date filter, status filter, and sort to MySQL : only the matching rows are transferred.&lt;/p&gt;
&lt;h2&gt;Federate MySQL with Other Sources&lt;/h2&gt;
&lt;p&gt;Join MySQL order data with S3 clickstream data and PostgreSQL customer profiles in a single query:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  c.customer_name,
  c.region,
  COUNT(o.order_id) AS total_orders,
  SUM(o.total_amount) AS total_revenue,
  COUNT(DISTINCT e.session_id) AS web_sessions
FROM &amp;quot;postgres-crm&amp;quot;.public.customers c
LEFT JOIN &amp;quot;ecommerce-mysql&amp;quot;.shop.orders o
  ON c.customer_id = o.customer_id
LEFT JOIN &amp;quot;s3-analytics&amp;quot;.clickstream.sessions e
  ON c.customer_id = e.user_id
GROUP BY c.customer_name, c.region
ORDER BY total_revenue DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;No ETL. No data warehouse loading. Three sources, one query.&lt;/p&gt;
&lt;h2&gt;Build Views and Enable the AI Agent&lt;/h2&gt;
&lt;p&gt;Create business-friendly views over MySQL data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.order_summary AS
SELECT
  o.order_id,
  o.customer_id,
  o.total_amount,
  CAST(o.order_date AS TIMESTAMP) AS order_timestamp,
  o.status AS order_status,
  CASE
    WHEN o.total_amount &amp;gt; 500 THEN &apos;High Value&apos;
    WHEN o.total_amount &amp;gt; 100 THEN &apos;Medium Value&apos;
    ELSE &apos;Standard&apos;
  END AS order_tier
FROM &amp;quot;ecommerce-mysql&amp;quot;.shop.orders o
WHERE o.status IN (&apos;completed&apos;, &apos;shipped&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then navigate to the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; (pencil icon) on the view, go to the &lt;strong&gt;Details&lt;/strong&gt; tab, and click &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;. This gives Dremio&apos;s AI Agent the context it needs to answer questions like &amp;quot;How many high-value orders shipped last month?&amp;quot;&lt;/p&gt;
&lt;h2&gt;Predicate Pushdown: What Runs on MySQL&lt;/h2&gt;
&lt;p&gt;Dremio pushes a wide range of operations directly to MySQL, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Logical:&lt;/strong&gt; AND, OR, NOT&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Comparisons:&lt;/strong&gt; =, !=, &amp;lt;, &amp;gt;, &amp;lt;=, &amp;gt;=, LIKE, NOT LIKE, IS NULL, IS NOT NULL&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aggregations:&lt;/strong&gt; SUM, AVG, COUNT, MIN, MAX, STDDEV, VAR_POP&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Math:&lt;/strong&gt; ABS, CEIL, FLOOR, ROUND, MOD, SQRT, POWER, LOG, EXP&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;String:&lt;/strong&gt; CONCAT, SUBSTR, LENGTH, LOWER, UPPER, TRIM, REPLACE, REVERSE&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Date/Time:&lt;/strong&gt; DATE_ADD, DATE_SUB, DATE_TRUNC, EXTRACT, TIMESTAMPADD, TIMESTAMPDIFF&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This minimizes data transfer between MySQL and Dremio. Only the results of pushed-down operations cross the network.&lt;/p&gt;
&lt;h2&gt;Data Type Mapping&lt;/h2&gt;
&lt;p&gt;Key MySQL-to-Dremio type conversions:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;MySQL&lt;/th&gt;
&lt;th&gt;Dremio&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;INT / INTEGER&lt;/td&gt;
&lt;td&gt;INTEGER&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BIGINT&lt;/td&gt;
&lt;td&gt;BIGINT&lt;/td&gt;
&lt;td&gt;UNSIGNED converts to BIGINT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FLOAT&lt;/td&gt;
&lt;td&gt;FLOAT&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DOUBLE / REAL&lt;/td&gt;
&lt;td&gt;DOUBLE&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DECIMAL&lt;/td&gt;
&lt;td&gt;DECIMAL&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VARCHAR / TEXT / CHAR&lt;/td&gt;
&lt;td&gt;VARCHAR&lt;/td&gt;
&lt;td&gt;ENUM and SET also map to VARCHAR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DATE&lt;/td&gt;
&lt;td&gt;DATE&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DATETIME / TIMESTAMP&lt;/td&gt;
&lt;td&gt;TIMESTAMP&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TIME&lt;/td&gt;
&lt;td&gt;TIME&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BLOB / BINARY / VARBINARY&lt;/td&gt;
&lt;td&gt;VARBINARY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BIT&lt;/td&gt;
&lt;td&gt;BOOLEAN&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TINYINT / SMALLINT / MEDIUMINT&lt;/td&gt;
&lt;td&gt;INTEGER&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YEAR&lt;/td&gt;
&lt;td&gt;INTEGER&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;MySQL-specific types like &lt;code&gt;JSON&lt;/code&gt; or &lt;code&gt;GEOMETRY&lt;/code&gt; are not supported through the connector.&lt;/p&gt;
&lt;h2&gt;MySQL vs. Iceberg: When to Migrate&lt;/h2&gt;
&lt;p&gt;Keep data in MySQL when it&apos;s actively written and read by your application. Migrate historical or analytical datasets to Apache Iceberg tables in Dremio&apos;s Open Catalog when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The data doesn&apos;t change often (closed orders, historical logs)&lt;/li&gt;
&lt;li&gt;You need time travel (query the table as of any past timestamp)&lt;/li&gt;
&lt;li&gt;You want automated performance management (compaction, manifest optimization)&lt;/li&gt;
&lt;li&gt;You want Autonomous Reflections (Dremio auto-creates materializations based on query patterns)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For data that&apos;s still being written by your app, query it through the MySQL connector and create manual Reflections with a refresh schedule that matches your freshness needs.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on MySQL Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent lets business users ask questions about MySQL data in plain English. A marketing manager asks &amp;quot;How many high-value orders shipped last month?&amp;quot; and the Agent generates the correct SQL by reading your view&apos;s wiki descriptions. It understands &amp;quot;high-value&amp;quot; means &lt;code&gt;total_amount &amp;gt; 500&lt;/code&gt; and &amp;quot;shipped&amp;quot; means &lt;code&gt;status = &apos;shipped&apos;&lt;/code&gt; because you defined those in the semantic layer.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; connects external AI chat clients (Claude, ChatGPT) to your MySQL data through Dremio with OAuth authentication:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth application in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs (e.g., &lt;code&gt;https://claude.ai/api/mcp/auth_callback&lt;/code&gt; for Claude)&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;An e-commerce manager asks Claude &amp;quot;What&apos;s our average order value by region this quarter from MySQL?&amp;quot; and gets governed, accurate results : no SQL required.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;p&gt;Use AI directly in queries against MySQL data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify orders by likely customer intent
SELECT
  order_id,
  total_amount,
  order_tier,
  AI_CLASSIFY(
    &apos;Based on this order, classify the likely purchase motivation&apos;,
    &apos;Amount: $&apos; || CAST(total_amount AS VARCHAR) || &apos;, Status: &apos; || order_status || &apos;, Tier: &apos; || order_tier,
    ARRAY[&apos;Impulse Buy&apos;, &apos;Planned Purchase&apos;, &apos;Bulk Order&apos;, &apos;Reorder&apos;]
  ) AS purchase_motivation
FROM analytics.gold.order_summary
WHERE order_status = &apos;completed&apos;;

-- Generate order analysis summaries
SELECT
  DATE_TRUNC(&apos;week&apos;, order_timestamp) AS week,
  COUNT(*) AS orders,
  SUM(total_amount) AS revenue,
  AI_GENERATE(
    &apos;Write a one-sentence weekly sales summary&apos;,
    &apos;Orders: &apos; || CAST(COUNT(*) AS VARCHAR) || &apos;, Revenue: $&apos; || CAST(SUM(total_amount) AS VARCHAR) || &apos;, High Value Orders: &apos; || CAST(SUM(CASE WHEN order_tier = &apos;High Value&apos; THEN 1 ELSE 0 END) AS VARCHAR)
  ) AS weekly_summary
FROM analytics.gold.order_summary
GROUP BY DATE_TRUNC(&apos;week&apos;, order_timestamp)
ORDER BY week DESC
LIMIT 12;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;AI_CLASSIFY&lt;/code&gt; runs LLM inference inline, categorizing each order. &lt;code&gt;AI_GENERATE&lt;/code&gt; produces narrative summaries. Both enrich MySQL data with AI in real time.&lt;/p&gt;
&lt;h2&gt;Reflections for Performance&lt;/h2&gt;
&lt;p&gt;MySQL is optimized for OLTP : row-level reads and writes. Analytical aggregation queries compete with application workloads. Dremio&apos;s Reflections offload these:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the view in the &lt;strong&gt;Catalog&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Select columns and set the &lt;strong&gt;Refresh Interval&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;BI tools get sub-second responses from Reflections. MySQL focuses on serving your application.&lt;/p&gt;
&lt;h2&gt;Governance on MySQL Data&lt;/h2&gt;
&lt;p&gt;MySQL has database-level grants but no column masking or row-level filtering. Dremio&apos;s Fine-Grained Access Control (FGAC) adds these:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask customer email, payment details, or pricing from specific roles. A marketing analyst sees order counts but not individual customer data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Restrict data by store, region, or department based on user role.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies across MySQL, PostgreSQL, S3, and all other sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, and MCP Server.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic data access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; for SQL-based transformations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query MySQL data from their IDE. Ask Copilot &amp;quot;Show me high-value orders from MySQL this week&amp;quot; and get SQL generated from your semantic layer.&lt;/p&gt;
&lt;h2&gt;When to Keep Data in MySQL vs. Migrate&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in MySQL:&lt;/strong&gt; Transactional data for active applications, data with application-level foreign key constraints, operational data where real-time writes matter.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical order archives, reporting data, data consumed by non-application tools, datasets where MySQL replication lag creates analytics latency. Migrated Iceberg tables get automatic compaction, time travel, and Autonomous Reflections.&lt;/p&gt;
&lt;p&gt;For data staying in MySQL, create manual Reflections to offload analytical queries. For migrated Iceberg data, Dremio handles optimization automatically.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;MySQL users don&apos;t need to build ETL pipelines or provision a data warehouse to get analytical value from their data. Dremio Cloud connects to MySQL in minutes and gives you federation, acceleration, governance, and AI analytics on top.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-mysql-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your MySQL databases.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Classify Your Data with SQL: A Hands-On Guide to Dremio&apos;s AI_CLASSIFY Function</title><link>https://iceberglakehouse.com/posts/2026-03-ai-ai-classify/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-ai-ai-classify/</guid><description>
Most classification workflows require exporting data to Python, running a model, and importing results back into your warehouse. Dremio&apos;s `AI_CLASSIF...</description><pubDate>Sun, 01 Mar 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Most classification workflows require exporting data to Python, running a model, and importing results back into your warehouse. Dremio&apos;s &lt;code&gt;AI_CLASSIFY&lt;/code&gt; function eliminates that entire pipeline. You write a SELECT statement, pass in your text and your categories, and the LLM assigns a label. The classified data stays in your lakehouse, governed and queryable immediately.&lt;/p&gt;
&lt;p&gt;This tutorial walks you through a complete classification pipeline using a fresh Dremio Cloud account. You&apos;ll create sample customer feedback data, build a medallion architecture (Bronze → Silver → Gold), and use &lt;code&gt;AI_CLASSIFY&lt;/code&gt; to categorize reviews by sentiment, support tickets by department, and product issues by urgency, all inside SQL.&lt;/p&gt;
&lt;h2&gt;What You&apos;ll Build&lt;/h2&gt;
&lt;p&gt;By the end of this tutorial, you&apos;ll have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A customer feedback dataset with 50+ product reviews and 50+ support tickets&lt;/li&gt;
&lt;li&gt;Bronze views that standardize raw data&lt;/li&gt;
&lt;li&gt;Silver views that join reviews with ticket information&lt;/li&gt;
&lt;li&gt;Gold views that use &lt;code&gt;AI_CLASSIFY&lt;/code&gt; to add sentiment labels, department routing, and urgency tiers&lt;/li&gt;
&lt;li&gt;An Iceberg table that persists your classified data for dashboards&lt;/li&gt;
&lt;li&gt;Wiki metadata that enables the AI Agent to answer natural language questions about your classified data&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; : &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=ai-classify-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI enabled&lt;/strong&gt; : go to Admin → Project Settings → Preferences → AI section and enable AI features&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model Provider configured&lt;/strong&gt; : Dremio provides a hosted LLM by default, or you can connect your own (OpenAI, Anthropic, Google Gemini, AWS Bedrock, Azure OpenAI) under the AI preferences&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Tables in the built-in Open Catalog use &lt;code&gt;folder.subfolder.table_name&lt;/code&gt; without a catalog prefix. External sources use &lt;code&gt;source_name.schema.table_name&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Understanding AI_CLASSIFY&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;AI_CLASSIFY&lt;/code&gt; sends text to a configured LLM and asks it to pick the best matching label from an array you provide. The function signature:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;AI_CLASSIFY(
  [model_name VARCHAR,]
  prompt VARCHAR,
  categories ARRAY&amp;lt;VARCHAR|INT|FLOAT|BOOLEAN&amp;gt;
) → VARCHAR|INT|FLOAT|BOOLEAN
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Parameters:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;model_name&lt;/strong&gt; (optional) : specify a particular model like &lt;code&gt;&apos;gpt.4o&apos;&lt;/code&gt;. Format is &lt;code&gt;modelProvider.modelName&lt;/code&gt;. If omitted, Dremio uses your default configured model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;prompt&lt;/strong&gt; : the text you want classified. This is typically a column value or a concatenation of columns that gives the LLM enough context.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;categories&lt;/strong&gt; : an &lt;code&gt;ARRAY&lt;/code&gt; of possible labels. The LLM must return one of these values. Supports &lt;code&gt;VARCHAR&lt;/code&gt;, &lt;code&gt;INT&lt;/code&gt;, &lt;code&gt;FLOAT&lt;/code&gt;, and &lt;code&gt;BOOLEAN&lt;/code&gt; types.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The return type matches the array element type. If you pass &lt;code&gt;ARRAY[&apos;Positive&apos;, &apos;Negative&apos;, &apos;Neutral&apos;]&lt;/code&gt;, you get a &lt;code&gt;VARCHAR&lt;/code&gt; back. If you pass &lt;code&gt;ARRAY[1, 2, 3, 4, 5]&lt;/code&gt;, you get an &lt;code&gt;INT&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;Step 1: Create Your Folder Structure&lt;/h2&gt;
&lt;p&gt;Open the &lt;strong&gt;SQL Runner&lt;/strong&gt; from the left sidebar in Dremio Cloud and run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE FOLDER IF NOT EXISTS aiclassifyexp;
CREATE FOLDER IF NOT EXISTS aiclassifyexp.feedback_data;
CREATE FOLDER IF NOT EXISTS aiclassifyexp.bronze;
CREATE FOLDER IF NOT EXISTS aiclassifyexp.silver;
CREATE FOLDER IF NOT EXISTS aiclassifyexp.gold;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates a namespace that simulates a customer feedback analytics pipeline with separate layers for raw data, standardized views, business logic, and final outputs.&lt;/p&gt;
&lt;h2&gt;Step 2: Seed Your Sample Data&lt;/h2&gt;
&lt;h3&gt;Customer Reviews Table&lt;/h3&gt;
&lt;p&gt;This table simulates product reviews collected from an e-commerce platform. Each review includes the customer name, product, a star rating, and the actual review text that we&apos;ll classify.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE aiclassifyexp.feedback_data.customer_reviews (
  review_id INT,
  customer_name VARCHAR,
  product_name VARCHAR,
  star_rating INT,
  review_text VARCHAR,
  review_date DATE
);

INSERT INTO aiclassifyexp.feedback_data.customer_reviews VALUES
(1, &apos;Sarah Chen&apos;, &apos;CloudSync Pro&apos;, 5, &apos;Absolutely love this product. Setup took 5 minutes and sync speeds are incredible. Best purchase this year.&apos;, &apos;2025-08-15&apos;),
(2, &apos;James Rodriguez&apos;, &apos;CloudSync Pro&apos;, 1, &apos;Terrible experience. Lost three days of data after the last update. Support was unhelpful and dismissive.&apos;, &apos;2025-08-22&apos;),
(3, &apos;Emily Watson&apos;, &apos;DataVault Enterprise&apos;, 4, &apos;Solid encryption and good performance. The UI could use some polish but the core functionality is reliable.&apos;, &apos;2025-09-01&apos;),
(4, &apos;Michael Brown&apos;, &apos;CloudSync Pro&apos;, 3, &apos;It works fine most of the time but crashes occasionally when syncing large folders. Average product.&apos;, &apos;2025-09-05&apos;),
(5, &apos;Lisa Park&apos;, &apos;DataVault Enterprise&apos;, 5, &apos;Our security team approved this after a thorough review. Encryption standards exceed our compliance requirements.&apos;, &apos;2025-09-10&apos;),
(6, &apos;David Kim&apos;, &apos;QuickReport&apos;, 2, &apos;The reports look nice but generation takes forever. For the price point there are faster alternatives.&apos;, &apos;2025-09-12&apos;),
(7, &apos;Anna Kowalski&apos;, &apos;QuickReport&apos;, 4, &apos;Great templates and easy export options. Scheduling could be more flexible but overall a good tool.&apos;, &apos;2025-09-18&apos;),
(8, &apos;Robert Taylor&apos;, &apos;CloudSync Pro&apos;, 1, &apos;Second time this month it corrupted my files during sync. Considering switching to a competitor.&apos;, &apos;2025-09-20&apos;),
(9, &apos;Maria Garcia&apos;, &apos;DataVault Enterprise&apos;, 5, &apos;Migrated 50TB without a single issue. The deduplication feature alone saved us $2000/month in storage.&apos;, &apos;2025-09-25&apos;),
(10, &apos;Tom Williams&apos;, &apos;QuickReport&apos;, 3, &apos;Decent for basic reports. Falls short on complex multi-source dashboards. Not bad, not great.&apos;, &apos;2025-10-01&apos;),
(11, &apos;Jennifer Lee&apos;, &apos;CloudSync Pro&apos;, 4, &apos;Fast reliable syncing across all our devices. The mobile app needs improvement though.&apos;, &apos;2025-10-05&apos;),
(12, &apos;Chris Martinez&apos;, &apos;DataVault Enterprise&apos;, 2, &apos;Way too complicated for a small team. We spent two weeks just on initial configuration.&apos;, &apos;2025-10-08&apos;),
(13, &apos;Rachel Adams&apos;, &apos;QuickReport&apos;, 5, &apos;Finally a reporting tool that non-technical people can use. Our marketing team builds their own reports now.&apos;, &apos;2025-10-12&apos;),
(14, &apos;Kevin Thompson&apos;, &apos;CloudSync Pro&apos;, 1, &apos;Billing issue: charged twice and it took three weeks to get a refund. Product aside the billing system is broken.&apos;, &apos;2025-10-15&apos;),
(15, &apos;Sophia Nguyen&apos;, &apos;DataVault Enterprise&apos;, 4, &apos;Strong security features and audit logging. Integration with our SSO provider was straightforward.&apos;, &apos;2025-10-20&apos;),
(16, &apos;Daniel Wilson&apos;, &apos;QuickReport&apos;, 3, &apos;Good for monthly summaries but real-time dashboards lag noticeably. Suitable for batch reporting only.&apos;, &apos;2025-10-22&apos;),
(17, &apos;Amanda Clark&apos;, &apos;CloudSync Pro&apos;, 5, &apos;Our entire team switched from Dropbox. The conflict resolution on shared files is leagues better.&apos;, &apos;2025-10-25&apos;),
(18, &apos;Brian Harris&apos;, &apos;DataVault Enterprise&apos;, 1, &apos;Critical vulnerability found in version 3.2. Support acknowledged it but the patch took 6 weeks.&apos;, &apos;2025-10-28&apos;),
(19, &apos;Michelle Lopez&apos;, &apos;QuickReport&apos;, 4, &apos;Clean interface and the PDF export quality is excellent. API access for automation would be a welcome addition.&apos;, &apos;2025-11-01&apos;),
(20, &apos;Steven Moore&apos;, &apos;CloudSync Pro&apos;, 2, &apos;Sync works but the desktop app uses 800MB of RAM just sitting in the background. Needs optimization.&apos;, &apos;2025-11-05&apos;),
(21, &apos;Laura Jackson&apos;, &apos;DataVault Enterprise&apos;, 5, &apos;Passed our SOC 2 audit partly because of DataVault detailed access logs. Worth every penny.&apos;, &apos;2025-11-08&apos;),
(22, &apos;Andrew White&apos;, &apos;QuickReport&apos;, 2, &apos;Crashed twice during a client presentation. Embarrassing and unacceptable for a paid product.&apos;, &apos;2025-11-10&apos;),
(23, &apos;Catherine Hall&apos;, &apos;CloudSync Pro&apos;, 4, &apos;Selective sync feature is a lifesaver for laptops with small drives. Smart storage management.&apos;, &apos;2025-11-15&apos;),
(24, &apos;Mark Allen&apos;, &apos;DataVault Enterprise&apos;, 3, &apos;Good product hampered by poor documentation. We figured out most features through trial and error.&apos;, &apos;2025-11-18&apos;),
(25, &apos;Jessica Young&apos;, &apos;QuickReport&apos;, 5, &apos;The scheduled email reports feature saved our ops team 10 hours per week. Simple and effective.&apos;, &apos;2025-11-20&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Support Tickets Table&lt;/h3&gt;
&lt;p&gt;This table simulates a customer support system. Each ticket has a description written by the customer, a status, and a priority that was manually assigned. We&apos;ll use &lt;code&gt;AI_CLASSIFY&lt;/code&gt; to automatically route these tickets by department.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE aiclassifyexp.feedback_data.support_tickets (
  ticket_id INT,
  customer_name VARCHAR,
  product_name VARCHAR,
  ticket_description VARCHAR,
  manual_priority VARCHAR,
  ticket_status VARCHAR,
  created_date DATE,
  resolved_date DATE
);

INSERT INTO aiclassifyexp.feedback_data.support_tickets VALUES
(1001, &apos;James Rodriguez&apos;, &apos;CloudSync Pro&apos;, &apos;Lost all synced files after update 4.2.1. Need immediate recovery assistance.&apos;, &apos;Critical&apos;, &apos;Resolved&apos;, &apos;2025-08-20&apos;, &apos;2025-08-25&apos;),
(1002, &apos;Kevin Thompson&apos;, &apos;CloudSync Pro&apos;, &apos;Charged $49.99 twice on my credit card for October subscription. Need refund for duplicate charge.&apos;, &apos;Medium&apos;, &apos;Resolved&apos;, &apos;2025-10-14&apos;, &apos;2025-11-04&apos;),
(1003, &apos;Robert Taylor&apos;, &apos;CloudSync Pro&apos;, &apos;Files corrupted during sync for the second time. Happening with files over 500MB.&apos;, &apos;High&apos;, &apos;Open&apos;, &apos;2025-09-19&apos;, NULL),
(1004, &apos;Chris Martinez&apos;, &apos;DataVault Enterprise&apos;, &apos;Cannot figure out how to configure SSO integration. Documentation references outdated menu options.&apos;, &apos;Medium&apos;, &apos;Resolved&apos;, &apos;2025-10-07&apos;, &apos;2025-10-10&apos;),
(1005, &apos;Brian Harris&apos;, &apos;DataVault Enterprise&apos;, &apos;Security scan flagged CVE-2025-1234 in version 3.2 encryption module. When will this be patched?&apos;, &apos;Critical&apos;, &apos;Resolved&apos;, &apos;2025-10-27&apos;, &apos;2025-12-08&apos;),
(1006, &apos;Andrew White&apos;, &apos;QuickReport&apos;, &apos;App crashes when rendering charts with more than 10000 data points. Happens consistently in Chrome.&apos;, &apos;High&apos;, &apos;Open&apos;, &apos;2025-11-09&apos;, NULL),
(1007, &apos;Sarah Chen&apos;, &apos;CloudSync Pro&apos;, &apos;Would love to see a Linux desktop client. Currently only Windows and Mac are supported.&apos;, &apos;Low&apos;, &apos;Open&apos;, &apos;2025-08-30&apos;, NULL),
(1008, &apos;David Kim&apos;, &apos;QuickReport&apos;, &apos;Report generation takes 45+ seconds for simple 3-page reports. Was faster in the previous version.&apos;, &apos;Medium&apos;, &apos;Open&apos;, &apos;2025-09-13&apos;, NULL),
(1009, &apos;Emily Watson&apos;, &apos;DataVault Enterprise&apos;, &apos;Need to add 50 new users to our plan. What are the volume discount options?&apos;, &apos;Low&apos;, &apos;Resolved&apos;, &apos;2025-09-03&apos;, &apos;2025-09-05&apos;),
(1010, &apos;Steven Moore&apos;, &apos;CloudSync Pro&apos;, &apos;Desktop app consuming excessive memory (800MB+). Running Windows 11 with 16GB RAM.&apos;, &apos;Medium&apos;, &apos;Open&apos;, &apos;2025-11-04&apos;, NULL),
(1011, &apos;Lisa Park&apos;, &apos;DataVault Enterprise&apos;, &apos;Can we get a custom retention policy for healthcare compliance? HIPAA requires 7-year retention.&apos;, &apos;Medium&apos;, &apos;Resolved&apos;, &apos;2025-09-12&apos;, &apos;2025-09-20&apos;),
(1012, &apos;Tom Williams&apos;, &apos;QuickReport&apos;, &apos;How do I connect QuickReport to a PostgreSQL database? Only seeing MySQL option in connectors.&apos;, &apos;Low&apos;, &apos;Resolved&apos;, &apos;2025-10-02&apos;, &apos;2025-10-03&apos;),
(1013, &apos;Mark Allen&apos;, &apos;DataVault Enterprise&apos;, &apos;API documentation has broken links on the authentication section. Pages return 404.&apos;, &apos;Low&apos;, &apos;Open&apos;, &apos;2025-11-17&apos;, NULL),
(1014, &apos;Michael Brown&apos;, &apos;CloudSync Pro&apos;, &apos;Selective sync keeps re-enabling folders I excluded. Happens after every app restart.&apos;, &apos;Medium&apos;, &apos;Open&apos;, &apos;2025-09-06&apos;, NULL),
(1015, &apos;Daniel Wilson&apos;, &apos;QuickReport&apos;, &apos;Real-time dashboard shows data that is 15 minutes stale. Expected near real-time refresh.&apos;, &apos;High&apos;, &apos;Open&apos;, &apos;2025-10-23&apos;, NULL),
(1016, &apos;Anna Kowalski&apos;, &apos;QuickReport&apos;, &apos;Can you add a dark mode option? The white background is hard on the eyes during evening work.&apos;, &apos;Low&apos;, &apos;Open&apos;, &apos;2025-09-19&apos;, NULL),
(1017, &apos;Sophia Nguyen&apos;, &apos;DataVault Enterprise&apos;, &apos;Our SSO integration broke after your last update. 200 users locked out for 4 hours.&apos;, &apos;Critical&apos;, &apos;Resolved&apos;, &apos;2025-10-21&apos;, &apos;2025-10-21&apos;),
(1018, &apos;Jennifer Lee&apos;, &apos;CloudSync Pro&apos;, &apos;Mobile app on iOS frequently logs me out. Have to re-authenticate 3-4 times per day.&apos;, &apos;Medium&apos;, &apos;Open&apos;, &apos;2025-10-06&apos;, NULL),
(1019, &apos;Rachel Adams&apos;, &apos;QuickReport&apos;, &apos;Love the product! Any plans for a Slack integration to send report summaries to channels?&apos;, &apos;Low&apos;, &apos;Open&apos;, &apos;2025-10-13&apos;, NULL),
(1020, &apos;Amanda Clark&apos;, &apos;CloudSync Pro&apos;, &apos;Conflict resolution dialog is confusing. Hard to tell which version is newer when filenames match.&apos;, &apos;Medium&apos;, &apos;Resolved&apos;, &apos;2025-10-26&apos;, &apos;2025-10-30&apos;),
(1021, &apos;Catherine Hall&apos;, &apos;CloudSync Pro&apos;, &apos;Bandwidth throttling feature needed. Sync saturates our office internet during business hours.&apos;, &apos;Medium&apos;, &apos;Open&apos;, &apos;2025-11-16&apos;, NULL),
(1022, &apos;Maria Garcia&apos;, &apos;DataVault Enterprise&apos;, &apos;Deduplication incorrectly merged two different client folders. Data was mixed across accounts.&apos;, &apos;Critical&apos;, &apos;Resolved&apos;, &apos;2025-09-26&apos;, &apos;2025-09-28&apos;),
(1023, &apos;Laura Jackson&apos;, &apos;DataVault Enterprise&apos;, &apos;Need export of all access logs for the past 12 months for our annual SOC 2 audit.&apos;, &apos;Medium&apos;, &apos;Resolved&apos;, &apos;2025-11-09&apos;, &apos;2025-11-11&apos;),
(1024, &apos;Jessica Young&apos;, &apos;QuickReport&apos;, &apos;Scheduled reports occasionally skip a week. No error notification when this happens.&apos;, &apos;High&apos;, &apos;Open&apos;, &apos;2025-11-21&apos;, NULL),
(1025, &apos;Michelle Lopez&apos;, &apos;QuickReport&apos;, &apos;Please add an API endpoint for programmatic report generation. We want to automate monthly client reports.&apos;, &apos;Low&apos;, &apos;Open&apos;, &apos;2025-11-02&apos;, NULL);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 3: Build Bronze Views&lt;/h2&gt;
&lt;p&gt;Bronze views standardize column names and data types without applying business logic. This creates a consistent foundation for downstream analysis.&lt;/p&gt;
&lt;p&gt;The reviews table needs its &lt;code&gt;DATE&lt;/code&gt; column cast to &lt;code&gt;TIMESTAMP&lt;/code&gt; for consistent joins later. The tickets table also needs date casting, and we rename &lt;code&gt;manual_priority&lt;/code&gt; to &lt;code&gt;assigned_priority&lt;/code&gt; to distinguish it from AI-generated classifications.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aiclassifyexp.bronze.v_reviews AS
SELECT
  review_id,
  customer_name,
  product_name,
  star_rating,
  review_text,
  CAST(review_date AS TIMESTAMP) AS review_timestamp
FROM aiclassifyexp.feedback_data.customer_reviews;

CREATE OR REPLACE VIEW aiclassifyexp.bronze.v_tickets AS
SELECT
  ticket_id,
  customer_name,
  product_name,
  ticket_description,
  manual_priority AS assigned_priority,
  ticket_status,
  CAST(created_date AS TIMESTAMP) AS created_timestamp,
  CAST(resolved_date AS TIMESTAMP) AS resolved_timestamp
FROM aiclassifyexp.feedback_data.support_tickets;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 4: Build Silver Views&lt;/h2&gt;
&lt;p&gt;This Silver view joins reviews with related support tickets for the same customer and product. This gives us a combined picture: what did the customer say in their review, and did they also file a support ticket? The &lt;code&gt;LEFT JOIN&lt;/code&gt; ensures we keep all reviews even if the customer never opened a ticket.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aiclassifyexp.silver.v_customer_feedback AS
SELECT
  r.review_id,
  r.customer_name,
  r.product_name,
  r.star_rating,
  r.review_text,
  r.review_timestamp,
  t.ticket_id,
  t.ticket_description,
  t.assigned_priority,
  t.ticket_status,
  t.created_timestamp AS ticket_created,
  t.resolved_timestamp AS ticket_resolved,
  CASE WHEN t.ticket_id IS NOT NULL THEN &apos;Yes&apos; ELSE &apos;No&apos; END AS has_support_ticket
FROM aiclassifyexp.bronze.v_reviews r
LEFT JOIN aiclassifyexp.bronze.v_tickets t
  ON r.customer_name = t.customer_name
  AND r.product_name = t.product_name;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 5: Build Gold Views with AI_CLASSIFY&lt;/h2&gt;
&lt;p&gt;This is where the AI functions do real work. Each Gold view applies &lt;code&gt;AI_CLASSIFY&lt;/code&gt; to categorize text that would otherwise require manual review or an external ML pipeline.&lt;/p&gt;
&lt;h3&gt;Gold View 1: Sentiment Classification&lt;/h3&gt;
&lt;p&gt;This view classifies every review as Positive, Negative, or Neutral. Instead of relying solely on star ratings (which can be inconsistent with the actual text), the LLM reads the full review and assigns a sentiment label. We concatenate the product name with the review text to give the model full context.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aiclassifyexp.gold.v_review_sentiment AS
SELECT
  review_id,
  customer_name,
  product_name,
  star_rating,
  review_text,
  review_timestamp,
  AI_CLASSIFY(
    &apos;Classify the sentiment of this product review: &apos; || review_text,
    ARRAY[&apos;Positive&apos;, &apos;Negative&apos;, &apos;Neutral&apos;]
  ) AS ai_sentiment,
  has_support_ticket
FROM aiclassifyexp.silver.v_customer_feedback;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice that we keep both &lt;code&gt;star_rating&lt;/code&gt; and &lt;code&gt;ai_sentiment&lt;/code&gt;. This lets you compare the two signals. A 3-star review with &amp;quot;Negative&amp;quot; AI sentiment suggests the customer is more frustrated than the rating alone indicates.&lt;/p&gt;
&lt;h3&gt;Gold View 2: Ticket Department Routing&lt;/h3&gt;
&lt;p&gt;This view uses &lt;code&gt;AI_CLASSIFY&lt;/code&gt; to automatically route support tickets to the right department based on the ticket description. Instead of a human reading every ticket and assigning it, the LLM reads the description and selects from four departments.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aiclassifyexp.gold.v_ticket_routing AS
SELECT
  ticket_id,
  customer_name,
  product_name,
  ticket_description,
  assigned_priority,
  ticket_status,
  created_timestamp,
  resolved_timestamp,
  AI_CLASSIFY(
    &apos;Based on this support ticket, which department should handle it: &apos; || ticket_description,
    ARRAY[&apos;Billing&apos;, &apos;Technical Support&apos;, &apos;Feature Request&apos;, &apos;Account Management&apos;]
  ) AS ai_department,
  AI_CLASSIFY(
    &apos;Rate the urgency of this support ticket: &apos; || ticket_description,
    ARRAY[&apos;Critical&apos;, &apos;High&apos;, &apos;Medium&apos;, &apos;Low&apos;]
  ) AS ai_urgency
FROM aiclassifyexp.bronze.v_tickets;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This view applies two separate &lt;code&gt;AI_CLASSIFY&lt;/code&gt; calls on each row: one for department routing and one for urgency. You can compare &lt;code&gt;ai_urgency&lt;/code&gt; against the manually assigned &lt;code&gt;assigned_priority&lt;/code&gt; to find tickets where human triage may have underestimated or overestimated severity.&lt;/p&gt;
&lt;h3&gt;Using Numeric Categories&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;AI_CLASSIFY&lt;/code&gt; also supports numeric arrays. If you want a 1-5 satisfaction score instead of text labels:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  review_id,
  review_text,
  AI_CLASSIFY(
    &apos;Rate customer satisfaction from 1 (very dissatisfied) to 5 (very satisfied): &apos; || review_text,
    ARRAY[1, 2, 3, 4, 5]
  ) AS ai_satisfaction_score
FROM aiclassifyexp.bronze.v_reviews;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The LLM returns an &lt;code&gt;INT&lt;/code&gt; because the array contains integers. This is useful when you need numeric scores for aggregation, averages, or trend analysis.&lt;/p&gt;
&lt;h2&gt;Persisting Results with CTAS&lt;/h2&gt;
&lt;p&gt;AI function calls consume LLM tokens on every query execution. For dashboards or reports that run the same classification repeatedly, materialize the results into an Iceberg table with &lt;code&gt;CREATE TABLE AS SELECT&lt;/code&gt; (CTAS):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE aiclassifyexp.gold.classified_reviews AS
SELECT * FROM aiclassifyexp.gold.v_review_sentiment;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates a physical Iceberg table with the AI classifications baked in. Subsequent queries against &lt;code&gt;classified_reviews&lt;/code&gt; are standard SQL queries with no LLM cost. Refresh the table periodically (daily, weekly) as new reviews come in by running CTAS again with &lt;code&gt;CREATE OR REPLACE TABLE&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;Managing AI Workloads&lt;/h2&gt;
&lt;p&gt;AI function queries are more resource-intensive than standard SQL. Dremio provides engine routing to isolate these workloads:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Dremio provides these routing functions for workload management:
-- query_calls_ai_functions() : returns true if the query uses AI functions
-- query_has_attribute(&apos;AI_FUNCTIONS&apos;) : same check, different syntax
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In your Dremio Cloud project settings, you can create engine routing rules that automatically direct queries containing AI functions to a dedicated engine. This prevents a large classification batch job from competing with your executive dashboards for compute resources. Set up a separate engine with appropriate scaling for AI workloads, and create a routing rule using &lt;code&gt;query_calls_ai_functions()&lt;/code&gt; to send AI queries there automatically.&lt;/p&gt;
&lt;h2&gt;Choosing Your Model Provider&lt;/h2&gt;
&lt;p&gt;The optional &lt;code&gt;model_name&lt;/code&gt; parameter lets you target specific models for different tasks:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Use a specific model for classification
SELECT AI_CLASSIFY(
  &apos;openai.gpt-4o&apos;,
  &apos;Classify this ticket: &apos; || ticket_description,
  ARRAY[&apos;Billing&apos;, &apos;Technical Support&apos;, &apos;Feature Request&apos;]
) AS department
FROM aiclassifyexp.bronze.v_tickets;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Dremio supports multiple providers: OpenAI, Anthropic, Google Gemini, AWS Bedrock, and Azure OpenAI. You configure providers in Admin → Project Settings → Preferences → AI. The format is &lt;code&gt;providerName.modelName&lt;/code&gt;, where &lt;code&gt;providerName&lt;/code&gt; is the name you gave the provider during setup.&lt;/p&gt;
&lt;p&gt;If you skip &lt;code&gt;model_name&lt;/code&gt;, Dremio uses your default model. For most classification tasks, the default model works well. Specifying a model makes sense when you need a particular model&apos;s strengths (like a smaller, faster model for simple sentiment vs. a larger model for nuanced multi-class categorization).&lt;/p&gt;
&lt;h2&gt;Step 6: Enable AI-Generated Wikis and Tags&lt;/h2&gt;
&lt;p&gt;Good metadata makes the AI Agent more accurate when answering natural language questions. Here&apos;s how to add context to your Gold views:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;Admin&lt;/strong&gt; in the left sidebar, then go to &lt;strong&gt;Project Settings&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Select the &lt;strong&gt;Preferences&lt;/strong&gt; tab.&lt;/li&gt;
&lt;li&gt;Scroll to the &lt;strong&gt;AI&lt;/strong&gt; section and enable &lt;strong&gt;Generate Wikis and Labels&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Go to the &lt;strong&gt;Catalog&lt;/strong&gt; and navigate to your Gold views under &lt;code&gt;aiclassifyexp.gold&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Edit&lt;/strong&gt; button (pencil icon) next to the desired view.&lt;/li&gt;
&lt;li&gt;In the &lt;strong&gt;Details&lt;/strong&gt; tab, find the &lt;strong&gt;Wiki&lt;/strong&gt; section and click &lt;strong&gt;Generate Wiki&lt;/strong&gt;. Do the same for the &lt;strong&gt;Tags&lt;/strong&gt; section by clicking &lt;strong&gt;Generate Tags&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Repeat for each Gold view.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To enhance the generated wiki with additional business context, copy the output into the &lt;strong&gt;AI Agent&lt;/strong&gt; on the homepage and ask it to produce an improved version in a markdown code block. For example, ask the Agent to add details like &amp;quot;Positive sentiment reviews are candidates for testimonial collection. Negative sentiment reviews with support tickets should trigger a customer success outreach.&amp;quot; Copy the Agent&apos;s refined output and paste it back into the wiki editor.&lt;/p&gt;
&lt;p&gt;Wikis and labels are the context that Dremio&apos;s AI Agent reads before generating SQL. Better metadata produces more accurate natural language responses.&lt;/p&gt;
&lt;h2&gt;Step 7: Ask Questions with the AI Agent&lt;/h2&gt;
&lt;p&gt;With your classified data and enriched wikis in place, navigate to the AI Agent on the Dremio homepage and try these prompts:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;Which products have the most negative reviews?&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent queries &lt;code&gt;v_review_sentiment&lt;/code&gt;, filters by &lt;code&gt;ai_sentiment = &apos;Negative&apos;&lt;/code&gt;, groups by &lt;code&gt;product_name&lt;/code&gt;, and returns a count. You&apos;ll see which products need attention based on LLM-analyzed sentiment rather than just star ratings.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;Show me a chart of ticket routing by department&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent queries &lt;code&gt;v_ticket_routing&lt;/code&gt;, groups by &lt;code&gt;ai_department&lt;/code&gt;, and generates a bar chart showing how tickets distribute across Billing, Technical Support, Feature Request, and Account Management.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;List all critical urgency tickets that are still open, ordered by creation date&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent filters &lt;code&gt;v_ticket_routing&lt;/code&gt; for &lt;code&gt;ai_urgency = &apos;Critical&apos;&lt;/code&gt; and &lt;code&gt;ticket_status = &apos;Open&apos;&lt;/code&gt;, sorts by &lt;code&gt;created_timestamp&lt;/code&gt;, and returns the results. This surfaces tickets that AI flagged as critical but haven&apos;t been resolved.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;Create a chart showing sentiment distribution by product and whether the customer has a support ticket&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent creates a multi-dimensional visualization from &lt;code&gt;v_review_sentiment&lt;/code&gt;, cross-referencing &lt;code&gt;product_name&lt;/code&gt;, &lt;code&gt;ai_sentiment&lt;/code&gt;, and &lt;code&gt;has_support_ticket&lt;/code&gt;. This reveals patterns like &amp;quot;CloudSync Pro has the most negative reviews among customers who also filed support tickets.&amp;quot;&lt;/p&gt;
&lt;h2&gt;Why Apache Iceberg Matters&lt;/h2&gt;
&lt;p&gt;All the tables you created in this tutorial are Apache Iceberg tables stored in Dremio&apos;s built-in Open Catalog. Iceberg provides ACID transactions, schema evolution, and time travel, but the performance benefits are especially relevant for AI-classified data.&lt;/p&gt;
&lt;h3&gt;Iceberg vs. Federated for AI Workloads&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Keep data federated when:&lt;/strong&gt; Your classification needs real-time source data; for example, classifying support tickets as they arrive from a live PostgreSQL database. Use manual Reflections with a short refresh interval to accelerate federated queries.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg when:&lt;/strong&gt; You&apos;re running batch classification jobs on historical data. The CTAS approach above creates Iceberg tables. Iceberg&apos;s automated performance management (compaction, manifest optimization, clustering) keeps these growing tables fast. Autonomous Reflections can auto-create pre-computed materializations based on how your dashboards query the classified data.&lt;/p&gt;
&lt;h3&gt;Cost Optimization Pattern&lt;/h3&gt;
&lt;p&gt;Run &lt;code&gt;AI_CLASSIFY&lt;/code&gt; once via CTAS to materialize results. Build Reflections on the materialized table for dashboard queries. This pattern means you pay for LLM tokens once during classification, and all subsequent analytical queries hit cached Reflections at zero LLM cost.&lt;/p&gt;
&lt;h2&gt;Next Steps&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Connect real data sources&lt;/strong&gt; : replace the &lt;code&gt;feedback_data&lt;/code&gt; folder with federated connections to your actual CRM, support platform, and review system&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Add Fine-Grained Access Control (FGAC)&lt;/strong&gt; : mask customer names or PII in classified results so analysts see sentiment patterns without accessing personal data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Experiment with Boolean classification&lt;/strong&gt; : use &lt;code&gt;ARRAY[true, false]&lt;/code&gt; for binary decisions like &amp;quot;Is this review about a security concern?&amp;quot; or &amp;quot;Does this ticket mention data loss?&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale with Reflections&lt;/strong&gt; : create Reflections on your materialized classification tables to accelerate dashboard queries&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you&apos;re running manual classification processes today, whether it&apos;s tagging support tickets, scoring reviews, or categorizing feedback, &lt;code&gt;AI_CLASSIFY&lt;/code&gt; replaces those workflows with a single SQL query. The classification runs inside the same platform where your data lives, governed by the same access controls, and immediately available to every BI tool and AI agent connected to Dremio.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=ai-classify-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and start classifying your data with SQL.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect PostgreSQL to Dremio Cloud: Query, Federate, and Accelerate Your Data</title><link>https://iceberglakehouse.com/posts/2026-03-connector-postgresql/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-postgresql/</guid><description>
PostgreSQL powers more production applications than almost any other open-source database. It&apos;s where your customer records, transaction logs, produc...</description><pubDate>Sun, 01 Mar 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;PostgreSQL powers more production applications than almost any other open-source database. It&apos;s where your customer records, transaction logs, product catalogs, and operational data live. But running analytics directly against PostgreSQL creates problems: heavy analytical queries compete with transactional workloads, cross-database joins require custom ETL, and your data team can&apos;t access PostgreSQL data alongside data in S3, Snowflake, or other systems without building pipelines.&lt;/p&gt;
&lt;p&gt;Dremio Cloud solves this by connecting directly to PostgreSQL and querying it in place. No data movement, no ETL pipelines, no replica databases. You write SQL in Dremio, and it pushes filtering and aggregation work back to PostgreSQL when possible, fetches only the results, and lets you join that data with any other connected source in the same query.&lt;/p&gt;
&lt;p&gt;This guide walks through connecting PostgreSQL to Dremio Cloud, from prerequisites to your first federated query.&lt;/p&gt;
&lt;h2&gt;Why PostgreSQL Users Need Dremio&lt;/h2&gt;
&lt;p&gt;PostgreSQL is an excellent transactional database, but it wasn&apos;t designed for the analytics patterns that modern teams need. Here are the problems Dremio solves:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cross-source analytics without pipelines.&lt;/strong&gt; Your customer data is in PostgreSQL, your clickstream data is in S3, and your revenue data is in Snowflake. Without Dremio, joining these datasets requires building ETL pipelines to centralize everything into one system. With Dremio, you connect all three as sources and write a single SQL query that joins across them. Dremio handles the federation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Protect production performance.&lt;/strong&gt; Running heavy &lt;code&gt;GROUP BY&lt;/code&gt; queries or full-table scans against your production PostgreSQL instance can degrade application performance. Dremio&apos;s Reflections solve this by creating pre-computed materializations of your most common analytical queries. After the first query, subsequent queries hit the Reflection instead of PostgreSQL, eliminating load on your production database.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Business context for AI.&lt;/strong&gt; Raw PostgreSQL tables have technical column names like &lt;code&gt;cust_id&lt;/code&gt; and &lt;code&gt;txn_amt&lt;/code&gt;. Dremio&apos;s semantic layer lets you create views that rename and restructure these columns with business logic, then attach wiki descriptions and labels. When your team asks Dremio&apos;s AI Agent &amp;quot;Who are our highest-value customers?&amp;quot;, the Agent understands what &amp;quot;highest-value&amp;quot; means because you&apos;ve defined it in the semantic layer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Governance without modifying PostgreSQL.&lt;/strong&gt; Dremio&apos;s Fine-Grained Access Control (FGAC) lets you mask sensitive columns (Social Security numbers, email addresses) and filter rows based on user roles. You don&apos;t need to modify PostgreSQL permissions or create restricted database views : the governance layer lives in Dremio and applies across all tools and users.&lt;/p&gt;
&lt;h2&gt;What You Need Before Connecting&lt;/h2&gt;
&lt;p&gt;Before configuring the connection in Dremio, make sure you have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;PostgreSQL hostname or IP address&lt;/strong&gt; : the network address of your database server&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port number&lt;/strong&gt; : PostgreSQL defaults to &lt;code&gt;5432&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database name&lt;/strong&gt; : the specific database you want to connect&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Username and password&lt;/strong&gt; : credentials for a user with read access to the tables you want to query&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network accessibility&lt;/strong&gt; : Dremio Cloud connects to your PostgreSQL instance over the public internet by default. Ensure port 5432 (or your custom port) is open in your AWS Security Group, Azure NSG, or firewall rules&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If your PostgreSQL instance is in a private subnet (common for production databases), you&apos;ll need to configure networking to allow Dremio Cloud to reach it. Check &lt;a href=&quot;https://docs.dremio.com/dremio-cloud/bring-data/connect/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-postgresql-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&apos;s network connectivity documentation&lt;/a&gt; for options.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dremio Cloud account:&lt;/strong&gt; Sign up at &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-postgresql-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;dremio.com/get-started&lt;/a&gt; for a free 30-day trial with $400 in compute credits.&lt;/p&gt;
&lt;h2&gt;Step-by-Step: Connect PostgreSQL to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add a New Source&lt;/h3&gt;
&lt;p&gt;In the Dremio console, click the &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; button in the left sidebar and select &lt;strong&gt;PostgreSQL&lt;/strong&gt; from the database source types. Alternatively, navigate to &lt;strong&gt;Databases&lt;/strong&gt; in the data panel and click &lt;strong&gt;Add database&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure General Settings&lt;/h3&gt;
&lt;p&gt;Fill in the connection details:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; Enter a descriptive name for this source (e.g., &lt;code&gt;production-postgres&lt;/code&gt; or &lt;code&gt;crm-database&lt;/code&gt;). This name will appear in your SQL queries when referencing tables from this source. Note: the name cannot include &lt;code&gt;/&lt;/code&gt;, &lt;code&gt;:&lt;/code&gt;, &lt;code&gt;[&lt;/code&gt;, or &lt;code&gt;]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; Enter your PostgreSQL hostname (e.g., &lt;code&gt;my-db.cluster-abc123.us-east-1.rds.amazonaws.com&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port:&lt;/strong&gt; Enter the port number. The default is &lt;code&gt;5432&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database:&lt;/strong&gt; Enter the database name you want to connect to.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Encrypt connection:&lt;/strong&gt; Toggle this on to use SSL encryption between Dremio and PostgreSQL. Recommended for production connections, especially when connecting over the internet.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Choose one of two authentication methods:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Master Authentication (default):&lt;/strong&gt; Provide a username and password directly. This is the simplest option : enter the credentials for a PostgreSQL user that has &lt;code&gt;SELECT&lt;/code&gt; permissions on the tables you want to query.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Secret Resource URL:&lt;/strong&gt; Instead of storing the password in Dremio, provide an AWS Secrets Manager ARN (e.g., &lt;code&gt;arn:aws:secretsmanager:us-west-2:123456789012:secret:my-rds-secret-VNenFy&lt;/code&gt;). Dremio fetches the password from Secrets Manager at connection time. This is the preferred option for production deployments because it centralizes credential management and supports rotation.&lt;/p&gt;
&lt;h3&gt;4. Configure Advanced Options (Optional)&lt;/h3&gt;
&lt;p&gt;The advanced options let you fine-tune connection behavior:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Record fetch size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Number of rows Dremio fetches per batch. Set to 0 for automatic.&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maximum Idle Connections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How many idle connections Dremio maintains to PostgreSQL.&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Idle Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds before an idle connection is closed.&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Encryption Validation Mode&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;When SSL is enabled: validate certificate + hostname, certificate only, or no validation.&lt;/td&gt;
&lt;td&gt;Validate both&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom key-value pairs for JDBC connection parameters.&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For most users, the defaults work fine. If you&apos;re connecting to an Amazon RDS or Aurora instance, the default SSL settings are compatible.&lt;/p&gt;
&lt;h3&gt;5. Set Reflection Refresh Schedule&lt;/h3&gt;
&lt;p&gt;This controls how often Dremio refreshes pre-computed Reflections built on PostgreSQL data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Refresh every:&lt;/strong&gt; How often Reflections update (hours, days, or weeks). More frequent refreshes mean fresher data but more queries against PostgreSQL.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Expire after:&lt;/strong&gt; How long before unused Reflections are automatically removed.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For operational PostgreSQL data that changes throughout the day, a refresh interval of 1-4 hours is typical. For historical data that rarely changes, daily or weekly is sufficient.&lt;/p&gt;
&lt;h3&gt;6. Configure Metadata Refresh&lt;/h3&gt;
&lt;p&gt;These settings control how often Dremio checks PostgreSQL for new or changed tables:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dataset Discovery (Fetch every):&lt;/strong&gt; How often Dremio looks for new tables or schema changes. Default is 1 hour.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dataset Details (Fetch every):&lt;/strong&gt; How often Dremio refreshes detailed metadata for tables you&apos;ve already queried. Default is 1 hour.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;7. Set Privileges and Save&lt;/h3&gt;
&lt;p&gt;Optionally, restrict which Dremio users or roles can access this PostgreSQL source. Click &lt;strong&gt;Save&lt;/strong&gt; to create the connection.&lt;/p&gt;
&lt;h2&gt;Query PostgreSQL Data from Dremio&lt;/h2&gt;
&lt;p&gt;Once connected, your PostgreSQL database appears as a source in Dremio&apos;s SQL Runner. Browse the source to see schemas and tables, then query them directly:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT customer_id, first_name, last_name, signup_date
FROM &amp;quot;production-postgres&amp;quot;.public.customers
WHERE signup_date &amp;gt; &apos;2024-01-01&apos;
ORDER BY signup_date DESC
LIMIT 100;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The source name (&lt;code&gt;production-postgres&lt;/code&gt;) is the name you gave the source. PostgreSQL schemas appear as sub-folders, and tables appear within those schemas.&lt;/p&gt;
&lt;h2&gt;Federate PostgreSQL with Other Sources&lt;/h2&gt;
&lt;p&gt;The real value appears when you combine PostgreSQL data with other sources. Here&apos;s an example that joins PostgreSQL customer data with S3 clickstream data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  c.customer_id,
  c.first_name || &apos; &apos; || c.last_name AS customer_name,
  c.segment,
  COUNT(e.event_id) AS total_events,
  SUM(CASE WHEN e.event_type = &apos;purchase&apos; THEN 1 ELSE 0 END) AS purchases
FROM &amp;quot;production-postgres&amp;quot;.public.customers c
LEFT JOIN &amp;quot;s3-clickstream&amp;quot;.events.user_events e
  ON c.customer_id = e.user_id
GROUP BY c.customer_id, c.first_name, c.last_name, c.segment
ORDER BY purchases DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Dremio pushes the filter and projection operations to PostgreSQL (this is called &lt;strong&gt;predicate pushdown&lt;/strong&gt;), fetches only the matching rows, then joins them with the S3 data in Dremio&apos;s query engine. PostgreSQL handles what it&apos;s good at (filtering indexed columns), and Dremio handles the cross-source join.&lt;/p&gt;
&lt;h2&gt;Build a Semantic Layer Over PostgreSQL&lt;/h2&gt;
&lt;p&gt;Create views to give your PostgreSQL data business-friendly names and logic:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.customer_overview AS
SELECT
  c.customer_id,
  c.first_name || &apos; &apos; || c.last_name AS full_name,
  c.email,
  c.segment AS customer_segment,
  c.signup_date,
  CASE
    WHEN c.segment = &apos;Enterprise&apos; AND c.lifetime_value &amp;gt; 50000 THEN &apos;Strategic&apos;
    WHEN c.lifetime_value &amp;gt; 10000 THEN &apos;High Value&apos;
    ELSE &apos;Standard&apos;
  END AS account_tier
FROM &amp;quot;production-postgres&amp;quot;.public.customers c;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then attach wiki descriptions and labels through the Catalog (edit pencil icon → Details tab → Generate Wiki/Tags) so the AI Agent understands the data when users ask natural language questions.&lt;/p&gt;
&lt;h2&gt;Predicate Pushdown: What Dremio Offloads to PostgreSQL&lt;/h2&gt;
&lt;p&gt;Dremio doesn&apos;t download entire PostgreSQL tables and process them locally. When possible, it pushes operations back to PostgreSQL to minimize data transfer. PostgreSQL supports an extensive set of pushdowns in Dremio, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Logical operators:&lt;/strong&gt; AND, OR, NOT&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Comparisons:&lt;/strong&gt; =, !=, &amp;lt;, &amp;gt;, &amp;lt;=, &amp;gt;=, BETWEEN, IN, LIKE, IS NULL, IS NOT NULL&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aggregations:&lt;/strong&gt; SUM, AVG, COUNT, MIN, MAX, STDDEV, MEDIAN, VAR_POP&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Math functions:&lt;/strong&gt; ABS, CEIL, FLOOR, ROUND, MOD, SQRT, POWER, LOG&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;String functions:&lt;/strong&gt; CONCAT, SUBSTR, LENGTH, LOWER, UPPER, TRIM, REPLACE, REVERSE&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Date functions:&lt;/strong&gt; DATE_ADD, DATE_SUB, DATE_TRUNC (day, hour, month, quarter, year), EXTRACT&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This means a query like &lt;code&gt;SELECT department, AVG(salary) FROM postgres.hr.employees WHERE hire_date &amp;gt; &apos;2023-01-01&apos; GROUP BY department&lt;/code&gt; runs mostly on PostgreSQL : Dremio sends the filter, aggregation, and grouping to Postgres and only transfers the summarized result.&lt;/p&gt;
&lt;h2&gt;Accelerate PostgreSQL Queries with Reflections&lt;/h2&gt;
&lt;p&gt;For queries that run frequently, create Reflections to avoid hitting PostgreSQL repeatedly:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Build a view over your PostgreSQL data.&lt;/li&gt;
&lt;li&gt;In the Catalog, select the view and create a Reflection.&lt;/li&gt;
&lt;li&gt;Choose the columns and aggregations to include.&lt;/li&gt;
&lt;li&gt;Set the refresh interval (how often Dremio re-queries PostgreSQL to update the Reflection).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;After the Reflection is built, Dremio&apos;s query optimizer automatically routes matching queries to the Reflection instead of PostgreSQL. Your analysts see the same tables and write the same SQL : the acceleration is transparent.&lt;/p&gt;
&lt;p&gt;This is particularly valuable for dashboard queries. BI tools like Tableau or Power BI connected to Dremio via Arrow Flight/ODBC get sub-second response times from Reflections, even though the source data lives in PostgreSQL.&lt;/p&gt;
&lt;h2&gt;Data Type Mapping&lt;/h2&gt;
&lt;p&gt;Dremio automatically maps PostgreSQL types to Dremio types. The key mappings to know:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PostgreSQL&lt;/th&gt;
&lt;th&gt;Dremio&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;BIGINT / BIGSERIAL&lt;/td&gt;
&lt;td&gt;BIGINT&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INT / SERIAL&lt;/td&gt;
&lt;td&gt;INTEGER&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NUMERIC&lt;/td&gt;
&lt;td&gt;DECIMAL&lt;/td&gt;
&lt;td&gt;Preserves precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VARCHAR / TEXT / CHAR&lt;/td&gt;
&lt;td&gt;VARCHAR&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BOOLEAN / BIT&lt;/td&gt;
&lt;td&gt;BOOLEAN&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DATE&lt;/td&gt;
&lt;td&gt;DATE&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TIMESTAMP / TIMESTAMPTZ&lt;/td&gt;
&lt;td&gt;TIMESTAMP&lt;/td&gt;
&lt;td&gt;Timezone-aware types convert&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FLOAT4 / FLOAT8&lt;/td&gt;
&lt;td&gt;FLOAT / DOUBLE&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BYTEA&lt;/td&gt;
&lt;td&gt;VARBINARY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MONEY&lt;/td&gt;
&lt;td&gt;DOUBLE&lt;/td&gt;
&lt;td&gt;Converted to numeric&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Most types map directly. If you use PostgreSQL-specific types like &lt;code&gt;JSONB&lt;/code&gt;, &lt;code&gt;ARRAY&lt;/code&gt;, or &lt;code&gt;HSTORE&lt;/code&gt;, those are not supported in Dremio&apos;s connector and won&apos;t appear in query results.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on PostgreSQL Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The built-in AI Agent lets users ask questions about PostgreSQL data in plain English. Instead of writing SQL, a business user asks &amp;quot;Who are our highest-value enterprise customers?&amp;quot; and the Agent generates the correct query by reading the wiki descriptions attached to your semantic layer views. The Agent understands that &amp;quot;highest-value&amp;quot; maps to &lt;code&gt;lifetime_value&lt;/code&gt; and &amp;quot;enterprise&amp;quot; maps to &lt;code&gt;segment = &apos;Enterprise&apos;&lt;/code&gt; because you&apos;ve defined it in the view&apos;s wiki.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; connects external AI chat clients :  Claude, ChatGPT, and others ,  to your PostgreSQL data through Dremio. The hosted MCP Server provides OAuth authentication that propagates user identity and authorization for every interaction:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth application in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client (e.g., &lt;code&gt;https://claude.ai/api/mcp/auth_callback&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A sales director can ask Claude &amp;quot;Show me our strategic account customers who signed up in Q1&amp;quot; and get governed, accurate results from your PostgreSQL data without SQL.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;p&gt;Use AI directly in queries against PostgreSQL data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify customers based on their profile data
SELECT
  full_name,
  customer_segment,
  account_tier,
  AI_CLASSIFY(
    &apos;Based on this customer profile, predict their likely next action&apos;,
    &apos;Customer: &apos; || full_name || &apos;, Segment: &apos; || customer_segment || &apos;, Tier: &apos; || account_tier,
    ARRAY[&apos;Upsell Opportunity&apos;, &apos;Renewal Risk&apos;, &apos;Expansion Ready&apos;, &apos;Stable&apos;]
  ) AS predicted_action
FROM analytics.gold.customer_overview
WHERE account_tier IN (&apos;Strategic&apos;, &apos;High Value&apos;);

-- Generate personalized engagement plans
SELECT
  full_name,
  AI_GENERATE(
    &apos;Write a one-sentence personalized engagement recommendation&apos;,
    &apos;Customer: &apos; || full_name || &apos;, Segment: &apos; || customer_segment || &apos;, Tier: &apos; || account_tier || &apos;, Signup: &apos; || CAST(signup_date AS VARCHAR)
  ) AS engagement_recommendation
FROM analytics.gold.customer_overview
WHERE account_tier = &apos;Strategic&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;AI_CLASSIFY&lt;/code&gt; categorizes data with LLM inference inside SQL. &lt;code&gt;AI_GENERATE&lt;/code&gt; produces text. &lt;code&gt;AI_SIMILARITY&lt;/code&gt; (not shown) finds semantic matches between text fields. All run directly in your query.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;If you run PostgreSQL for your application data and want to include it in cross-source analytics, AI-driven queries, or governed dashboards without building ETL pipelines, Dremio Cloud is the fastest path.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-postgresql-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your PostgreSQL instance in under 5 minutes.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect AWS Glue Data Catalog to Dremio Cloud: Query and Manage Your AWS Iceberg Tables</title><link>https://iceberglakehouse.com/posts/2026-03-connector-aws-glue/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-aws-glue/</guid><description>
AWS Glue Data Catalog is AWS&apos;s managed metadata service for data lakes. It stores table definitions, schemas, partition information, and statistics f...</description><pubDate>Sun, 01 Mar 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;AWS Glue Data Catalog is AWS&apos;s managed metadata service for data lakes. It stores table definitions, schemas, partition information, and statistics for data stored in Amazon S3. If you&apos;ve built your data lake on AWS using Apache Spark (on EMR), AWS Glue ETL jobs, or Amazon Athena, your table metadata lives in Glue. But Glue is just a catalog : a registry of what&apos;s where. To actually query the data, you need Athena (per-TB pricing), EMR clusters (infrastructure management), or Redshift Spectrum (additional cost).&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects to your Glue Data Catalog and queries the underlying Iceberg tables with full read and write support. You get enterprise-grade SQL, Reflections for query acceleration, governance, and AI analytics : all on top of your existing Glue-managed lakehouse.&lt;/p&gt;
&lt;h2&gt;Why Glue Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Query Without Athena&apos;s Per-TB Pricing&lt;/h3&gt;
&lt;p&gt;Athena charges per terabyte of data scanned, regardless of whether the query is the same one you ran 5 minutes ago. For teams running dashboard queries, scheduled reports, and ad-hoc exploration, this pricing model creates unpredictable costs. Dremio&apos;s Reflections cache results so repeated queries don&apos;t re-scan S3. C3 (Columnar Cloud Cache) caches file data on local NVMe for frequently accessed datasets. You pay for Dremio compute time, not per-TB scanned.&lt;/p&gt;
&lt;h3&gt;Full Read and Write on Iceberg Tables&lt;/h3&gt;
&lt;p&gt;Dremio supports full DML (INSERT, UPDATE, DELETE, MERGE) on Glue-cataloged Iceberg tables. Create tables, run transformations, build data pipelines, and maintain your lakehouse entirely through Dremio&apos;s SQL engine : no need to spin up EMR clusters or Glue ETL jobs for simple transformations.&lt;/p&gt;
&lt;h3&gt;Federate Glue with Non-AWS Sources&lt;/h3&gt;
&lt;p&gt;Your Glue-managed data lake covers AWS data, but your application database is on Azure (Azure SQL), your analytics warehouse is Snowflake, and your marketing data is in Google BigQuery. Dremio federates across all of them in a single SQL query.&lt;/p&gt;
&lt;h3&gt;Automated Iceberg Maintenance&lt;/h3&gt;
&lt;p&gt;Dremio automatically compacts small files into optimally sized ones, rewrites manifests for faster metadata reads, and clusters data based on query patterns : all on Glue-cataloged Iceberg tables. This eliminates the need for manual &lt;code&gt;OPTIMIZE&lt;/code&gt; jobs or scheduled Glue ETL maintenance tasks.&lt;/p&gt;
&lt;h3&gt;Credential Vending&lt;/h3&gt;
&lt;p&gt;Dremio uses Glue&apos;s credential vending to securely access the underlying S3 data without separate S3 credentials. The catalog provides temporary, scoped credentials for each data request.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;AWS Account&lt;/strong&gt; with Glue Data Catalog configured&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;IAM Role&lt;/strong&gt; with permissions: &lt;code&gt;glue:GetDatabase&lt;/code&gt;, &lt;code&gt;glue:GetTable&lt;/code&gt;, &lt;code&gt;glue:GetTables&lt;/code&gt;, &lt;code&gt;glue:GetPartitions&lt;/code&gt;, and S3 read/write permissions for underlying data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AWS Region&lt;/strong&gt; where your Glue catalog is deployed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; : &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-aws-glue-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect Glue to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;Click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; in the Dremio console and select &lt;strong&gt;AWS Glue Data Catalog&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; Descriptive identifier (e.g., &lt;code&gt;glue-catalog&lt;/code&gt; or &lt;code&gt;aws-lakehouse&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AWS Region:&lt;/strong&gt; The region where your Glue catalog is deployed.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Provide IAM Role ARN (recommended for Dremio Cloud) or AWS Access Key/Secret Key.&lt;/p&gt;
&lt;h3&gt;4. Select Databases&lt;/h3&gt;
&lt;p&gt;Choose which Glue databases to expose. You can enable specific databases or allow access to all.&lt;/p&gt;
&lt;h3&gt;5. Configure Advanced Settings&lt;/h3&gt;
&lt;p&gt;Set Reflection Refresh, Metadata refresh intervals, and click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Query and Write to Glue Iceberg Tables&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query a Glue-cataloged Iceberg table
SELECT product_id, product_name, category, price, inventory_count
FROM &amp;quot;glue-catalog&amp;quot;.ecommerce.products
WHERE category = &apos;Electronics&apos; AND price &amp;gt; 50 AND inventory_count &amp;gt; 0
ORDER BY price ASC;

-- Write to Glue Iceberg tables
INSERT INTO &amp;quot;glue-catalog&amp;quot;.analytics.daily_summary
SELECT
  DATE_TRUNC(&apos;day&apos;, order_date) AS day,
  COUNT(*) AS order_count,
  SUM(total) AS revenue,
  AVG(total) AS avg_order_value
FROM &amp;quot;glue-catalog&amp;quot;.ecommerce.orders
WHERE order_date = CURRENT_DATE - INTERVAL &apos;1&apos; DAY
GROUP BY 1;

-- MERGE for upserts
MERGE INTO &amp;quot;glue-catalog&amp;quot;.analytics.product_metrics AS target
USING (
  SELECT product_id, COUNT(*) AS orders, SUM(quantity) AS units_sold
  FROM &amp;quot;glue-catalog&amp;quot;.ecommerce.order_items
  WHERE order_date &amp;gt;= CURRENT_DATE - INTERVAL &apos;7&apos; DAY
  GROUP BY product_id
) AS source
ON target.product_id = source.product_id
WHEN MATCHED THEN UPDATE SET orders = source.orders, units_sold = source.units_sold
WHEN NOT MATCHED THEN INSERT (product_id, orders, units_sold) VALUES (source.product_id, source.orders, source.units_sold);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate with Non-AWS Sources&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join Glue products with external review and supplier data
SELECT
  g.product_name,
  g.price,
  g.category,
  pg.avg_rating,
  pg.review_count,
  sf.supplier_name,
  sf.lead_time_days
FROM &amp;quot;glue-catalog&amp;quot;.ecommerce.products g
LEFT JOIN &amp;quot;postgres-reviews&amp;quot;.public.product_reviews pg ON g.product_id = pg.product_id
LEFT JOIN &amp;quot;snowflake-supply&amp;quot;.PUBLIC.SUPPLIERS sf ON g.supplier_id = sf.supplier_id
WHERE g.category = &apos;Electronics&apos;
ORDER BY pg.avg_rating DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.product_performance AS
SELECT
  g.product_id,
  g.product_name,
  g.category,
  g.price,
  SUM(oi.quantity) AS units_sold,
  SUM(oi.quantity * g.price) AS revenue,
  CASE
    WHEN SUM(oi.quantity) &amp;gt; 1000 THEN &apos;Best Seller&apos;
    WHEN SUM(oi.quantity) &amp;gt; 100 THEN &apos;Popular&apos;
    ELSE &apos;Niche&apos;
  END AS popularity_tier
FROM &amp;quot;glue-catalog&amp;quot;.ecommerce.products g
LEFT JOIN &amp;quot;glue-catalog&amp;quot;.ecommerce.order_items oi ON g.product_id = oi.product_id
GROUP BY g.product_id, g.product_name, g.category, g.price;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; → &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;Ask &amp;quot;Which electronics products are best sellers?&amp;quot; and the AI Agent generates SQL from your semantic layer. The wiki descriptions you&apos;ve attached to views guide the Agent&apos;s understanding of terms like &amp;quot;best seller&amp;quot; and &amp;quot;popularity tier.&amp;quot;&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; connects Claude and ChatGPT to your Glue-cataloged data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth app in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A product manager asks ChatGPT &amp;quot;Show me niche electronics products with high ratings that might be under-marketed&amp;quot; and gets governed results from your Glue lakehouse.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Generate product descriptions from catalog data
SELECT
  product_name,
  category,
  price,
  AI_GENERATE(
    &apos;Write a one-sentence marketing description for this product&apos;,
    &apos;Product: &apos; || product_name || &apos;, Category: &apos; || category || &apos;, Price: $&apos; || CAST(price AS VARCHAR) || &apos;, Popularity: &apos; || popularity_tier
  ) AS marketing_description
FROM analytics.gold.product_performance
WHERE popularity_tier = &apos;Best Seller&apos;;

-- Classify inventory risk
SELECT
  product_name,
  inventory_count,
  AI_CLASSIFY(
    &apos;Based on inventory levels and sales velocity, classify the reorder urgency&apos;,
    &apos;Product: &apos; || product_name || &apos;, Stock: &apos; || CAST(inventory_count AS VARCHAR) || &apos;, Units Sold (7d): &apos; || CAST(units_sold AS VARCHAR),
    ARRAY[&apos;Order Now&apos;, &apos;Order Soon&apos;, &apos;Adequate Stock&apos;, &apos;Overstocked&apos;]
  ) AS reorder_urgency
FROM &amp;quot;glue-catalog&amp;quot;.ecommerce.products g
JOIN analytics.gold.product_performance pp ON g.product_id = pp.product_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Reflections for Performance&lt;/h2&gt;
&lt;p&gt;Create Reflections on product performance and daily summary views to cache results and serve BI tools with sub-second response times:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, navigate to the view you want to accelerate&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; to cache the full view or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt; to pre-compute specific SUM/COUNT/AVG aggregations&lt;/li&gt;
&lt;li&gt;Select columns to include in the Reflection&lt;/li&gt;
&lt;li&gt;Set the &lt;strong&gt;Refresh Interval&lt;/strong&gt; : how often Dremio re-queries the underlying Iceberg tables to update the cache&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dashboard queries from Tableau, Power BI, or Looker connected via Arrow Flight hit the Reflection instead of re-reading S3 Iceberg files, providing sub-second response times even for complex aggregations.&lt;/p&gt;
&lt;h2&gt;Time Travel on Glue Iceberg Tables&lt;/h2&gt;
&lt;p&gt;Iceberg tables cataloged in Glue support time travel through Dremio:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query a table as it existed 7 days ago
SELECT product_id, price, inventory_count
FROM &amp;quot;glue-catalog&amp;quot;.ecommerce.products
AT TIMESTAMP &apos;2024-06-01 00:00:00&apos;;

-- Compare current state to a historical snapshot
SELECT
  curr.product_name,
  curr.price AS current_price,
  hist.price AS previous_price,
  ROUND((curr.price - hist.price) / hist.price * 100, 2) AS price_change_pct
FROM &amp;quot;glue-catalog&amp;quot;.ecommerce.products curr
JOIN &amp;quot;glue-catalog&amp;quot;.ecommerce.products AT TIMESTAMP &apos;2024-01-01 00:00:00&apos; hist
  ON curr.product_id = hist.product_id
WHERE curr.price != hist.price
ORDER BY ABS(price_change_pct) DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Time travel is valuable for auditing (&amp;quot;What were inventory levels at quarter end?&amp;quot;), debugging (&amp;quot;What changed in the last 24 hours?&amp;quot;), and compliance (&amp;quot;Show data as it was on the regulatory reporting date&amp;quot;).&lt;/p&gt;
&lt;h2&gt;Governance on Glue Data&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) adds governance capabilities that Glue and Athena don&apos;t provide natively:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Hide sensitive fields (customer PII, pricing details) from specific roles while allowing full access for authorized users. For example, mask &lt;code&gt;customer_email&lt;/code&gt; for marketing analysts but show it for customer support teams.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Automatically filter data based on the querying user&apos;s role. A regional manager sees only their region&apos;s data. A global admin sees everything.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; The same governance policies apply whether data comes from Glue, PostgreSQL, Snowflake, or any other connected source.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across all access methods : SQL Runner, BI tools via Arrow Flight/ODBC, AI Agent queries, and MCP Server interactions.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Arrow Flight connector provides 10-100x faster data transfer compared to JDBC/ODBC for BI tools. After creating views over your Glue data, connect:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Use the Dremio connector, enter your Dremio Cloud endpoint&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Use the Dremio ODBC driver or Arrow Flight connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; Use &lt;code&gt;pyarrow.flight&lt;/code&gt; client for high-speed data access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; Use the &lt;code&gt;dbt-dremio&lt;/code&gt; adapter for transformation workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries from these tools benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query Glue-cataloged data directly from their IDE. Ask Copilot &amp;quot;Show me product inventory trends from the Glue catalog&amp;quot; and it generates SQL using Dremio&apos;s semantic layer : all without leaving your development environment.&lt;/p&gt;
&lt;h2&gt;Glue vs. Athena vs. Dremio: When to Use Each&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;AWS Glue&lt;/th&gt;
&lt;th&gt;Amazon Athena&lt;/th&gt;
&lt;th&gt;Dremio Cloud&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metadata catalog&lt;/td&gt;
&lt;td&gt;Serverless SQL&lt;/td&gt;
&lt;td&gt;Federated analytics + catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pricing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free (metadata)&lt;/td&gt;
&lt;td&gt;Per TB scanned&lt;/td&gt;
&lt;td&gt;Compute-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Write support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Via ETL jobs&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Full DML&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Federation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Federated queries (limited)&lt;/td&gt;
&lt;td&gt;Full cross-source federation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI analytics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;AI Agent, MCP, SQL Functions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reflections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (automatic caching)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Governance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;IAM only&lt;/td&gt;
&lt;td&gt;IAM + Lake Formation&lt;/td&gt;
&lt;td&gt;FGAC + semantic layer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Glue is the metadata catalog. Athena is a query engine with per-TB pricing. Dremio is a federated platform that uses Glue as one of many catalogs and adds AI, governance, and performance acceleration.&lt;/p&gt;
&lt;h2&gt;When to Keep Tables in Glue vs. Use Dremio&apos;s Open Catalog&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in Glue:&lt;/strong&gt; Tables managed by existing AWS-native pipelines (EMR, Glue ETL), tables shared across multiple AWS services, data consumed by Athena or Redshift Spectrum alongside Dremio.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use Dremio&apos;s Open Catalog:&lt;/strong&gt; New analytical tables, data created through Dremio transformations, datasets where you want zero-configuration automatic maintenance (compaction, vacuuming, Autonomous Reflections).&lt;/p&gt;
&lt;p&gt;You can use both simultaneously : Glue for your existing AWS lakehouse, Dremio&apos;s Open Catalog for new analytical workloads.&lt;/p&gt;
&lt;h2&gt;Dremio vs. Athena for Querying Glue-Managed Tables&lt;/h2&gt;
&lt;p&gt;Both Dremio and Athena can query tables registered in the Glue Data Catalog. Key differences:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Dremio Cloud&lt;/th&gt;
&lt;th&gt;Amazon Athena&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pricing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Compute-based&lt;/td&gt;
&lt;td&gt;$5/TB scanned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reflections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Cache results&lt;/td&gt;
&lt;td&gt;❌ Scans every time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Federation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PostgreSQL, MongoDB, BigQuery, etc.&lt;/td&gt;
&lt;td&gt;S3 + federated queries (limited)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Natural language queries&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP Server&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Claude/ChatGPT integration&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BI Tool Connectivity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Arrow Flight (10-100x faster)&lt;/td&gt;
&lt;td&gt;ODBC/JDBC only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Governance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Column masking + row filtering&lt;/td&gt;
&lt;td&gt;Lake Formation policies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Iceberg Write Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full DML&lt;/td&gt;
&lt;td&gt;Full DML&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For organizations already using Athena, Dremio adds federation, AI analytics, and cost savings through Reflections. Many teams run both: Athena for quick ad-hoc S3 queries, Dremio for cross-source analytics and BI tool serving.&lt;/p&gt;
&lt;h2&gt;AWS Lake Formation Integration&lt;/h2&gt;
&lt;p&gt;AWS Lake Formation provides fine-grained access control for Glue-managed tables. When connecting to Glue through Dremio:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Lake Formation permissions&lt;/strong&gt; govern which tables and columns the Dremio IAM role can access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio FGAC&lt;/strong&gt; adds additional governance layers (column masking, row-level filtering) on top of Lake Formation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Both layers work together:&lt;/strong&gt; Lake Formation controls what Dremio can see; Dremio FGAC controls what individual users see&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This dual-layer governance model gives you AWS-native access control at the storage level and Dremio-managed access control at the query level : comprehensive governance without compromising on either side.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;AWS Glue Data Catalog users can query, write, optimize, and AI-enrich their Iceberg tables through Dremio Cloud : with federation, governance, and performance acceleration that Athena and EMR don&apos;t provide.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-aws-glue-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your Glue catalog.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Snowflake Open Catalog to Dremio Cloud: Multi-Engine Iceberg Analytics</title><link>https://iceberglakehouse.com/posts/2026-03-connector-snowflake-open-catalog/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-snowflake-open-catalog/</guid><description>
Snowflake Open Catalog is Snowflake&apos;s managed implementation of the Apache Iceberg REST catalog specification, based on the open-source Apache Polari...</description><pubDate>Sun, 01 Mar 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Snowflake Open Catalog is Snowflake&apos;s managed implementation of the Apache Iceberg REST catalog specification, based on the open-source Apache Polaris project. It serves as a centralized metadata catalog for Apache Iceberg tables, enabling multiple compute engines : including Dremio, Spark, Trino, and Flink , to read from and write to the same Iceberg tables without metadata conflicts.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects to Snowflake Open Catalog as a first-class Iceberg data source. You get full read and write access to Iceberg tables, automatic table maintenance (compaction, manifest optimization, vacuuming), and the ability to federate catalog data with databases, object storage, cloud warehouses, and other catalogs : all through standard SQL.&lt;/p&gt;
&lt;p&gt;For organizations already invested in Snowflake, the Open Catalog is a strategic choice for multi-engine interoperability. Unlike Snowflake&apos;s proprietary internal catalog (which is only accessible through Snowflake compute), the Open Catalog exposes Iceberg metadata via a standard REST API. This means you&apos;re not locked into Snowflake compute for every analytical query : Dremio can read the same tables at a fraction of the credit cost for repetitive workloads. Dremio also provides its federated engine, Reflections, governance, and AI capabilities , all without duplicating data or metadata.&lt;/p&gt;
&lt;h2&gt;Why Snowflake Open Catalog Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Multi-Engine Strategy Without Vendor Lock-In&lt;/h3&gt;
&lt;p&gt;Snowflake Open Catalog is designed for multi-engine compatibility, which makes it an ideal complement to Dremio. By connecting Dremio to your Snowflake Open Catalog, you add a query engine that specializes in areas Snowflake doesn&apos;t:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Federation:&lt;/strong&gt; Join catalog tables with PostgreSQL, MongoDB, S3, BigQuery, and any other Dremio-connected source in a single SQL query : something Snowflake can&apos;t do natively with non-Snowflake sources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Autonomous performance management:&lt;/strong&gt; Dremio automatically compacts files, rewrites manifests, and builds Reflections based on query patterns for external catalog tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI-powered querying:&lt;/strong&gt; Dremio&apos;s AI Agent, MCP Server, and AI SQL Functions bring LLM capabilities to your catalog data.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Cost Optimization&lt;/h3&gt;
&lt;p&gt;Instead of running all workloads through Snowflake credits, offload analytical queries to Dremio. Dremio&apos;s Reflections cache results so repeated queries don&apos;t consume Snowflake credits. For organizations spending significant amounts on Snowflake compute, routing read-heavy analytical workloads through Dremio can reduce overall costs.&lt;/p&gt;
&lt;h3&gt;Federate with Non-Snowflake Sources&lt;/h3&gt;
&lt;p&gt;Snowflake&apos;s data sharing works within Snowflake. But what if you need to join your Snowflake Open Catalog data with PostgreSQL application data, MongoDB user profiles, or S3 raw event logs? Dremio&apos;s federation engine does exactly that : no ETL pipelines, no data duplication.&lt;/p&gt;
&lt;h3&gt;Credential Vending&lt;/h3&gt;
&lt;p&gt;Snowflake Open Catalog supports credential vending, meaning Dremio doesn&apos;t need separate storage credentials to access the underlying S3, Azure, or GCS data. The catalog provides temporary, scoped credentials for accessing data files. This simplifies security configuration and reduces the credentials you need to manage.&lt;/p&gt;
&lt;h3&gt;Write Support for External Catalogs&lt;/h3&gt;
&lt;p&gt;Dremio can write to external Snowflake Open Catalogs, enabling you to create tables, run transformations, and build data pipelines using Dremio&apos;s SQL engine while keeping metadata managed in Snowflake&apos;s catalog.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;p&gt;Before connecting to Snowflake Open Catalog, confirm you have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Snowflake Open Catalog account URL&lt;/strong&gt; : the endpoint for your catalog instance&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OAuth or Personal Access Token (PAT) credentials&lt;/strong&gt; : for authenticating to the catalog&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Catalog names&lt;/strong&gt; : the specific catalogs you want to access (internal read-only and/or external read-write)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage configuration&lt;/strong&gt; : if credential vending isn&apos;t available for your setup, you&apos;ll need S3, Azure, or GCS credentials for the underlying data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; : &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-snowflake-open-catalog-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect Snowflake Open Catalog to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;In the Dremio console, click the &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; button in the left sidebar and select &lt;strong&gt;Snowflake Open Catalog&lt;/strong&gt; from the catalog source types.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection Details&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; A descriptive identifier (e.g., &lt;code&gt;snowflake-open-catalog&lt;/code&gt; or &lt;code&gt;lakehouse-catalog&lt;/code&gt;). This appears in SQL queries as the source prefix.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Catalog URL:&lt;/strong&gt; The Snowflake Open Catalog endpoint.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Credentials:&lt;/strong&gt; OAuth client ID/secret or a Personal Access Token.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Select Catalogs&lt;/h3&gt;
&lt;p&gt;Choose which catalogs to enable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Internal catalogs&lt;/strong&gt; are read-only from Dremio&apos;s perspective : you can query but not write.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;External catalogs&lt;/strong&gt; support full read and write operations (INSERT, UPDATE, DELETE, MERGE).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Configure Advanced Settings&lt;/h3&gt;
&lt;p&gt;Set Reflection Refresh and Metadata schedules. For catalogs with frequently changing tables, more frequent metadata refreshes ensure Dremio sees new tables and schema changes quickly.&lt;/p&gt;
&lt;h3&gt;5. Set Privileges and Save&lt;/h3&gt;
&lt;p&gt;Optionally restrict which Dremio users can access this catalog. Click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Query Snowflake Open Catalog Data&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query an Iceberg table managed by Snowflake Open Catalog
SELECT customer_id, customer_name, total_spend, signup_date
FROM &amp;quot;sf-open-catalog&amp;quot;.analytics.customer_summary
WHERE total_spend &amp;gt; 10000 AND signup_date &amp;gt;= &apos;2024-01-01&apos;
ORDER BY total_spend DESC;

-- Write to an external catalog
INSERT INTO &amp;quot;sf-open-catalog&amp;quot;.analytics.monthly_metrics
SELECT
  DATE_TRUNC(&apos;month&apos;, order_date) AS month,
  COUNT(*) AS order_count,
  SUM(total_amount) AS revenue
FROM &amp;quot;sf-open-catalog&amp;quot;.ecommerce.orders
GROUP BY 1;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate with Other Sources&lt;/h2&gt;
&lt;p&gt;Join catalog data with non-Snowflake sources in a single query:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  soc.customer_name,
  soc.total_spend AS catalog_spend,
  pg.region,
  pg.account_manager,
  s3.support_ticket_count,
  CASE
    WHEN soc.total_spend &amp;gt; 100000 AND s3.support_ticket_count &amp;lt; 3 THEN &apos;Platinum&apos;
    WHEN soc.total_spend &amp;gt; 50000 THEN &apos;Gold&apos;
    WHEN soc.total_spend &amp;gt; 10000 THEN &apos;Silver&apos;
    ELSE &apos;Standard&apos;
  END AS customer_tier
FROM &amp;quot;sf-open-catalog&amp;quot;.analytics.customer_summary soc
JOIN &amp;quot;postgres-crm&amp;quot;.public.customers pg ON soc.customer_id = pg.customer_id
LEFT JOIN &amp;quot;s3-support&amp;quot;.tickets.customer_tickets s3 ON soc.customer_id = s3.customer_id
ORDER BY catalog_spend DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;p&gt;Create views that combine catalog data with business logic:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.customer_health AS
SELECT
  soc.customer_id,
  soc.customer_name,
  soc.total_spend,
  soc.signup_date,
  CASE
    WHEN soc.total_spend &amp;gt; 100000 THEN &apos;Enterprise&apos;
    WHEN soc.total_spend &amp;gt; 25000 THEN &apos;Mid-Market&apos;
    ELSE &apos;SMB&apos;
  END AS customer_segment,
  ROUND(soc.total_spend / GREATEST(DATEDIFF(&apos;MONTH&apos;, soc.signup_date, CURRENT_DATE), 1), 2) AS monthly_spend_rate
FROM &amp;quot;sf-open-catalog&amp;quot;.analytics.customer_summary soc;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Navigate to the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; (pencil icon) on this view, and &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;. This creates the business context that powers Dremio&apos;s AI features.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent lets users ask questions in plain English. For example: &amp;quot;Who are our highest-spending enterprise customers?&amp;quot; The Agent reads your wiki descriptions and view definitions to generate the correct SQL. Better wikis produce better results : describe what &amp;quot;enterprise customer&amp;quot; and &amp;quot;monthly spend rate&amp;quot; mean in business terms.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; extends AI capabilities to Claude, ChatGPT, and other AI chat clients. Connect through the hosted MCP Server:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth application in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client (e.g., &lt;code&gt;https://claude.ai/api/mcp/auth_callback&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Your team can then ask Claude &amp;quot;Show me customer health trends from our Snowflake catalog data&amp;quot; and get governed, accurate results without writing SQL.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;p&gt;Enrich catalog data with AI inline in your queries:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  customer_name,
  total_spend,
  AI_CLASSIFY(
    &apos;Based on spending patterns, classify customer risk of churn&apos;,
    &apos;Customer: &apos; || customer_name || &apos;, Total Spend: $&apos; || CAST(total_spend AS VARCHAR) || &apos;, Months Active: &apos; || CAST(months_active AS VARCHAR),
    ARRAY[&apos;Low Risk&apos;, &apos;Moderate Risk&apos;, &apos;High Risk&apos;, &apos;Critical&apos;]
  ) AS churn_risk
FROM &amp;quot;sf-open-catalog&amp;quot;.analytics.customer_summary
WHERE total_spend &amp;gt; 5000;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;AI_CLASSIFY&lt;/code&gt; runs LLM inference in your SQL query. &lt;code&gt;AI_GENERATE&lt;/code&gt; produces narrative summaries, and &lt;code&gt;AI_SIMILARITY&lt;/code&gt; finds semantic matches between text fields.&lt;/p&gt;
&lt;h2&gt;Reflections for Performance&lt;/h2&gt;
&lt;p&gt;Create Reflections on frequently queried views to cache results:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, select the view and click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; (full cache) or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt; (pre-computed metrics)&lt;/li&gt;
&lt;li&gt;Select columns and aggregations to include&lt;/li&gt;
&lt;li&gt;Set the &lt;strong&gt;Refresh Interval&lt;/strong&gt; : balance freshness against compute cost&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;BI tools connected via Arrow Flight or ODBC get sub-second responses from Reflections instead of re-reading Iceberg files from storage. This reduces Snowflake credit consumption for workloads routed through Dremio.&lt;/p&gt;
&lt;h2&gt;Governance Across Snowflake Open Catalog and Other Sources&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) adds governance that spans Snowflake Open Catalog and all other sources:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask sensitive customer data from specific roles. A marketing analyst sees spending behavior but not PII.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Regional users see only their region&apos;s data automatically.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; One set of governance rules applies across Snowflake Open Catalog, database connectors, and other external catalogs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, and MCP Server.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for high-speed programmatic access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; adapter for transformation workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query Snowflake Open Catalog data from their IDE. Ask Copilot &amp;quot;Show me customer churn risk from the catalog&amp;quot; and get SQL generated using your semantic layer : without switching tools.&lt;/p&gt;
&lt;h2&gt;When to Use Snowflake Open Catalog vs. Other Catalogs&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Use Snowflake Open Catalog when:&lt;/strong&gt; You&apos;re already in the Snowflake ecosystem and want multi-engine Iceberg access, your team uses Snowflake for data management but needs Dremio for federation and AI.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use AWS Glue when:&lt;/strong&gt; You&apos;re AWS-native and want tight integration with EMR, Athena, and S3.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use Dremio&apos;s Open Catalog when:&lt;/strong&gt; You want zero-configuration automatic maintenance, Autonomous Reflections, and no external catalog dependencies.&lt;/p&gt;
&lt;p&gt;You can connect multiple catalogs simultaneously. Many organizations use Snowflake Open Catalog for shared enterprise data and Dremio&apos;s Open Catalog for Dremio-specific analytical workloads.&lt;/p&gt;
&lt;h2&gt;Credential Vending in Detail&lt;/h2&gt;
&lt;p&gt;Credential vending is a key feature of Snowflake Open Catalog that simplifies Dremio&apos;s access to underlying storage. Here&apos;s how it works:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;You configure storage in Snowflake Open Catalog&lt;/strong&gt; : specify the S3, Azure, or GCS bucket where Iceberg data files live.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;When Dremio queries a table&lt;/strong&gt;, it requests access from the catalog API.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snowflake Open Catalog returns temporary, scoped credentials&lt;/strong&gt; : short-lived tokens with permissions limited to the specific data files needed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio uses these credentials&lt;/strong&gt; to read (or write, for external catalogs) directly from storage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Credentials expire automatically&lt;/strong&gt; : no long-lived keys to rotate or manage.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This means your Dremio Cloud connection needs only the catalog API credentials (OAuth), not separate storage credentials for every S3 bucket or Azure container. One connection, automatic credential management, reduced security surface area.&lt;/p&gt;
&lt;h2&gt;Multi-Engine Architecture with Snowflake Open Catalog&lt;/h2&gt;
&lt;p&gt;Snowflake Open Catalog enables a powerful multi-engine architecture:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Snowflake:&lt;/strong&gt; Data engineering, SQL analytics, and catalog management&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio:&lt;/strong&gt; Federation, AI analytics, and Reflection-based BI serving&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Spark:&lt;/strong&gt; Large-scale data processing and ML model training&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trino/Presto:&lt;/strong&gt; Ad-hoc query engine for open-source workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All engines read from the same Iceberg tables managed by Snowflake Open Catalog : no data duplication, no metadata sync issues, no format conversion. Each engine reads the latest table metadata from the catalog and accesses data files via credential vending.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s unique contribution to this architecture is federation (joining catalog tables with non-Iceberg sources), AI capabilities (Agent, MCP, SQL Functions), and Reflections (sub-second BI serving without re-reading storage).&lt;/p&gt;
&lt;h2&gt;Snowflake Open Catalog vs. Apache Polaris&lt;/h2&gt;
&lt;p&gt;Snowflake Open Catalog is based on the open-source Apache Polaris (incubating) project. Key differences:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Snowflake Open Catalog&lt;/th&gt;
&lt;th&gt;Apache Polaris (self-managed)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hosting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managed by Snowflake&lt;/td&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Credential Vending&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Requires configuration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Authentication&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Snowflake OAuth&lt;/td&gt;
&lt;td&gt;Custom&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Snowflake support&lt;/td&gt;
&lt;td&gt;Community&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Snowflake pricing&lt;/td&gt;
&lt;td&gt;Infrastructure costs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you use Snowflake&apos;s managed offering, you get turnkey catalog management. If you prefer self-managed, Apache Polaris works with Dremio&apos;s Iceberg REST Catalog connector.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Snowflake Open Catalog users can build a truly multi-engine lakehouse : manage Iceberg metadata in Snowflake&apos;s infrastructure while querying with Dremio&apos;s federated engine, AI capabilities, and Reflection-based acceleration.&lt;/p&gt;
&lt;p&gt;Connect your Snowflake Open Catalog to Dremio Cloud, build views over your Iceberg tables, and start leveraging AI Agent, MCP Server, and Reflections for cost-optimized analytical serving. The setup takes minutes and works immediately.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-snowflake-open-catalog-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your Snowflake Open Catalog.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Databricks Unity Catalog to Dremio Cloud: Query Delta Lake Tables with Federation and AI</title><link>https://iceberglakehouse.com/posts/2026-03-connector-unity-catalog/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-unity-catalog/</guid><description>
Databricks Unity Catalog is Databricks&apos; governance layer for data and AI assets. It manages Delta Lake tables, machine learning models, feature store...</description><pubDate>Sun, 01 Mar 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Databricks Unity Catalog is Databricks&apos; governance layer for data and AI assets. It manages Delta Lake tables, machine learning models, feature stores, and other data objects across Databricks workspaces. If your data engineering team uses Databricks for ETL and ML, your curated analytical datasets likely live in Unity Catalog as Delta Lake tables.&lt;/p&gt;
&lt;p&gt;With UniForm, Databricks generates Iceberg-compatible metadata for Delta Lake tables, making them readable by non-Databricks engines without data conversion. This is where Dremio Cloud enters the picture: connect to Unity Catalog through the UniForm Iceberg compatibility layer and query your Delta Lake tables alongside every other data source in your organization : with federation, governance, AI analytics, and performance acceleration that Databricks alone doesn&apos;t provide.&lt;/p&gt;
&lt;h2&gt;Why Unity Catalog Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Multi-Engine Analytics Beyond Databricks&lt;/h3&gt;
&lt;p&gt;Unity Catalog centralizes governance for Databricks. But your data consumers use tools beyond Databricks notebooks : Tableau, Power BI, custom Python applications, and business analysts who work in SQL. Dremio provides a high-performance SQL layer that serves all these tools via Arrow Flight (10-100x faster than JDBC/ODBC) or standard ODBC connections.&lt;/p&gt;
&lt;p&gt;Instead of provisioning Databricks SQL warehouses for BI workloads (which consume Databricks Units), route those queries through Dremio where Reflections cache results and Autonomous Reflections automatically optimize query performance.&lt;/p&gt;
&lt;h3&gt;Federate with Non-Databricks Sources&lt;/h3&gt;
&lt;p&gt;Your Delta Lake tables in Unity Catalog contain curated, processed analytics data. But your operational databases (PostgreSQL, SQL Server, Oracle) live outside Databricks. Your cloud warehouses (Snowflake, Redshift) hold other analytical datasets. Your raw files (S3, Azure Storage) contain event logs and unstructured data. Without a federation layer, combining these with Delta Lake data requires Databricks ingestion pipelines for each source.&lt;/p&gt;
&lt;p&gt;Dremio queries each source in place and joins them in a single SQL statement : no ingestion required.&lt;/p&gt;
&lt;h3&gt;Unified Governance Beyond Databricks&lt;/h3&gt;
&lt;p&gt;Unity Catalog governs data within Databricks. Dremio&apos;s Fine-Grained Access Control (FGAC) governs data across Unity Catalog, PostgreSQL, S3, BigQuery, and every other connected source. One set of column masking and row-level filtering policies, applied consistently everywhere.&lt;/p&gt;
&lt;h3&gt;AI Analytics on Delta Lake Data&lt;/h3&gt;
&lt;p&gt;Dremio&apos;s semantic layer, AI Agent, MCP Server, and AI SQL Functions add capabilities that Databricks&apos; Genie doesn&apos;t replicate : particularly for cross-source analytics and integration with external AI clients like Claude and ChatGPT.&lt;/p&gt;
&lt;h3&gt;Credential Vending&lt;/h3&gt;
&lt;p&gt;Unity Catalog supports credential vending across AWS, Azure, and GCS. This means Dremio doesn&apos;t need separate S3 or Azure Storage credentials to access the underlying data files : the catalog provides temporary, scoped credentials automatically.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Databricks workspace URL&lt;/strong&gt; : your Databricks deployment URL (e.g., &lt;code&gt;https://mycompany.cloud.databricks.com&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Personal Access Token (PAT)&lt;/strong&gt; or OAuth credentials for Databricks&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;UniForm enabled&lt;/strong&gt; on the Delta Lake tables you want to query (this generates Iceberg-compatible metadata)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage configuration&lt;/strong&gt; : AWS, Azure, or GCS (credential vending handles this if configured)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; : &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-unity-catalog-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Enabling UniForm on Delta Lake Tables&lt;/h3&gt;
&lt;p&gt;To make Delta Lake tables readable from Dremio, enable UniForm in Databricks:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- In Databricks, enable UniForm when creating a table
CREATE TABLE my_catalog.my_schema.my_table (
  id BIGINT,
  name STRING,
  value DOUBLE
) TBLPROPERTIES (
  &apos;delta.universalFormat.enabledFormats&apos; = &apos;iceberg&apos;
);

-- Or alter an existing table
ALTER TABLE my_catalog.my_schema.my_table SET TBLPROPERTIES (
  &apos;delta.universalFormat.enabledFormats&apos; = &apos;iceberg&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step-by-Step: Connect Unity Catalog to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;In the Dremio console, click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; and select &lt;strong&gt;Unity Catalog&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection Details&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; A descriptive identifier (e.g., &lt;code&gt;unity-catalog&lt;/code&gt; or &lt;code&gt;databricks-lakehouse&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Workspace URL:&lt;/strong&gt; Your Databricks workspace URL.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Credentials:&lt;/strong&gt; Personal Access Token or OAuth credentials.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Select Catalogs and Schemas&lt;/h3&gt;
&lt;p&gt;Choose which Unity Catalog catalogs and schemas to expose in Dremio. Only tables with UniForm enabled will be readable.&lt;/p&gt;
&lt;h3&gt;4. Configure Advanced Settings&lt;/h3&gt;
&lt;p&gt;Set Reflection Refresh and Metadata schedules. More frequent metadata refreshes help Dremio discover new tables and schema changes faster.&lt;/p&gt;
&lt;h3&gt;5. Set Privileges and Save&lt;/h3&gt;
&lt;p&gt;Optionally restrict access, then click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Query Delta Lake Tables via UniForm&lt;/h2&gt;
&lt;p&gt;From Dremio&apos;s perspective, UniForm tables appear as standard Iceberg tables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query ML model predictions
SELECT
  customer_id,
  churn_probability,
  predicted_ltv,
  prediction_date
FROM &amp;quot;unity-catalog&amp;quot;.ml_models.customer_predictions
WHERE churn_probability &amp;gt; 0.7 AND prediction_date &amp;gt;= &apos;2024-06-01&apos;
ORDER BY churn_probability DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate with Non-Databricks Sources&lt;/h2&gt;
&lt;p&gt;Join Delta Lake model outputs with operational data from other systems:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Combine ML predictions with CRM data and support logs
SELECT
  uc.customer_id,
  uc.churn_probability,
  uc.predicted_ltv,
  pg.customer_name,
  pg.contract_end_date,
  pg.account_manager,
  s3.last_login_date,
  s3.support_tickets_30d,
  CASE
    WHEN uc.churn_probability &amp;gt; 0.8 AND pg.contract_end_date &amp;lt; CURRENT_DATE + INTERVAL &apos;90&apos; DAY THEN &apos;Critical - Immediate Action&apos;
    WHEN uc.churn_probability &amp;gt; 0.7 THEN &apos;High Risk - Outreach Needed&apos;
    WHEN uc.churn_probability &amp;gt; 0.5 THEN &apos;Watch List&apos;
    ELSE &apos;Healthy&apos;
  END AS action_required
FROM &amp;quot;unity-catalog&amp;quot;.ml_models.customer_predictions uc
JOIN &amp;quot;postgres-crm&amp;quot;.public.customers pg ON uc.customer_id = pg.customer_id
LEFT JOIN &amp;quot;s3-logs&amp;quot;.activity.user_activity s3 ON uc.customer_id = s3.user_id
WHERE uc.churn_probability &amp;gt; 0.5
ORDER BY uc.churn_probability DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Three data systems (Databricks, PostgreSQL, S3), one query, and actionable churn intervention recommendations.&lt;/p&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.customer_risk_dashboard AS
SELECT
  uc.customer_id,
  pg.customer_name,
  pg.region,
  uc.churn_probability,
  uc.predicted_ltv,
  CASE
    WHEN uc.predicted_ltv &amp;gt; 100000 THEN &apos;Enterprise&apos;
    WHEN uc.predicted_ltv &amp;gt; 25000 THEN &apos;Mid-Market&apos;
    ELSE &apos;SMB&apos;
  END AS value_segment,
  CASE
    WHEN uc.churn_probability &amp;gt; 0.7 THEN &apos;High Risk&apos;
    WHEN uc.churn_probability &amp;gt; 0.4 THEN &apos;Moderate Risk&apos;
    ELSE &apos;Low Risk&apos;
  END AS risk_tier
FROM &amp;quot;unity-catalog&amp;quot;.ml_models.customer_predictions uc
JOIN &amp;quot;postgres-crm&amp;quot;.public.customers pg ON uc.customer_id = pg.customer_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Navigate to the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; (pencil icon), and &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;. This creates business descriptions like &amp;quot;customer_risk_dashboard: Contains one row per customer combining ML churn predictions from Databricks with CRM account details&amp;quot; : context that powers AI features.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on Delta Lake Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent lets business users ask questions in plain English: &amp;quot;Which enterprise customers are at high risk of churning?&amp;quot; or &amp;quot;Show me our top 10 customers by predicted lifetime value.&amp;quot; The Agent reads your wiki descriptions to understand what &amp;quot;enterprise,&amp;quot; &amp;quot;high risk,&amp;quot; and &amp;quot;lifetime value&amp;quot; mean in your data context, then generates accurate SQL.&lt;/p&gt;
&lt;p&gt;This is particularly powerful for Delta Lake data because model outputs (churn scores, predictions) often need business interpretation. The semantic layer bridges the gap between ML model outputs and business-friendly analytics.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; connects Claude, ChatGPT, and other AI clients to your Dremio data. Setup:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth application in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs (e.g., &lt;code&gt;https://claude.ai/api/mcp/auth_callback&lt;/code&gt; for Claude, &lt;code&gt;https://chatgpt.com/connector_platform_oauth_redirect&lt;/code&gt; for ChatGPT)&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt; (US) or &lt;code&gt;mcp.eu.dremio.cloud/mcp/{project_id}&lt;/code&gt; (EU)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A customer success manager can ask Claude &amp;quot;Show me all high-risk enterprise customers with contracts ending in the next 90 days&amp;quot; and get accurate, governed results from your Unity Catalog ML predictions : without knowing SQL.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;p&gt;Use AI directly in queries against Unity Catalog data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Generate personalized retention messages for at-risk customers
SELECT
  customer_name,
  churn_probability,
  predicted_ltv,
  AI_GENERATE(
    &apos;Write a one-sentence personalized retention offer for this at-risk customer&apos;,
    &apos;Customer: &apos; || customer_name || &apos;, Segment: &apos; || value_segment || &apos;, Risk: &apos; || risk_tier || &apos;, LTV: $&apos; || CAST(predicted_ltv AS VARCHAR)
  ) AS retention_message
FROM analytics.gold.customer_risk_dashboard
WHERE risk_tier = &apos;High Risk&apos; AND value_segment = &apos;Enterprise&apos;;

-- Classify intervention urgency
SELECT
  customer_name,
  AI_CLASSIFY(
    &apos;Based on these risk factors, classify the intervention urgency&apos;,
    &apos;Churn probability: &apos; || CAST(churn_probability AS VARCHAR) || &apos;, LTV: $&apos; || CAST(predicted_ltv AS VARCHAR) || &apos;, Segment: &apos; || value_segment,
    ARRAY[&apos;Immediate&apos;, &apos;This Week&apos;, &apos;This Month&apos;, &apos;Monitor&apos;]
  ) AS urgency
FROM analytics.gold.customer_risk_dashboard
WHERE risk_tier IN (&apos;High Risk&apos;, &apos;Moderate Risk&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Important Notes&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Read-only access.&lt;/strong&gt; Dremio connects to Unity Catalog tables in read-only mode. Write operations continue through Databricks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;UniForm required.&lt;/strong&gt; Only Delta Lake tables with UniForm enabled appear as queryable Iceberg tables in Dremio.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Table format transparency.&lt;/strong&gt; From Dremio&apos;s perspective, UniForm tables look and behave like standard Iceberg tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Credential vending.&lt;/strong&gt; When configured, Dremio receives temporary credentials from Unity Catalog, simplifying storage access.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Reflections for Performance&lt;/h2&gt;
&lt;p&gt;Create Reflections on Unity Catalog views to cache results and serve dashboard queries without re-reading Delta Lake files:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the view in the &lt;strong&gt;Catalog&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Select columns and aggregations&lt;/li&gt;
&lt;li&gt;Set the &lt;strong&gt;Refresh Interval&lt;/strong&gt; : balance between data freshness and compute cost&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;BI tools connected to Dremio get sub-second response times from Reflections. This eliminates the need for Databricks SQL warehouses for read-heavy BI workloads.&lt;/p&gt;
&lt;h2&gt;Governance Across Unity Catalog and Other Sources&lt;/h2&gt;
&lt;p&gt;Unity Catalog governs data within Databricks. Dremio&apos;s Fine-Grained Access Control (FGAC) extends governance across Unity Catalog and every other connected source:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask churn probability or predicted LTV from specific roles. A sales rep sees risk tier but not raw probability scores.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Regional account managers see only customers in their territory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same rules apply across Unity Catalog, PostgreSQL, S3, and all other sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, and MCP Server.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC. For Delta Lake data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector : avoids Databricks SQL warehouse costs for BI&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic access to ML model outputs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; for transformations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot lets developers query Unity Catalog data from their IDE. Ask Copilot &amp;quot;Show me high-risk enterprise customers from the churn model&amp;quot; and get SQL from your semantic layer : without switching to Databricks notebooks or SQL warehouses.&lt;/p&gt;
&lt;h2&gt;When to Use Dremio vs. Databricks SQL&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Use Dremio when:&lt;/strong&gt; You need cross-source federation (joining Delta Lake with PostgreSQL, S3, Snowflake), you want AI analytics on federated data, you need governance across multiple data sources, you want Reflection-based caching for BI tools.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use Databricks SQL when:&lt;/strong&gt; You need write-heavy workloads on Delta Lake, you&apos;re running Databricks-native jobs (streaming, ML training), your queries use Databricks-specific SQL extensions.&lt;/p&gt;
&lt;p&gt;Both can coexist : Databricks for data engineering and ML, Dremio for federated analytics, AI, and BI serving.&lt;/p&gt;
&lt;h2&gt;Delta Lake Tables in Dremio&lt;/h2&gt;
&lt;p&gt;Dremio reads Delta Lake tables from Unity Catalog with full Delta protocol support:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Time travel:&lt;/strong&gt; Query tables at specific versions using Delta Lake&apos;s transaction log&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema evolution:&lt;/strong&gt; Dremio automatically detects schema changes made by Databricks jobs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition pruning:&lt;/strong&gt; Dremio leverages Delta Lake&apos;s partition statistics to skip irrelevant data files&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Column statistics:&lt;/strong&gt; Delta Lake&apos;s min/max statistics enable efficient predicate pushdown&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For write operations, use Dremio&apos;s Open Catalog with Iceberg tables for new analytical workloads. Unity Catalog remains the source of truth for Databricks-managed Delta Lake tables.&lt;/p&gt;
&lt;h2&gt;Databricks Cost Optimization&lt;/h2&gt;
&lt;p&gt;Databricks pricing is based on Databricks Units (DBUs) consumed by SQL warehouses, clusters, and jobs. Dremio helps optimize costs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;BI serving:&lt;/strong&gt; Instead of running a Databricks SQL warehouse 24/7 for dashboards, create Reflections in Dremio. Dashboard queries hit Dremio, SQL warehouse auto-stops.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ad-hoc exploration:&lt;/strong&gt; Analysts query Dremio&apos;s cached Reflections instead of waking Databricks clusters. Less start/stop overhead.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cross-source queries:&lt;/strong&gt; Joining Delta Lake with PostgreSQL or S3 doesn&apos;t require moving all data into Databricks : Dremio federates in place.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For organizations spending $50K+/month on Databricks, routing read-heavy analytical workloads through Dremio can reduce DBU consumption by 30-50% on those workloads.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Unity Catalog users can extend their Databricks investment with Dremio&apos;s federation, AI analytics, and performance acceleration : without moving data out of Delta Lake. Dremio and Databricks are complementary: Databricks handles data engineering, ML training, and streaming workloads on Delta Lake tables, while Dremio serves analytical queries, BI dashboards, and AI-powered natural language access across your entire data estate.&lt;/p&gt;
&lt;p&gt;Connect your Unity Catalog to Dremio Cloud, build Reflections on frequently queried tables, and enable the AI Agent for business users who need answers without writing SQL.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-unity-catalog-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your Unity Catalog.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Any Iceberg REST Catalog to Dremio Cloud: Universal Lakehouse Access</title><link>https://iceberglakehouse.com/posts/2026-03-connector-iceberg-rest-catalog/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-iceberg-rest-catalog/</guid><description>
The Apache Iceberg REST Catalog specification defines a standard HTTP API for managing Iceberg table metadata. Any catalog implementation that confor...</description><pubDate>Sun, 01 Mar 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The Apache Iceberg REST Catalog specification defines a standard HTTP API for managing Iceberg table metadata. Any catalog implementation that conforms to this specification : Apache Polaris, Amazon S3 Tables, Confluent Tableflow, Tabular, Apache Gravitino, and custom-built services , can connect to Dremio Cloud through a single connector type.&lt;/p&gt;
&lt;p&gt;This is the most flexible catalog connector Dremio offers. Instead of needing a purpose-built connector for every catalog vendor, the Iceberg REST Catalog connector works with any compliant implementation. As new catalogs emerge : and they&apos;re emerging rapidly in the open lakehouse ecosystem , this connector ensures Dremio supports them from day one.&lt;/p&gt;
&lt;p&gt;The Iceberg REST specification is becoming the universal standard for lakehouse catalog interoperability. AWS launched Amazon S3 Tables (a fully managed Iceberg catalog with REST API) in late 2024, Confluent released Tableflow for streaming-to-Iceberg ingestion, and Apache Gravitino provides multi-catalog governance. All of these work with Dremio&apos;s REST Catalog connector without any Dremio-side code changes.&lt;/p&gt;
&lt;h3&gt;Credential Vending Advantage&lt;/h3&gt;
&lt;p&gt;Many REST catalogs support credential vending : the ability to issue temporary, scoped storage credentials to clients. When configured, Dremio receives short-lived tokens that grant access only to the specific data files needed for a query. This eliminates the need to store long-lived S3 access keys or Azure storage keys in Dremio&apos;s connection configuration, significantly reducing the security surface area. One REST catalog connection replaces what would otherwise require separate storage credentials for every S3 bucket, Azure container, or GCS bucket containing your Iceberg tables.&lt;/p&gt;
&lt;h2&gt;Why Iceberg REST Catalog Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Universal Compatibility&lt;/h3&gt;
&lt;p&gt;The Iceberg REST Catalog connector works with any catalog implementation that conforms to the Iceberg REST API spec. This includes:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Catalog&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Credential Vending&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Apache Polaris&lt;/td&gt;
&lt;td&gt;Open source&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon S3 Tables&lt;/td&gt;
&lt;td&gt;AWS managed&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Confluent Tableflow&lt;/td&gt;
&lt;td&gt;Confluent managed&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tabular&lt;/td&gt;
&lt;td&gt;SaaS&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apache Gravitino&lt;/td&gt;
&lt;td&gt;Open source&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom REST implementations&lt;/td&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;You&apos;re not locked into specific catalog vendors. Deploy Apache Polaris today, consider S3 Tables tomorrow : the same Dremio connector works for both.&lt;/p&gt;
&lt;h3&gt;Read and Write Support&lt;/h3&gt;
&lt;p&gt;Dremio supports full DML (INSERT, UPDATE, DELETE, MERGE) on Iceberg tables managed by REST catalogs. You can create tables, run transformations, build data pipelines, and maintain your lakehouse entirely through Dremio&apos;s SQL engine. No need for separate Spark clusters or ETL jobs for routine operations.&lt;/p&gt;
&lt;h3&gt;Multi-Catalog Federation&lt;/h3&gt;
&lt;p&gt;Connect multiple REST catalogs alongside databases (PostgreSQL, MySQL, Oracle), object storage (S3, Azure), cloud warehouses (Snowflake, BigQuery), and other catalogs (Glue, Unity) : then query across all of them in a single SQL statement.&lt;/p&gt;
&lt;h3&gt;Automated Iceberg Maintenance&lt;/h3&gt;
&lt;p&gt;Dremio automatically compacts small files, rewrites manifests for faster metadata reads, and clusters data based on query patterns : even for tables managed by external REST catalogs.&lt;/p&gt;
&lt;h3&gt;Multiple Authentication Methods&lt;/h3&gt;
&lt;p&gt;The connector supports Bearer Token, OAuth 2.0 (client credentials flow), and custom authentication headers, accommodating the security requirements of different catalog implementations.&lt;/p&gt;
&lt;h3&gt;Flexible Storage Credential Management&lt;/h3&gt;
&lt;p&gt;Some REST catalogs vend temporary storage credentials (short-lived S3/Azure/GCS tokens) for reading and writing data files. Dremio supports credential vending where available. When a catalog doesn&apos;t vend credentials, you can configure storage access directly in Dremio.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;REST Catalog endpoint URL&lt;/strong&gt; : the base URL of the catalog API (e.g., &lt;code&gt;https://my-polaris.example.com/api/catalog&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authentication credentials&lt;/strong&gt; : Bearer token, OAuth client ID/secret, or custom headers depending on the catalog&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage access&lt;/strong&gt; : either through credential vending (catalog provides temporary tokens) or direct storage credentials (S3, Azure, GCS)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; : &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-iceberg-rest-catalog-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;Click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; and select &lt;strong&gt;Iceberg REST Catalog&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection Details&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; Descriptive identifier (e.g., &lt;code&gt;polaris-catalog&lt;/code&gt; or &lt;code&gt;s3-tables&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Catalog Endpoint URL:&lt;/strong&gt; The base URL for the REST API.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Choose from:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Bearer Token:&lt;/strong&gt; For token-based authentication (e.g., PAT tokens).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OAuth 2.0:&lt;/strong&gt; Client ID and client secret for OAuth client credentials flow.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;None:&lt;/strong&gt; For catalogs that use other authentication methods (configured via custom headers).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Configure Storage&lt;/h3&gt;
&lt;p&gt;If credential vending is supported, Dremio receives temporary credentials automatically. Otherwise, configure S3 (access key/secret or IAM role), Azure (shared key or service principal), or GCS (service account key) credentials.&lt;/p&gt;
&lt;h3&gt;5. Advanced Settings&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Custom Headers:&lt;/strong&gt; Additional HTTP headers required by the catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Query Parameters:&lt;/strong&gt; URL parameters appended to catalog API requests.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Catalog-specific properties:&lt;/strong&gt; Key-value pairs for vendor-specific configuration.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;6. Set Reflection and Metadata Refresh, then Save&lt;/h3&gt;
&lt;h2&gt;Query and Write to REST Catalog Tables&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query Iceberg tables from a REST catalog
SELECT order_id, customer_id, order_total, order_date
FROM &amp;quot;rest-catalog&amp;quot;.ecommerce.orders
WHERE order_date &amp;gt;= &apos;2024-01-01&apos; AND order_total &amp;gt; 100
ORDER BY order_total DESC;

-- Write to the catalog
INSERT INTO &amp;quot;rest-catalog&amp;quot;.analytics.daily_summary
SELECT
  DATE_TRUNC(&apos;day&apos;, order_date) AS day,
  COUNT(*) AS order_count,
  SUM(order_total) AS revenue,
  AVG(order_total) AS avg_order_value
FROM &amp;quot;rest-catalog&amp;quot;.ecommerce.orders
WHERE order_date = CURRENT_DATE - INTERVAL &apos;1&apos; DAY
GROUP BY 1;

-- MERGE for upserts
MERGE INTO &amp;quot;rest-catalog&amp;quot;.analytics.customer_metrics AS target
USING (
  SELECT customer_id, COUNT(*) AS orders, SUM(order_total) AS total_spent
  FROM &amp;quot;rest-catalog&amp;quot;.ecommerce.orders
  WHERE order_date &amp;gt;= CURRENT_DATE - INTERVAL &apos;30&apos; DAY
  GROUP BY customer_id
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET orders = source.orders, total_spent = source.total_spent
WHEN NOT MATCHED THEN INSERT (customer_id, orders, total_spent) VALUES (source.customer_id, source.orders, source.total_spent);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate with Other Sources&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join REST catalog data with PostgreSQL and S3
SELECT
  rc.order_id,
  rc.order_total,
  pg.customer_name,
  pg.region,
  s3.support_tickets
FROM &amp;quot;rest-catalog&amp;quot;.ecommerce.orders rc
JOIN &amp;quot;postgres-crm&amp;quot;.public.customers pg ON rc.customer_id = pg.customer_id
LEFT JOIN &amp;quot;s3-support&amp;quot;.tickets.customer_counts s3 ON rc.customer_id = s3.customer_id
WHERE rc.order_total &amp;gt; 500
ORDER BY rc.order_total DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.customer_value AS
SELECT
  rc.customer_id,
  pg.customer_name,
  pg.region,
  SUM(rc.order_total) AS lifetime_value,
  COUNT(*) AS total_orders,
  ROUND(AVG(rc.order_total), 2) AS avg_order_value,
  CASE
    WHEN SUM(rc.order_total) &amp;gt; 50000 THEN &apos;Platinum&apos;
    WHEN SUM(rc.order_total) &amp;gt; 10000 THEN &apos;Gold&apos;
    WHEN SUM(rc.order_total) &amp;gt; 1000 THEN &apos;Silver&apos;
    ELSE &apos;Bronze&apos;
  END AS value_tier
FROM &amp;quot;rest-catalog&amp;quot;.ecommerce.orders rc
JOIN &amp;quot;postgres-crm&amp;quot;.public.customers pg ON rc.customer_id = pg.customer_id
GROUP BY rc.customer_id, pg.customer_name, pg.region;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; → &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;Ask &amp;quot;Who are our Platinum customers?&amp;quot; and the AI Agent generates SQL from your semantic layer. The wiki descriptions you attached explain what &amp;quot;Platinum&amp;quot; means (lifetime value &amp;gt; $50,000), so the Agent produces accurate results.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;Connect &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Claude or ChatGPT&lt;/a&gt; to your catalog data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth app in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client&lt;/li&gt;
&lt;li&gt;Connect via &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A sales director asks Claude &amp;quot;Show me our top 20 Gold and Platinum customers by lifetime value&amp;quot; and gets governed results from your Iceberg catalog.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Generate personalized engagement plans
SELECT
  customer_name,
  value_tier,
  lifetime_value,
  AI_GENERATE(
    &apos;Write a one-sentence personalized engagement recommendation&apos;,
    &apos;Customer: &apos; || customer_name || &apos;, Tier: &apos; || value_tier || &apos;, LTV: $&apos; || CAST(lifetime_value AS VARCHAR) || &apos;, Orders: &apos; || CAST(total_orders AS VARCHAR &amp;amp;&amp;amp; &apos;, Region: &apos; || region)
  ) AS engagement_plan
FROM analytics.gold.customer_value
WHERE value_tier IN (&apos;Platinum&apos;, &apos;Gold&apos;);

-- Classify churn risk
SELECT
  customer_name,
  AI_CLASSIFY(
    &apos;Based on order patterns, classify churn risk&apos;,
    &apos;Orders: &apos; || CAST(total_orders AS VARCHAR) || &apos;, Avg Order: $&apos; || CAST(avg_order_value AS VARCHAR) || &apos;, LTV: $&apos; || CAST(lifetime_value AS VARCHAR),
    ARRAY[&apos;Low Risk&apos;, &apos;Moderate Risk&apos;, &apos;High Risk&apos;]
  ) AS churn_risk
FROM analytics.gold.customer_value;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Reflections for Performance&lt;/h2&gt;
&lt;p&gt;Create Reflections on views to cache results and serve BI dashboards with sub-second response times.&lt;/p&gt;
&lt;h2&gt;When to Use REST Catalog vs. Other Iceberg Catalogs&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Use REST Catalog when:&lt;/strong&gt; Your organization uses Tabular, Apache Polaris, Gravitino, or another REST-compliant catalog server; you need a vendor-neutral catalog interface; you want portability across different compute engines (Dremio, Spark, Trino, Flink).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use AWS Glue when:&lt;/strong&gt; You&apos;re primarily in the AWS ecosystem and want tight integration with EMR, Athena, and AWS-native tools.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use Dremio&apos;s Open Catalog when:&lt;/strong&gt; You want zero-configuration automatic table maintenance, Autonomous Reflections, and no external catalog setup.&lt;/p&gt;
&lt;p&gt;You can use multiple catalogs simultaneously : for example, REST Catalog for cross-engine shared tables and Dremio&apos;s Open Catalog for Dremio-specific analytical workloads.&lt;/p&gt;
&lt;h2&gt;Governance on REST Catalog Data&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) adds governance that REST catalogs don&apos;t provide:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask sensitive columns from specific roles. A business analyst sees aggregated metrics but not individual customer data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Restrict data by the querying user&apos;s role. Regional users see only their data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies across REST Catalog, database sources, and other catalogs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, and MCP Server.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC for BI tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic data access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; adapter for transformation workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query REST Catalog data from their IDE. Ask Copilot &amp;quot;Show me transaction trends from the Iceberg catalog&amp;quot; and get SQL generated using your semantic layer.&lt;/p&gt;
&lt;h2&gt;REST Catalog Protocol Details&lt;/h2&gt;
&lt;p&gt;The Iceberg REST Catalog protocol is an HTTP-based interface defined by the Apache Iceberg project. Any catalog that implements this protocol works with Dremio&apos;s connector. This includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Apache Polaris (Incubating):&lt;/strong&gt; Open-source REST catalog by Snowflake&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tabular:&lt;/strong&gt; Managed Iceberg catalog service (now part of Databricks)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gravitino:&lt;/strong&gt; Apache-incubating multi-catalog governance platform&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Amazon S3 Tables:&lt;/strong&gt; AWS-managed Iceberg tables with REST API access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom implementations:&lt;/strong&gt; Any service implementing the Iceberg REST spec&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dremio handles authentication through OAuth2 bearer tokens or custom headers, making it compatible with most enterprise authentication systems.&lt;/p&gt;
&lt;h3&gt;REST Catalog Endpoints&lt;/h3&gt;
&lt;p&gt;The Iceberg REST specification defines standard endpoints for catalog operations:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Endpoint&lt;/th&gt;
&lt;th&gt;Dremio Support&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;List namespaces&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GET /v1/namespaces&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;List tables&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GET /v1/namespaces/{ns}/tables&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load table&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GET /v1/namespaces/{ns}/tables/{table}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Create table&lt;/td&gt;
&lt;td&gt;&lt;code&gt;POST /v1/namespaces/{ns}/tables&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Update table&lt;/td&gt;
&lt;td&gt;&lt;code&gt;POST /v1/namespaces/{ns}/tables/{table}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Drop table&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DELETE /v1/namespaces/{ns}/tables/{table}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Get config&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GET /v1/config&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Dremio uses these endpoints to discover tables, read metadata, perform DML operations, and manage table lifecycle : all through standard HTTP.&lt;/p&gt;
&lt;h3&gt;Multi-Catalog Architecture&lt;/h3&gt;
&lt;p&gt;Many organizations run multiple Iceberg catalogs for different purposes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;REST Catalog A (Polaris):&lt;/strong&gt; Shared enterprise data, governed access for all teams&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;REST Catalog B (S3 Tables):&lt;/strong&gt; AWS-native data, auto-managed by AWS&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Open Catalog:&lt;/strong&gt; Dremio-specific analytical workloads with Autonomous Reflections&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AWS Glue:&lt;/strong&gt; Legacy Iceberg tables managed by existing EMR pipelines&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dremio connects to all of them simultaneously and federates across them. Views in the semantic layer can join tables from different catalogs:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.unified_orders AS
SELECT o.order_id, o.order_total, c.customer_name, i.inventory_status
FROM &amp;quot;polaris-catalog&amp;quot;.ecommerce.orders o
JOIN &amp;quot;s3-tables&amp;quot;.customers.profiles c ON o.customer_id = c.customer_id
JOIN &amp;quot;glue-catalog&amp;quot;.warehouse.inventory i ON o.product_id = i.product_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Iceberg REST Catalog users can query, write, federate, and AI-enrich their Iceberg tables through Dremio Cloud : with governance, Reflections, and AI capabilities that no compute engine provides natively. The REST Catalog connector is the most future-proof choice for organizations adopting Iceberg: as new catalog implementations emerge (and the Iceberg ecosystem is expanding rapidly), this single connector supports them all.&lt;/p&gt;
&lt;p&gt;Start by connecting your REST catalog to Dremio Cloud, building a semantic layer over your most important tables, and enabling the AI Agent for natural language querying. The same views and Reflections work regardless of which REST catalog implementation you use : Apache Polaris today, S3 Tables tomorrow, or a custom catalog next year.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-iceberg-rest-catalog-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your REST catalog.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Dremio&apos;s Built-in Open Catalog: Your Zero-Configuration Apache Iceberg Lakehouse</title><link>https://iceberglakehouse.com/posts/2026-03-connector-dremio-open-catalog/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-dremio-open-catalog/</guid><description>
Every Dremio Cloud account starts with a built-in Open Catalog : a fully managed Apache Iceberg catalog with integrated storage. When you create a Dr...</description><pubDate>Sun, 01 Mar 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Every Dremio Cloud account starts with a built-in Open Catalog : a fully managed Apache Iceberg catalog with integrated storage. When you create a Dremio Cloud project, you immediately have a catalog where you can create namespaces (folders), tables, and views without connecting any external sources, configuring storage, or setting up credentials.&lt;/p&gt;
&lt;p&gt;This isn&apos;t a bare-bones starting point. The built-in Open Catalog is a production-grade Iceberg catalog with automated performance management, Autonomous Reflections, time travel, branching, and full DML support. It&apos;s the fastest path from &amp;quot;sign up&amp;quot; to &amp;quot;running analytics.&amp;quot;&lt;/p&gt;
&lt;p&gt;Organizations typically spend days or weeks setting up external catalogs : provisioning S3 buckets, configuring IAM roles, debugging credential chains, and testing connectivity. With the built-in Open Catalog, you skip all of that. Your first &lt;code&gt;CREATE TABLE&lt;/code&gt; runs minutes after account creation.&lt;/p&gt;
&lt;p&gt;The Open Catalog is particularly powerful for teams adopting a lakehouse architecture for the first time. Instead of evaluating AWS Glue, Unity Catalog, and Snowflake Open Catalog (each with different setup complexity, vendor dependencies, and pricing models), start with the built-in catalog. You can always connect external catalogs later and federate across them.&lt;/p&gt;
&lt;h3&gt;Cross-Catalog Federation&lt;/h3&gt;
&lt;p&gt;The Open Catalog works alongside external catalogs. A common architecture:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Open Catalog:&lt;/strong&gt; Dremio-created analytical tables and views (gold layer, semantic layer)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AWS Glue:&lt;/strong&gt; Existing Iceberg tables managed by Spark/EMR pipelines&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PostgreSQL:&lt;/strong&gt; Operational application data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dremio federates across all three in a single SQL query. Views in the Open Catalog can reference tables from any connected source, creating a unified analytical layer that spans your entire data estate.&lt;/p&gt;
&lt;h3&gt;Branching and Tagging&lt;/h3&gt;
&lt;p&gt;The Open Catalog supports Iceberg&apos;s branching and tagging capabilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Branches:&lt;/strong&gt; Create isolated copies of table metadata for development and testing. Changes on a branch don&apos;t affect the main table until merged.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tags:&lt;/strong&gt; Create named snapshots for milestone tracking (e.g., &lt;code&gt;quarterly-report-2024-Q2&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These features enable data engineering workflows where teams can test transformations on branches before promoting changes to production tables.&lt;/p&gt;
&lt;h2&gt;Why Start with the Built-in Open Catalog&lt;/h2&gt;
&lt;h3&gt;Zero Configuration&lt;/h3&gt;
&lt;p&gt;External catalogs (Glue, Unity, Snowflake Open Catalog) require AWS IAM roles, network configuration, credential management, and catalog-specific setup. The built-in Open Catalog requires nothing : it&apos;s already configured when your project is created. Create a folder, write SQL, and start working.&lt;/p&gt;
&lt;h3&gt;Automated Performance Management&lt;/h3&gt;
&lt;p&gt;Dremio automatically manages the performance of Iceberg tables in the built-in catalog:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Auto-compaction:&lt;/strong&gt; Small files are automatically merged into optimally sized files (typically 256MB). This prevents the &amp;quot;small file problem&amp;quot; that degrades query performance over time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manifest rewriting:&lt;/strong&gt; Table manifests are automatically optimized for faster metadata reads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data clustering:&lt;/strong&gt; Dremio sorts data based on query patterns to improve predicate pushdown efficiency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vacuuming:&lt;/strong&gt; Expired snapshots and orphaned data files are automatically cleaned up.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Results caching:&lt;/strong&gt; Query results are cached and served for identical subsequent queries.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;None of these require manual &lt;code&gt;OPTIMIZE&lt;/code&gt; commands or scheduled maintenance jobs. Dremio handles it all in the background.&lt;/p&gt;
&lt;h3&gt;Autonomous Reflections&lt;/h3&gt;
&lt;p&gt;For tables in the built-in catalog, Dremio can automatically create and manage Reflections based on observed query patterns. If a specific view is queried frequently with certain filters and aggregations, Dremio creates a Reflection to accelerate those patterns without any manual configuration. This automated acceleration means your most common queries get faster over time.&lt;/p&gt;
&lt;h3&gt;Time Travel&lt;/h3&gt;
&lt;p&gt;Query any table as it existed at any point in the past:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query table as it was 7 days ago
SELECT * FROM catalog_folder.my_table AT TIMESTAMP &apos;2024-06-01 00:00:00&apos;;

-- Query a specific snapshot
SELECT * FROM catalog_folder.my_table AT SNAPSHOT &apos;1234567890123456789&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Time travel is valuable for auditing (&amp;quot;What did customer balances look like at quarter end?&amp;quot;), debugging (&amp;quot;What changed in the last 24 hours?&amp;quot;), and compliance (&amp;quot;Show me the data as it was on the regulatory reporting date&amp;quot;).&lt;/p&gt;
&lt;h3&gt;Full DML Support&lt;/h3&gt;
&lt;p&gt;The built-in catalog supports all standard DML operations:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- INSERT
INSERT INTO analytics.bronze.events
SELECT event_type, user_id, event_timestamp
FROM &amp;quot;s3-datalake&amp;quot;.events.raw_events
WHERE event_date = CURRENT_DATE - INTERVAL &apos;1&apos; DAY;

-- UPDATE
UPDATE analytics.silver.customers
SET segment = &apos;Enterprise&apos;
WHERE total_spend &amp;gt; 100000;

-- DELETE
DELETE FROM analytics.bronze.events
WHERE event_timestamp &amp;lt; CURRENT_DATE - INTERVAL &apos;365&apos; DAY;

-- MERGE (upsert)
MERGE INTO analytics.silver.customers AS target
USING (
  SELECT customer_id, SUM(amount) AS total_spend
  FROM analytics.bronze.orders
  GROUP BY customer_id
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET total_spend = source.total_spend
WHEN NOT MATCHED THEN INSERT (customer_id, total_spend) VALUES (source.customer_id, source.total_spend);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Getting Started: Create Your First Tables&lt;/h2&gt;
&lt;p&gt;When you query items in the built-in catalog, you don&apos;t include a source name prefix : just the folder path and table/view name:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Create namespace structure
CREATE FOLDER IF NOT EXISTS analytics;
CREATE FOLDER IF NOT EXISTS analytics.bronze;
CREATE FOLDER IF NOT EXISTS analytics.silver;
CREATE FOLDER IF NOT EXISTS analytics.gold;

-- Create a table from an external source
CREATE TABLE analytics.bronze.raw_orders AS
SELECT order_id, customer_id, product_id, quantity, price, order_date
FROM &amp;quot;postgres-orders&amp;quot;.public.orders
WHERE order_date &amp;gt;= &apos;2024-01-01&apos;;

-- Create a transformed table
CREATE TABLE analytics.silver.enriched_orders AS
SELECT
  o.order_id,
  o.customer_id,
  c.customer_name,
  c.region,
  o.product_id,
  p.product_name,
  p.category,
  o.quantity,
  o.price,
  o.quantity * o.price AS total_amount,
  o.order_date
FROM analytics.bronze.raw_orders o
JOIN &amp;quot;postgres-orders&amp;quot;.public.customers c ON o.customer_id = c.customer_id
JOIN &amp;quot;postgres-orders&amp;quot;.public.products p ON o.product_id = p.product_id;

-- Create an analytics view
CREATE VIEW analytics.gold.revenue_summary AS
SELECT
  region,
  category,
  DATE_TRUNC(&apos;month&apos;, order_date) AS month,
  SUM(total_amount) AS revenue,
  COUNT(*) AS orders,
  COUNT(DISTINCT customer_id) AS unique_customers,
  ROUND(SUM(total_amount) / COUNT(DISTINCT customer_id), 2) AS revenue_per_customer
FROM analytics.silver.enriched_orders
GROUP BY region, category, DATE_TRUNC(&apos;month&apos;, order_date);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.product_performance AS
SELECT
  category,
  product_name,
  SUM(total_amount) AS revenue,
  COUNT(*) AS orders,
  CASE
    WHEN SUM(total_amount) &amp;gt; 100000 THEN &apos;Top Performer&apos;
    WHEN SUM(total_amount) &amp;gt; 10000 THEN &apos;Solid&apos;
    ELSE &apos;Emerging&apos;
  END AS performance_tier
FROM analytics.silver.enriched_orders
GROUP BY category, product_name;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; (pencil icon) → &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;. These descriptions power AI features.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent reads your semantic layer to answer questions in plain English: &amp;quot;What&apos;s our top performing product category this quarter?&amp;quot; or &amp;quot;Show me revenue per customer by region.&amp;quot; The wiki descriptions you create tell the Agent what &amp;quot;top performing&amp;quot; and &amp;quot;revenue per customer&amp;quot; mean, generating accurate SQL automatically.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; connects external AI tools to your catalog data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth app in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client (Claude, ChatGPT)&lt;/li&gt;
&lt;li&gt;Connect via &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A VP of Product asks Claude &amp;quot;Compare our product category performance and identify emerging categories with high growth potential&amp;quot; and gets governed, accurate analysis.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Generate product analysis with AI
SELECT
  product_name,
  performance_tier,
  revenue,
  AI_GENERATE(
    &apos;Write a one-sentence growth strategy for this product&apos;,
    &apos;Product: &apos; || product_name || &apos;, Category: &apos; || category || &apos;, Revenue: $&apos; || CAST(revenue AS VARCHAR) || &apos;, Tier: &apos; || performance_tier
  ) AS growth_strategy
FROM analytics.gold.product_performance;

-- Classify products for portfolio management
SELECT
  product_name,
  AI_CLASSIFY(
    &apos;Based on revenue and order volume, classify investment priority&apos;,
    &apos;Revenue: $&apos; || CAST(revenue AS VARCHAR) || &apos;, Orders: &apos; || CAST(orders AS VARCHAR),
    ARRAY[&apos;Strategic Investment&apos;, &apos;Maintain&apos;, &apos;Optimize&apos;, &apos;Sunset&apos;]
  ) AS investment_priority
FROM analytics.gold.product_performance;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Built-in vs. External Catalogs&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Built-in Open Catalog&lt;/th&gt;
&lt;th&gt;External Catalogs (Glue, Unity, etc.)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Setup&lt;/td&gt;
&lt;td&gt;Zero configuration&lt;/td&gt;
&lt;td&gt;Requires IAM, networking, credentials&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto-compaction&lt;/td&gt;
&lt;td&gt;✅ Automatic&lt;/td&gt;
&lt;td&gt;✅ For Iceberg tables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Autonomous Reflections&lt;/td&gt;
&lt;td&gt;✅ Automatic&lt;/td&gt;
&lt;td&gt;Manual Reflections only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time travel&lt;/td&gt;
&lt;td&gt;✅ Full support&lt;/td&gt;
&lt;td&gt;✅ For Iceberg tables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write support&lt;/td&gt;
&lt;td&gt;✅ Full DML&lt;/td&gt;
&lt;td&gt;Varies by catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Credential management&lt;/td&gt;
&lt;td&gt;None needed&lt;/td&gt;
&lt;td&gt;IAM roles or keys required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage costs&lt;/td&gt;
&lt;td&gt;Included in Dremio&lt;/td&gt;
&lt;td&gt;Separate cloud storage costs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The built-in catalog is ideal for getting started, prototyping, and production workloads. External catalogs are valuable when your organization already manages data in Glue, Unity, or Snowflake Open Catalog and wants to query that data through Dremio.&lt;/p&gt;
&lt;h2&gt;Governance in the Open Catalog&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) provides enterprise-grade governance on all Open Catalog data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask sensitive columns (customer PII, financial details) from specific roles. A data analyst sees &lt;code&gt;customer_name&lt;/code&gt; but not &lt;code&gt;social_security_number&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Automatically restrict data visibility based on user roles. A regional manager querying &lt;code&gt;revenue_summary&lt;/code&gt; sees only their region.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; The same governance policies apply across Open Catalog tables, external catalogs, and database sources : one set of rules for all data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent queries, and MCP Server interactions.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Arrow Flight connector provides 10-100x faster data transfer than JDBC/ODBC. After building views in the Open Catalog:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Use the Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Use Dremio&apos;s ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; Use &lt;code&gt;pyarrow.flight&lt;/code&gt; for high-speed programmatic data access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; Use &lt;code&gt;dbt-dremio&lt;/code&gt; adapter for SQL-based transformations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Autonomous Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query Open Catalog data from their IDE. Ask Copilot &amp;quot;Show me this week&apos;s revenue by product category&amp;quot; and it generates SQL using your semantic layer : without switching to the Dremio console.&lt;/p&gt;
&lt;h2&gt;Data Lifecycle Management&lt;/h2&gt;
&lt;p&gt;The Open Catalog supports a complete data lifecycle:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Bronze layer:&lt;/strong&gt; Ingest raw data from external sources using &lt;code&gt;CREATE TABLE ... AS SELECT&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Silver layer:&lt;/strong&gt; Apply transformations, deduplication, and type casting with &lt;code&gt;CREATE TABLE ... AS SELECT&lt;/code&gt; or &lt;code&gt;MERGE&lt;/code&gt; for incremental updates&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gold layer:&lt;/strong&gt; Create analytical views with business logic for the semantic layer&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Archival:&lt;/strong&gt; Use &lt;code&gt;DELETE&lt;/code&gt; with time-based conditions to remove old data; use time travel to access historical snapshots before deletion&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This medallion architecture runs entirely within Dremio : no external ETL tools, Spark clusters, or scheduled scripts needed.&lt;/p&gt;
&lt;h3&gt;Incremental Loading Patterns&lt;/h3&gt;
&lt;p&gt;For ongoing data ingestion, use &lt;code&gt;MERGE&lt;/code&gt; to incrementally update tables without full reloads:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Incremental merge: only update changed records
MERGE INTO analytics.silver.customers AS target
USING (
  SELECT customer_id, customer_name, email, segment, updated_at
  FROM &amp;quot;postgres-crm&amp;quot;.public.customers
  WHERE updated_at &amp;gt; (SELECT MAX(updated_at) FROM analytics.silver.customers)
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET
  customer_name = source.customer_name,
  email = source.email,
  segment = source.segment,
  updated_at = source.updated_at
WHEN NOT MATCHED THEN INSERT (customer_id, customer_name, email, segment, updated_at)
  VALUES (source.customer_id, source.customer_name, source.email, source.segment, source.updated_at);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This pattern transfers only changed records, minimizing network traffic and compute costs.&lt;/p&gt;
&lt;h3&gt;Time Travel Best Practices&lt;/h3&gt;
&lt;p&gt;Time travel is particularly valuable in the Open Catalog for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;End-of-quarter reporting:&lt;/strong&gt; Query tables at exact quarter-end timestamps for regulatory submissions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Debugging data issues:&lt;/strong&gt; Compare current data with a previous snapshot to identify when and what changed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Audit trails:&lt;/strong&gt; Demonstrate data state at any point in time for compliance requirements&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Recovery:&lt;/strong&gt; If a bad &lt;code&gt;UPDATE&lt;/code&gt; or &lt;code&gt;DELETE&lt;/code&gt; corrupts data, query the pre-change snapshot and restore&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Compare current vs 24-hours-ago to find changed records
SELECT current_data.customer_id, current_data.segment AS new_segment, old_data.segment AS old_segment
FROM analytics.silver.customers current_data
JOIN analytics.silver.customers AT TIMESTAMP &apos;2024-06-14 00:00:00&apos; old_data
  ON current_data.customer_id = old_data.customer_id
WHERE current_data.segment &amp;lt;&amp;gt; old_data.segment;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Every Dremio Cloud account includes the Open Catalog, ready to go. No setup, no configuration, no external dependencies. Create your first table in under a minute.&lt;/p&gt;
&lt;p&gt;The Open Catalog isn&apos;t just for prototyping : it&apos;s production-grade from day one. Organizations run terabyte-scale analytical workloads on the built-in catalog with automated performance management handling compaction, vacuuming, and Reflection optimization in the background. Start small with a few tables, then scale to hundreds of tables and dozens of users as your lakehouse grows. The same zero-configuration promise holds at scale.&lt;/p&gt;
&lt;p&gt;For teams new to the lakehouse concept, the Open Catalog is the lowest-friction entry point available. Data engineers familiar with SQL can build a complete medallion architecture (bronze → silver → gold) in a single afternoon, with AI capabilities and governance ready to activate immediately.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-dremio-open-catalog-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Sign up for Dremio Cloud free for 30 days&lt;/a&gt; and start building your lakehouse immediately.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Dremio Software to Dremio Cloud: Hybrid Federation Across Deployments</title><link>https://iceberglakehouse.com/posts/2026-03-connector-dremio-to-dremio/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-dremio-to-dremio/</guid><description>
Dremio Cloud can connect to Dremio Software (self-managed) instances as a federated data source. This creates a hybrid deployment where Dremio Cloud ...</description><pubDate>Sun, 01 Mar 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Dremio Cloud can connect to Dremio Software (self-managed) instances as a federated data source. This creates a hybrid deployment where Dremio Cloud serves as the primary query interface while accessing datasets managed by Dremio Software instances running in your own data centers or private cloud.&lt;/p&gt;
&lt;p&gt;This connector is designed for organizations that have existing Dremio Software deployments and are adopting Dremio Cloud for new workloads, or that need to federate data across a cloud-managed Dremio platform and on-premises Dremio instances.&lt;/p&gt;
&lt;h2&gt;Why Connect Dremio Software to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;Hybrid Federation&lt;/h3&gt;
&lt;p&gt;Your Dremio Software instance manages on-premises data sources : Oracle databases, SQL Server, network-attached file storage, and internal data lakes. Dremio Cloud manages cloud-native sources , S3, BigQuery, Snowflake, and cloud-hosted databases. By connecting Dremio Software to Dremio Cloud, you can write a single SQL query that joins on-premises data (through Dremio Software) with cloud data (through Dremio Cloud).&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join on-premises data via Dremio Software with cloud data in Dremio Cloud
SELECT
  cloud.customer_name,
  cloud.cloud_revenue,
  onprem.erp_balance,
  onprem.last_payment_date,
  CASE
    WHEN cloud.cloud_revenue &amp;gt; 100000 AND onprem.erp_balance &amp;lt; 5000 THEN &apos;Good Standing&apos;
    WHEN onprem.erp_balance &amp;gt; 50000 THEN &apos;At Risk&apos;
    ELSE &apos;Standard&apos;
  END AS account_health
FROM analytics.gold.cloud_customers cloud
JOIN &amp;quot;dremio-onprem&amp;quot;.onprem.erp_accounts onprem ON cloud.customer_id = onprem.customer_id
ORDER BY cloud.cloud_revenue DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Incremental Cloud Migration&lt;/h3&gt;
&lt;p&gt;Organizations don&apos;t shut down on-premises data centers overnight. Connecting Dremio Software to Dremio Cloud lets you:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Start using Dremio Cloud&lt;/strong&gt; for new cloud-native workloads&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Continue using Dremio Software&lt;/strong&gt; for on-premises sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Federate across both&lt;/strong&gt; from a single Dremio Cloud interface&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gradually migrate&lt;/strong&gt; data sources from Software to Cloud as on-premises systems are decommissioned&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Consolidated Governance&lt;/h3&gt;
&lt;p&gt;Users access both on-premises and cloud data through Dremio Cloud&apos;s interface. Dremio Cloud&apos;s governance policies (column masking, row-level filtering) apply to the federated view of data, providing a single governance layer across all data.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio Software instance&lt;/strong&gt; accessible from Dremio Cloud over HTTPS
&lt;ul&gt;
&lt;li&gt;Version 24.0 or later recommended&lt;/li&gt;
&lt;li&gt;Arrow Flight endpoint enabled and accessible (port 32010 or 443 with TLS)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authentication:&lt;/strong&gt; Username/password or Personal Access Token for the Dremio Software instance&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network:&lt;/strong&gt; The Dremio Software instance must be reachable from Dremio Cloud&apos;s network. Options:
&lt;ul&gt;
&lt;li&gt;Public endpoint with TLS&lt;/li&gt;
&lt;li&gt;VPN/VPC peering&lt;/li&gt;
&lt;li&gt;AWS PrivateLink or equivalent&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; : &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-dremio-to-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect Dremio Software to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;Click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; in the Dremio Cloud console and select &lt;strong&gt;Dremio&lt;/strong&gt; from the source types.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; Descriptive identifier (e.g., &lt;code&gt;dremio-onprem&lt;/code&gt; or &lt;code&gt;datacenter-west&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; The hostname or IP address of your Dremio Software coordinator node.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port:&lt;/strong&gt; Arrow Flight port (typically &lt;code&gt;32010&lt;/code&gt;, or &lt;code&gt;443&lt;/code&gt; with TLS).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SSL/TLS:&lt;/strong&gt; Enable if the Software instance uses encrypted connections.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Provide credentials for a Dremio Software user account. Consider creating a dedicated service account with appropriate permissions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Read access to the virtual datasets (views) and physical datasets you want to federate&lt;/li&gt;
&lt;li&gt;User impersonation support if you want Dremio Cloud queries to execute as the requesting user on Dremio Software&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. User Impersonation&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;User impersonation&lt;/strong&gt; allows Dremio Cloud to pass the identity of the requesting user to Dremio Software. When enabled, queries executed through Dremio Cloud run with the permissions of the authenticated user on the Dremio Software side. This preserves your existing Dremio Software access control policies.&lt;/p&gt;
&lt;p&gt;Without impersonation, all Cloud queries execute as the service account configured in the connection, which may have broader access than individual users should.&lt;/p&gt;
&lt;h3&gt;5. Configure Advanced Settings&lt;/h3&gt;
&lt;p&gt;Set Reflection Refresh, Metadata refresh intervals, and connection properties. Click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Querying Across Deployments&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query on-premises data through Dremio Software
SELECT
  department,
  employee_count,
  avg_salary
FROM &amp;quot;dremio-onprem&amp;quot;.hr.department_summary;

-- Join on-premises HR data with cloud-native analytics
SELECT
  d.department,
  d.employee_count,
  d.avg_salary,
  c.department_cloud_spend,
  ROUND(c.department_cloud_spend / d.employee_count, 2) AS cloud_cost_per_employee
FROM &amp;quot;dremio-onprem&amp;quot;.hr.department_summary d
JOIN analytics.gold.cloud_infrastructure_costs c ON d.department = c.department
ORDER BY cloud_cost_per_employee DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Build a Semantic Layer Across Deployments&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.enterprise_360 AS
SELECT
  onprem.employee_id,
  onprem.employee_name,
  onprem.department,
  onprem.office_location,
  cloud.cloud_account_id,
  cloud.monthly_cloud_spend,
  CASE
    WHEN cloud.monthly_cloud_spend &amp;gt; 10000 THEN &apos;Heavy Cloud User&apos;
    WHEN cloud.monthly_cloud_spend &amp;gt; 1000 THEN &apos;Moderate&apos;
    ELSE &apos;Light&apos;
  END AS cloud_usage_tier
FROM &amp;quot;dremio-onprem&amp;quot;.hr.employees onprem
LEFT JOIN analytics.gold.cloud_accounts cloud ON onprem.employee_id = cloud.owner_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; → &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics Across Deployments&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent lets users ask questions spanning both on-premises and cloud data: &amp;quot;Which departments have the highest cloud cost per employee?&amp;quot; or &amp;quot;Show me heavy cloud users in the engineering department.&amp;quot; The Agent reads your semantic layer&apos;s wiki descriptions and generates SQL that joins across both Dremio deployments.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;Connect &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Claude or ChatGPT&lt;/a&gt; to your federated data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth app in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client&lt;/li&gt;
&lt;li&gt;Connect via &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A CTO asks Claude &amp;quot;Compare cloud infrastructure costs per department with on-premises headcount&amp;quot; and gets insights spanning both deployment models.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify departments by cloud optimization potential
SELECT
  department,
  employee_count,
  cloud_cost_per_employee,
  AI_CLASSIFY(
    &apos;Based on cloud spending patterns, classify optimization potential&apos;,
    &apos;Department: &apos; || department || &apos;, Employees: &apos; || CAST(employee_count AS VARCHAR) || &apos;, Cloud Cost/Employee: $&apos; || CAST(cloud_cost_per_employee AS VARCHAR),
    ARRAY[&apos;Well Optimized&apos;, &apos;Room for Improvement&apos;, &apos;Over-Provisioned&apos;, &apos;Needs Audit&apos;]
  ) AS optimization_status
FROM (
  SELECT
    d.department,
    d.employee_count,
    ROUND(c.department_cloud_spend / d.employee_count, 2) AS cloud_cost_per_employee
  FROM &amp;quot;dremio-onprem&amp;quot;.hr.department_summary d
  JOIN analytics.gold.cloud_infrastructure_costs c ON d.department = c.department
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Important Considerations&lt;/h2&gt;
&lt;h3&gt;Network Latency&lt;/h3&gt;
&lt;p&gt;Cross-network queries between Dremio Cloud and on-premises Dremio Software add network latency. Optimize by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using Reflections to cache frequently accessed on-premises data in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Creating aggregated views on the Dremio Software side that pre-compute common metrics : transfer summarized data rather than raw tables&lt;/li&gt;
&lt;li&gt;Minimizing the amount of raw data transferred across the network&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Cloud Egress Costs&lt;/h3&gt;
&lt;p&gt;Data returned from Dremio Software to Dremio Cloud may incur cloud egress charges if the Software instance runs in a different network or cloud provider. Strategies to minimize egress:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Build pre-aggregated views on the Software side&lt;/li&gt;
&lt;li&gt;Use Reflections to cache results (data transfers once per refresh, not per query)&lt;/li&gt;
&lt;li&gt;Filter data as close to the source as possible&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Version Compatibility&lt;/h3&gt;
&lt;p&gt;Keep Dremio Software at version 24.0 or later for best compatibility with Dremio Cloud. Older versions may have limited feature support through the federation connector.&lt;/p&gt;
&lt;h3&gt;Security&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Enable TLS for all connections between Dremio Cloud and Software&lt;/li&gt;
&lt;li&gt;Use a dedicated service account with minimal necessary permissions&lt;/li&gt;
&lt;li&gt;Enable user impersonation for proper access control propagation&lt;/li&gt;
&lt;li&gt;Consider network-level security (VPN, PrivateLink) for on-premises connections&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Monitoring and Troubleshooting&lt;/h3&gt;
&lt;p&gt;Monitor the health and performance of your hybrid deployment:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Query profiles:&lt;/strong&gt; Use Dremio Cloud&apos;s query profiler to identify slow cross-deployment queries. Look for high data transfer volumes that suggest missing Reflections.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metadata refresh timing:&lt;/strong&gt; If Dremio Cloud shows stale schema from Dremio Software, decrease the metadata refresh interval.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Connection pool management:&lt;/strong&gt; For high-concurrency workloads, monitor connection usage between Cloud and Software. Increase the maximum idle connections if you see connection timeout errors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency benchmarks:&lt;/strong&gt; Establish baseline latency for cross-deployment queries. If latency degrades, check network connectivity and consider adding Reflections to cache frequently accessed data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Track these metrics to ensure your hybrid architecture delivers consistent performance as usage grows.&lt;/p&gt;
&lt;h2&gt;Governance Across Deployments&lt;/h2&gt;
&lt;p&gt;Dremio Cloud&apos;s Fine-Grained Access Control (FGAC) applies governance to the federated view of data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask sensitive on-premises fields (employee SSN, salary) from specific Cloud user roles&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Regional Cloud users see only their region&apos;s data from on-premises sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance rules apply whether data comes from the Software instance, Cloud sources, or external databases&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;BI tools connected to Dremio Cloud via Arrow Flight access both cloud and on-premises data through a single connection:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector : one connection serves data from both deployments&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; client for programmatic access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; adapter for transformation workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer : regardless of where the source data resides.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration enables developers to query federated data from their IDE. Ask Copilot &amp;quot;Compare cloud costs per department with on-premises headcount&amp;quot; and it generates SQL using your semantic layer that spans both deployments.&lt;/p&gt;
&lt;h2&gt;Reflections for Hybrid Optimization&lt;/h2&gt;
&lt;p&gt;Create Reflections on hybrid views to cache cross-deployment query results:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Build views that join Cloud and Software data&lt;/li&gt;
&lt;li&gt;Create Reflections on those views&lt;/li&gt;
&lt;li&gt;Set refresh intervals based on how frequently the underlying on-premises data changes&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;After creation, dashboard queries that span both deployments are served from Dremio Cloud&apos;s Reflection cache , eliminating network latency for repeat queries.&lt;/p&gt;
&lt;h2&gt;Migration Planning: Software to Cloud&lt;/h2&gt;
&lt;p&gt;Use the Dremio-to-Dremio connector as a migration bridge:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Phase 1 : Federation:&lt;/strong&gt; Connect Dremio Software to Dremio Cloud. All existing Software views remain accessible from Cloud.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 2 : Parallel Development:&lt;/strong&gt; Build new views and Reflections in Dremio Cloud while continuing to maintain Software views.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 3 : Source Migration:&lt;/strong&gt; Gradually move individual data sources (PostgreSQL, Oracle, S3) from Software connections to Cloud connections. Update views to reference Cloud-native sources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 4 : Decommission:&lt;/strong&gt; Once all sources are connected to Cloud, remove the Dremio Software connection.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;During the migration, users experience no disruption : they continue querying through Dremio Cloud while the underlying sources are being transitioned.&lt;/p&gt;
&lt;h2&gt;Common Deployment Architectures&lt;/h2&gt;
&lt;h3&gt;Hub-and-Spoke Model&lt;/h3&gt;
&lt;p&gt;Dremio Cloud serves as the central hub, with multiple Dremio Software instances as spokes. Each spoke manages a specific data center or business unit:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Spoke A:&lt;/strong&gt; Finance data center (Oracle, SQL Server, DB2)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spoke B:&lt;/strong&gt; Manufacturing data center (SAP HANA, PostgreSQL)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spoke C:&lt;/strong&gt; Research data center (S3, MongoDB)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dremio Cloud federates across all spokes, providing a single analytics interface for the entire organization.&lt;/p&gt;
&lt;h3&gt;Staged Migration Model&lt;/h3&gt;
&lt;p&gt;For organizations migrating to the cloud in waves:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Wave 1:&lt;/strong&gt; Non-sensitive workloads migrate to Cloud with direct source connections&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wave 2:&lt;/strong&gt; Sensitive workloads use Software as a proxy (governance-compliant data access)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wave 3:&lt;/strong&gt; Remaining workloads migrate as regulatory and security requirements are met&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Disaster Recovery Model&lt;/h3&gt;
&lt;p&gt;Dremio Software serves as a fallback if Cloud connectivity is temporarily unavailable. On-premises critical workloads run against Software; Cloud handles all other analytics. This architecture provides business continuity for mission-critical dashboards and reports.&lt;/p&gt;
&lt;h2&gt;Performance Best Practices&lt;/h2&gt;
&lt;p&gt;Maximize hybrid performance with these strategies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pre-aggregate on Software side:&lt;/strong&gt; Build views in Dremio Software that SUM, COUNT, and AVG at the granularity Cloud queries need. Transfer megabytes, not gigabytes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Reflections aggressively:&lt;/strong&gt; Create Reflections on every cross-deployment view. Network latency disappears once results are cached.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schedule Reflection refreshes strategically:&lt;/strong&gt; Refresh during off-peak hours when network bandwidth is available.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitor data transfer volumes:&lt;/strong&gt; Use Dremio Cloud&apos;s query profiler to identify queries that transfer large volumes from Software. Convert these to Reflections first.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Organizations can seamlessly federate across Dremio deployments, enable AI analytics on combined on-premises and cloud data, and migrate incrementally to the cloud : all while maintaining unified governance. The Dremio-to-Dremio connector is the bridge that makes hybrid lakehouse analytics practical.&lt;/p&gt;
&lt;p&gt;Whether you&apos;re running a single Dremio Software instance in one data center or managing multiple Software installations across global facilities, Dremio Cloud provides a unified analytical interface. Combine the raw data processing power of on-premises Dremio Software with the AI capabilities, Reflections, and managed infrastructure of Dremio Cloud. The result is a truly hybrid analytics platform that grows with your cloud migration at whatever pace your organization requires. No rip-and-replace, no big-bang migration : just a gradual, governed transition that protects your existing investments.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-dremio-to-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your existing Dremio Software instances.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Data Engineering Best Practices: The Complete Checklist</title><link>https://iceberglakehouse.com/posts/2026-02-debp-de-best-practices-checklist/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-debp-de-best-practices-checklist/</guid><description>
![Comprehensive data engineering checklist organized by categories with status indicators](/assets/images/debp/10/de-checklist.png)

Best practices d...</description><pubDate>Wed, 18 Feb 2026 18:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/debp/10/de-checklist.png&quot; alt=&quot;Comprehensive data engineering checklist organized by categories with status indicators&quot;&gt;&lt;/p&gt;
&lt;p&gt;Best practices documents are easy to write and hard to use. They list principles without context, advice without prioritization, and rules without explaining when to break them. This one is different. It&apos;s a practical, tool-agnostic checklist organized by the categories that matter most : with each item tied to a specific outcome.&lt;/p&gt;
&lt;p&gt;Use this as a recurring audit. Run through it quarterly. Any unchecked item is either a technical debt item or a conscious tradeoff. Know which is which.&lt;/p&gt;
&lt;h2&gt;Pipeline Design&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Separate ingestion from transformation.&lt;/strong&gt; Raw data lands unchanged. Business logic runs separately. This lets you replay raw data and isolate failures.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Model pipelines as DAGs.&lt;/strong&gt; Each stage has explicit inputs and outputs. Independent stages run in parallel. Failed stages retry alone.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Make dependencies explicit.&lt;/strong&gt; If pipeline B needs the output of pipeline A, declare that dependency in your orchestrator. Don&apos;t rely on timing assumptions.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Use sensors or triggers for scheduling.&lt;/strong&gt; Wait for data to arrive, not for the clock to hit a certain time. Data-driven triggers are more reliable than cron jobs.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Keep stages single-purpose.&lt;/strong&gt; An ingestion stage should not also validate, transform, and load. Each stage does one thing and does it well.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Data Quality&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Validate schema at ingestion.&lt;/strong&gt; Compare incoming data against expected column names, types, and nullability. Catch schema drift before it propagates.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Check completeness.&lt;/strong&gt; Required fields have no nulls. If &lt;code&gt;customer_id&lt;/code&gt; is nullable in your orders table, downstream joins will silently lose rows.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Enforce uniqueness.&lt;/strong&gt; Primary keys have no duplicates. Run dedup checks after every load. Double-counted records are worse than missing records.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Quarantine bad records.&lt;/strong&gt; Route validation failures to a quarantine table with metadata (which check failed, when, the original record). Never drop records silently.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Track quality metrics.&lt;/strong&gt; Null rates, duplicate rates, and range violations tracked per pipeline, per day. Trend these metrics to catch gradual degradation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/10/quality-checklist.png&quot; alt=&quot;Data quality checklist: schema validation, completeness, uniqueness, quarantine&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Reliability and Idempotency&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Make every pipeline idempotent.&lt;/strong&gt; Running the same job twice produces the same result. Use partition overwrite or MERGE : never blind INSERT.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Implement retry with backoff.&lt;/strong&gt; Transient failures (network, API limits) resolve themselves. Retry 3-5 times with exponential backoff before alerting.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Use dead-letter queues.&lt;/strong&gt; Records that can&apos;t be processed go to a queue for inspection, not to /dev/null.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Checkpoint progress.&lt;/strong&gt; After processing each batch or partition, record what&apos;s done. On failure, resume from the last checkpoint.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Design for failure.&lt;/strong&gt; Every component will fail. Define the expected behavior for each failure mode: retry, skip and log, alert, or halt.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Schema Management&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Treat your schema as an API.&lt;/strong&gt; Column names are fields. Tables are endpoints. Consumers are clients. Changing the schema without coordination is as bad as changing an API without versioning.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Use additive-only changes.&lt;/strong&gt; Add new columns. Never remove or rename columns without a deprecation period.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Enforce contracts at boundaries.&lt;/strong&gt; Validate that incoming schema matches expectations at ingestion. Validate that outgoing schema matches consumer contracts at serving.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Version breaking changes.&lt;/strong&gt; When a schema must change incompatibly, version it (v1, v2). Let consumers migrate on their own schedule.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Document every column.&lt;/strong&gt; Column name, type, description, source, owner. If an engineer can&apos;t find this information in under 30 seconds, it&apos;s not documented.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Testing and Validation&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Run schema tests on every pipeline execution.&lt;/strong&gt; Column existence, data types, not-null constraints. These are fast, cheap, and catch the most common problems.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Run uniqueness and null checks on primary keys.&lt;/strong&gt; The two most impactful data quality tests. Add them today.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Compare row counts against baselines.&lt;/strong&gt; Alert when today&apos;s count deviates by more than 20% from the trailing average. Catches missing data and unexpected volume spikes.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Test transformation logic with fixtures.&lt;/strong&gt; Small, known-good input datasets with expected outputs. Run these in CI before deploying pipeline changes.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Add regression tests for key business metrics.&lt;/strong&gt; Total revenue, distinct customer count, and other critical aggregations compared against previous runs.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Observability and Monitoring&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Track data freshness per table.&lt;/strong&gt; The timestamp of the most recent row. Alert when it exceeds the SLA. This single metric catches more problems than any other.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Alert on business impact, not every error.&lt;/strong&gt; SLA violations, quality regressions, and anomalous volume changes are alerts. Transient retries and expected maintenance are not.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Use structured logging.&lt;/strong&gt; JSON-formatted log entries with pipeline name, stage, batch ID, timestamp, row count, and status. Searchable, parseable, filterable.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Build data lineage.&lt;/strong&gt; Know where each table&apos;s data comes from and where it goes. Column-level lineage turns &amp;quot;the numbers are wrong&amp;quot; from a half-day investigation into a 10-minute graph traversal.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Review observability quarterly.&lt;/strong&gt; Are alerts still relevant? Are thresholds still accurate? Are dashboards still used? Trim unactionable alerts and update stale baselines.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/10/observability-checklist.png&quot; alt=&quot;Observability checklist: freshness tracking, alert severity, structured logs, lineage&quot;&gt;&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Print this checklist. Walk through it with your team in a 30-minute meeting. Check what&apos;s already in place, identify the three highest-impact unchecked items, and schedule them as engineering work : not aspirational goals on a wiki page. Best practices only matter when they&apos;re implemented.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Data Modeling Best Practices: 7 Mistakes to Avoid</title><link>https://iceberglakehouse.com/posts/2026-02-dm-data-modeling-best-practices/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-dm-data-modeling-best-practices/</guid><description>
![Checklist of data modeling quality markers with warning symbols on common mistakes](/assets/images/data_modeling/10/best-practices-checklist.png)

...</description><pubDate>Wed, 18 Feb 2026 18:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/10/best-practices-checklist.png&quot; alt=&quot;Checklist of data modeling quality markers with warning symbols on common mistakes&quot;&gt;&lt;/p&gt;
&lt;p&gt;A bad data model doesn&apos;t announce itself. It hides behind slow dashboards, conflicting numbers, confused analysts, and AI agents that generate wrong SQL. By the time someone identifies the model as the root cause, the team has already built dozens of reports on top of it.&lt;/p&gt;
&lt;p&gt;Here are seven modeling mistakes that create downstream pain : and how to avoid each one.&lt;/p&gt;
&lt;h2&gt;Mistake 1: No Defined Grain&lt;/h2&gt;
&lt;p&gt;The grain declares what one row in a fact table represents. &amp;quot;One row per order line item.&amp;quot; &amp;quot;One row per daily user session.&amp;quot; &amp;quot;One row per monthly account balance.&amp;quot;&lt;/p&gt;
&lt;p&gt;Without a declared grain, aggregation produces wrong numbers. If some rows represent individual transactions and others represent daily summaries, a SUM query double-counts or under-counts depending on the mix.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Before designing any fact table, write down the grain in one sentence. Share it with your team. If you can&apos;t state the grain clearly, the table isn&apos;t ready for production.&lt;/p&gt;
&lt;h2&gt;Mistake 2: Cryptic Naming&lt;/h2&gt;
&lt;p&gt;Columns named &lt;code&gt;c1&lt;/code&gt;, &lt;code&gt;dt&lt;/code&gt;, &lt;code&gt;amt&lt;/code&gt;, &lt;code&gt;flg&lt;/code&gt;, and &lt;code&gt;cat_cd&lt;/code&gt; save keystrokes during development but cost hours during analysis. Every analyst who encounters these names must either read the ETL code, ask the engineer, or guess.&lt;/p&gt;
&lt;p&gt;AI agents have the same problem. An agent asked to calculate &amp;quot;total revenue&amp;quot; can&apos;t identify the right column if it&apos;s called &lt;code&gt;amt3&lt;/code&gt; instead of &lt;code&gt;revenue_usd&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Use descriptive, business-friendly names. &lt;code&gt;customer_name&lt;/code&gt;, &lt;code&gt;order_date&lt;/code&gt;, &lt;code&gt;revenue_usd&lt;/code&gt;, &lt;code&gt;is_active&lt;/code&gt;, &lt;code&gt;product_category&lt;/code&gt;. Include units where ambiguous (&lt;code&gt;weight_kg&lt;/code&gt;, &lt;code&gt;duration_minutes&lt;/code&gt;). Use &lt;code&gt;snake_case&lt;/code&gt; consistently.&lt;/p&gt;
&lt;h2&gt;Mistake 3: Skipping the Conceptual Model&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/10/conceptual-foundation.png&quot; alt=&quot;Conceptual model as the foundation layer that business and technical teams align on&quot;&gt;&lt;/p&gt;
&lt;p&gt;Going straight from a stakeholder request to &lt;code&gt;CREATE TABLE&lt;/code&gt; skips the alignment step. Engineers build what they understand from the request. Stakeholders assumed something different. The gap surfaces weeks or months later when reports don&apos;t match expectations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; For every new business domain, create a conceptual model first. List the entities, name the relationships, and get business stakeholder sign-off before writing any SQL.&lt;/p&gt;
&lt;h2&gt;Mistake 4: Over-Normalizing for Analytics&lt;/h2&gt;
&lt;p&gt;Third Normal Form (3NF) is correct for transactional systems where writes are frequent and consistency matters. Applied to an analytics workload, it creates queries with 10-15 joins that run slowly and break easily.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Separate your transactional model from your analytical model. Keep the OLTP system in 3NF. Build a denormalized star schema (or a set of wide views) for analytics. Different workloads deserve different models.&lt;/p&gt;
&lt;h2&gt;Mistake 5: Under-Documenting&lt;/h2&gt;
&lt;p&gt;A data model without documentation is a puzzle that only its creator can solve. And even they forget the details after a few months.&lt;/p&gt;
&lt;p&gt;Without documentation, every new team member reverse-engineers the model from scratch. Every AI agent generates SQL based on guesses. Every analyst interprets column meanings differently, leading to metric discrepancies that take weeks to reconcile.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Document at three levels:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column level:&lt;/strong&gt; What does each column mean? Where does it come from?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Table level:&lt;/strong&gt; What grain does this table use? Who maintains it?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model level:&lt;/strong&gt; How do tables connect? What business process does this model represent?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Platforms like &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt; make this practical with built-in Wikis for every dataset and Labels for classification (PII, Certified, Raw, Deprecated). The documentation lives next to the data, not in a separate spreadsheet that goes stale.&lt;/p&gt;
&lt;h2&gt;Mistake 6: One Model for Every Workload&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/10/one-model-problem.png&quot; alt=&quot;Single model struggling to serve transactions, analytics, and AI simultaneously&quot;&gt;&lt;/p&gt;
&lt;p&gt;A model designed for a transactional application doesn&apos;t serve analytics well. A model designed for analytics doesn&apos;t serve a machine learning feature store well. Trying to make one model serve every use case leads to compromises that serve no use case well.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Build purpose-specific models layered on top of shared source data. The Medallion Architecture does this naturally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Bronze:&lt;/strong&gt; Raw data from sources (shared foundation)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Silver:&lt;/strong&gt; Business logic layer (shared across analytics and ML)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gold:&lt;/strong&gt; Purpose-built views (one for dashboards, one for ML features, one for AI agents)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each Gold view is tailored to its consumer without duplicating the transformation logic in Silver.&lt;/p&gt;
&lt;h2&gt;Mistake 7: Ignoring Governance&lt;/h2&gt;
&lt;p&gt;Data models don&apos;t exist in a vacuum. They contain PII, financial data, health records, and other sensitive information. Ignoring governance creates compliance risk and erodes trust.&lt;/p&gt;
&lt;p&gt;Common governance gaps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No access controls (everyone sees everything)&lt;/li&gt;
&lt;li&gt;No classification (no one knows which columns contain PII)&lt;/li&gt;
&lt;li&gt;No ownership (no one knows who to ask about table X)&lt;/li&gt;
&lt;li&gt;No lineage (no one knows where the data came from)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Integrate governance from day one:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tag columns by sensitivity (PII, financial, public)&lt;/li&gt;
&lt;li&gt;Assign ownership per table or domain&lt;/li&gt;
&lt;li&gt;Apply row and column-level access policies&lt;/li&gt;
&lt;li&gt;Document data lineage from source to consumption&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In Dremio, Fine-Grained Access Control enforces row and column-level policies, Labels classify datasets, and the Open Catalog tracks lineage. Governance is part of the platform, not an afterthought.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/10/modeling-cycle.png&quot; alt=&quot;Iterative data modeling cycle: design, document, measure, improve&quot;&gt;&lt;/p&gt;
&lt;p&gt;Pick one of these seven mistakes. Check whether your current data model has it. Fix it. Then move to the next one. Data modeling is iterative : no team gets it perfect on the first pass. The goal is not perfection but continuous improvement: clearer names, better documentation, tighter governance, and models that match what your consumers actually need.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Semantic Layer Best Practices: 7 Mistakes to Avoid</title><link>https://iceberglakehouse.com/posts/2026-02-sl-semantic-layer-best-practices/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-sl-semantic-layer-best-practices/</guid><description>
![Semantic layer best practices checklist : checks and mistakes](/assets/images/semantic_layer/10/best-practices.png)

Semantic layers don&apos;t fail bec...</description><pubDate>Wed, 18 Feb 2026 18:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/10/best-practices.png&quot; alt=&quot;Semantic layer best practices checklist : checks and mistakes&quot;&gt;&lt;/p&gt;
&lt;p&gt;Semantic layers don&apos;t fail because the technology is wrong. They fail because of design decisions made in the first two weeks : choices that seem reasonable at the time and create compounding problems for months afterward.&lt;/p&gt;
&lt;p&gt;Here are the seven mistakes that kill semantic layer projects, and how to avoid each one.&lt;/p&gt;
&lt;h2&gt;Mistake 1: Defining Metrics in Multiple Places&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: Revenue is defined in a Tableau calculated field, a Power BI DAX measure, a dbt model, and a SQL view. Four sources of truth. None of them agree.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it&apos;s common&lt;/strong&gt;: Teams adopt new tools without migrating metric definitions. Each tool gets its own model. Over time, the definitions drift.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Every metric gets exactly one canonical definition in the semantic layer. All downstream tools query that definition. No exceptions. When someone needs Revenue, they query &lt;code&gt;business.revenue&lt;/code&gt;, not their own formula.&lt;/p&gt;
&lt;p&gt;This principle extends to AI agents. If your AI generates its own metric formulas instead of referencing the semantic layer, you&apos;ve just added another source of truth : the least trustworthy one.&lt;/p&gt;
&lt;h2&gt;Mistake 2: Skipping the Bronze Layer&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: A data engineer creates a Silver view that joins raw source tables directly, mixing data cleanup (type casting, column renaming) with business logic (filters, calculations) in a single query. When the source schema changes :  a column is renamed, a type is modified ,  the Silver view breaks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it&apos;s common&lt;/strong&gt;: The Bronze layer feels redundant. It&apos;s just a 1:1 mapping of the source. Why add a layer that doesn&apos;t change anything?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: The Bronze layer absorbs schema changes. When a source renames &lt;code&gt;col_7&lt;/code&gt; to &lt;code&gt;order_date_utc&lt;/code&gt;, you update one Bronze view. The Silver and Gold views above it don&apos;t change. This insulation is worth the tiny overhead of maintaining passthrough views.&lt;/p&gt;
&lt;p&gt;Bronze views also standardize data formats. Timestamps normalized to UTC. Strings cast to consistent encodings. Column names made human-readable. This cleanup happens once, at the bottom of the stack, and every view above benefits.&lt;/p&gt;
&lt;h2&gt;Mistake 3: Using SQL Reserved Words as Column Names&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/10/naming-conventions.png&quot; alt=&quot;Bad vs. good naming conventions : cryptic abbreviations vs. clear business names&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: A Bronze view exposes a column called &lt;code&gt;Date&lt;/code&gt;. Now every downstream query must reference &lt;code&gt;&amp;quot;Date&amp;quot;&lt;/code&gt; with double quotes. Analysts forget. AI agents don&apos;t quote it at all. Queries break intermittently. Debugging is frustrating because the error messages are cryptic.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it&apos;s common&lt;/strong&gt;: Source systems often use generic names. &lt;code&gt;Date&lt;/code&gt;, &lt;code&gt;Timestamp&lt;/code&gt;, &lt;code&gt;Order&lt;/code&gt;, &lt;code&gt;Group&lt;/code&gt;, &lt;code&gt;Role&lt;/code&gt; : all are SQL reserved words. Bronze views that don&apos;t rename them propagate the problem to every consumer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Rename early. In the Bronze layer, map &lt;code&gt;Date&lt;/code&gt; to &lt;code&gt;TransactionDate&lt;/code&gt;, &lt;code&gt;Timestamp&lt;/code&gt; to &lt;code&gt;EventTimestamp&lt;/code&gt;, &lt;code&gt;Order&lt;/code&gt; to &lt;code&gt;CustomerOrder&lt;/code&gt;. Use domain-specific prefixes that are unambiguous and never conflict with SQL keywords.&lt;/p&gt;
&lt;p&gt;This small decision saves hundreds of hours of debugging across the life of the semantic layer. It also dramatically improves AI agent accuracy, since language models generating SQL rarely add appropriate quoting for reserved words.&lt;/p&gt;
&lt;h2&gt;Mistake 4: Building Without Stakeholder Input&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: A data engineering team builds 50 Silver views based on the database schema. They expose every table, every column, every possible metric. Business users look at the result, don&apos;t recognize any of the terms, and go back to their spreadsheets.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it&apos;s common&lt;/strong&gt;: Data engineers understand the schema. They assume the schema structure maps to business needs. It usually doesn&apos;t.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Start with a metric glossary co-created with stakeholders from Sales, Finance, Marketing, and Product. Ask them: What are your top 5 metrics? How do you calculate them? What decisions do they drive? Build the Silver layer around those answers, not around the database schema.&lt;/p&gt;
&lt;p&gt;This step feels slow. It&apos;s the fastest path to adoption. A semantic layer that uses business language and models business concepts gets adopted. A semantic layer that mirrors the database schema gets ignored.&lt;/p&gt;
&lt;h2&gt;Mistake 5: Treating Documentation as Optional&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: Views are created with no Wikis, no column descriptions, no Labels. The semantic layer works for the person who built it. Everyone else :  analysts, AI agents, new team members ,  can&apos;t figure out what the views mean.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it&apos;s common&lt;/strong&gt;: Documentation takes time. Deadlines are tight. Teams plan to &amp;quot;add documentation later.&amp;quot; Later never comes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Make documentation part of the view creation process, not a follow-up task. At minimum, every view gets:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A one-sentence description of what it represents&lt;/li&gt;
&lt;li&gt;Labels for governance (PII, Finance, Certified)&lt;/li&gt;
&lt;li&gt;Column descriptions for any non-obvious field&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Modern platforms reduce this burden with AI-generated documentation. &lt;a href=&quot;https://www.dremio.com/blog/5-powerful-dremio-ai-features-you-should-be-using/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&apos;s generative AI&lt;/a&gt; samples table data and auto-generates Wiki descriptions and Label suggestions. The AI provides a 70% first draft. The data team adds domain context for the other 30%.&lt;/p&gt;
&lt;p&gt;Undocumented views are invisible to AI agents. If the Wiki is empty, the AI agent has no context to generate accurate SQL. Documentation isn&apos;t just nice to have. It&apos;s an accuracy requirement.&lt;/p&gt;
&lt;h2&gt;Mistake 6: Applying Security at the BI Tool Level Only&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: Row-level security is configured in Tableau so regional managers only see their region. Then an analyst opens a SQL client, queries the underlying table directly, and sees all regions. The security was enforced in the dashboard, not in the data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it&apos;s common&lt;/strong&gt;: BI tools make it easy to apply filters and security rules. Data platforms require more setup. Teams take the easy path.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Enforce access policies at the semantic layer, not the BI layer. Row-level security and column masking should be applied on the virtual datasets (views). Every query path :  dashboard, notebook, API, AI agent ,  inherits the same rules.&lt;/p&gt;
&lt;p&gt;Dremio implements this through Fine-Grained Access Control (FGAC): policies defined as UDFs at the view level. A regional manager queries &lt;code&gt;business.revenue&lt;/code&gt; and automatically sees only their region, regardless of how they access the data. No security gaps between tools.&lt;/p&gt;
&lt;h2&gt;Mistake 7: Trying to Model Everything at Once&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/10/incremental-growth.png&quot; alt=&quot;Incremental growth : from a small core to a comprehensive semantic layer&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: The team commits to building a complete semantic layer covering every source, every table, and every metric. The project takes six months. By the time it launches, requirements have changed, stakeholder interest has waned, and half the views are out of date.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it&apos;s common&lt;/strong&gt;: Ambitious leaders want a &amp;quot;complete&amp;quot; solution. Data teams want to avoid rework. Neither wants to ship an incomplete layer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Start with 3-5 core metrics that the organization actively debates (usually Revenue, Active Users, Churn). Build one Bronze → Silver → Gold pipeline per metric. Validate that the same question produces the same answer across two different tools.&lt;/p&gt;
&lt;p&gt;Once those metrics are stable, expand incrementally. Add new sources, new views, new metrics : one at a time. Each addition is low-risk because the layered architecture isolates changes. A new Gold view doesn&apos;t affect existing Silver views.&lt;/p&gt;
&lt;p&gt;The fastest semantic layers reach 80% organizational coverage not by modeling everything up front, but by proving value quickly and expanding from momentum.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Pick one mistake from this list. Check whether your semantic layer (or your plan for one) is making it. Fix that one thing this week. Then come back for the next one.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Pipeline Observability: Know When Things Break</title><link>https://iceberglakehouse.com/posts/2026-02-debp-observability-monitoring/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-debp-observability-monitoring/</guid><description>
![Pipeline observability dashboard showing metrics, logs, and data lineage](/assets/images/debp/09/observability-dashboard.png)

An analyst messages ...</description><pubDate>Wed, 18 Feb 2026 17:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/debp/09/observability-dashboard.png&quot; alt=&quot;Pipeline observability dashboard showing metrics, logs, and data lineage&quot;&gt;&lt;/p&gt;
&lt;p&gt;An analyst messages you on Slack: &amp;quot;The revenue numbers look wrong. Is the pipeline broken?&amp;quot; You check the orchestrator : all green. You check the target table , data loaded this morning. You check the row count : looks normal. Forty-five minutes later, you discover that a source API returned empty responses for one region, and the pipeline happily loaded zero rows for that region without alerting anyone.&lt;/p&gt;
&lt;p&gt;The pipeline succeeded. The data was wrong. No one knew until a human noticed.&lt;/p&gt;
&lt;p&gt;This is the cost of monitoring pipeline execution without monitoring pipeline output.&lt;/p&gt;
&lt;h2&gt;You Can&apos;t Fix What You Can&apos;t See&lt;/h2&gt;
&lt;p&gt;Traditional monitoring answers: did the job run? Did it succeed? How long did it take? These questions cover infrastructure health, not data health. A pipeline can execute perfectly : no errors, no retries, no timeouts , and still produce incorrect or incomplete data.&lt;/p&gt;
&lt;p&gt;Observability goes further. It answers: what did the pipeline process? How much? Was the data complete and correct? Is the output fresh? And when something is wrong, it provides enough context to diagnose the root cause without hunting through logs manually.&lt;/p&gt;
&lt;p&gt;The distinction matters. Monitoring tells you the pipeline ran. Observability tells you the pipeline worked.&lt;/p&gt;
&lt;h2&gt;The Three Pillars of Pipeline Observability&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Metrics.&lt;/strong&gt; Quantitative measurements collected at every pipeline stage: row counts, processing time, error rates, data freshness, resource utilization. Metrics are cheap to collect, easy to aggregate, and essential for dashboards and alerting.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Logs.&lt;/strong&gt; Structured, timestamped records of what happened during execution. A useful log entry includes: pipeline name, stage name, batch ID, timestamp, action (started/completed/failed), row count, and any error message. Structured logs (JSON format) are searchable and parseable. Unstructured logs (&amp;quot;Processing data...&amp;quot;) are noise.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Lineage.&lt;/strong&gt; The path data takes from source to destination, at the table or column level. Lineage answers: where did this number come from? If the source changes, what downstream tables and dashboards are affected? Lineage turns debugging from archaeology into graph traversal.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/09/three-pillars.png&quot; alt=&quot;Three pillars: metrics tracking counts and timing, logs recording execution details, lineage mapping data flow&quot;&gt;&lt;/p&gt;
&lt;h2&gt;What to Measure&lt;/h2&gt;
&lt;p&gt;Not everything needs a metric. Measure what helps you answer these questions:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Is the data fresh?&lt;/strong&gt; Track the timestamp of the most recent row in each target table. Compare it to the expected freshness (e.g., less than 2 hours old). A freshness metric that exceeds its SLA triggers an alert before anyone opens a dashboard.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Is the data complete?&lt;/strong&gt; Track row counts in vs. row counts out at each stage. A significant drop (e.g., input: 100,000 rows, output: 90,000 rows) means records were filtered, rejected, or lost.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Is the data correct?&lt;/strong&gt; Track quality metrics: null rates, duplicate rates, range violation counts. Trend these over time. A gradual increase in null rates indicates a deteriorating source.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Is the pipeline healthy?&lt;/strong&gt; Track execution time per stage. A stage that normally takes 5 minutes but now takes 50 minutes may indicate data volume growth, resource contention, or a bad query plan.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Is the pipeline meeting SLAs?&lt;/strong&gt; Define when data must be available (e.g., daily tables loaded by 6 AM). Track SLA compliance as a percentage. A pipeline with 95% SLA compliance has failed its consumers once every 20 days.&lt;/p&gt;
&lt;h2&gt;Alerting Without Alert Fatigue&lt;/h2&gt;
&lt;p&gt;Alert fatigue is the most common reason observability fails. Too many alerts and the on-call engineer starts ignoring them. Too few and real problems go unnoticed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Alert on business impact, not on every error.&lt;/strong&gt; A transient retry is not an alert. A pipeline that misses its SLA by an hour is. A single null row is not an alert. A null rate jumping from 0.1% to 15% is.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use severity levels.&lt;/strong&gt; Critical: data consumers are affected now (missed SLA, empty output). Warning: something is degrading but not yet impacting consumers (execution time increasing, row count declining). Info: notable but non-actionable (successful backfill, schema migration completed).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Set thresholds dynamically.&lt;/strong&gt; Static thresholds (&amp;quot;alert if row count &amp;lt; 10,000&amp;quot;) break when data naturally grows or shrinks. Use rolling baselines: alert if today&apos;s row count deviates by more than 20% from the 7-day average.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Route alerts effectively.&lt;/strong&gt; Critical alerts go to PagerDuty or on-call channels. Warnings go to team Slack channels. Info goes to logs-only. Don&apos;t send everything to the same channel.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/09/alert-levels.png&quot; alt=&quot;Alert severity levels: critical triggers pages, warning goes to channel, info logged&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Data Lineage for Impact Analysis&lt;/h2&gt;
&lt;p&gt;When a problem occurs, the first question is: what&apos;s affected? Lineage answers this.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Upstream analysis.&lt;/strong&gt; A dashboard shows wrong numbers. Lineage traces the dashboard back through the serving table, the transformation, the staging table, and the raw source. The break is visible in the graph.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Downstream impact analysis.&lt;/strong&gt; A source system announces a schema change. Lineage shows every table, model, and dashboard that depends on that source. You know the blast radius before making any changes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Column-level lineage.&lt;/strong&gt; Table-level lineage shows connections between tables. Column-level lineage shows which source column feeds which target column. This level of detail turns a &amp;quot;the revenue is wrong&amp;quot; investigation from hours to minutes.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Add freshness tracking to your three most critical tables: record the max event timestamp after each load and alert when it exceeds the SLA. This single metric : data freshness , catches more problems than any other observability signal.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Data Vault Modeling: Hubs, Links, and Satellites</title><link>https://iceberglakehouse.com/posts/2026-02-dm-data-vault-modeling/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-dm-data-vault-modeling/</guid><description>
![Data Vault model showing Hubs, Links, and Satellites as interconnected components](/assets/images/data_modeling/09/data-vault-overview.png)

Dimens...</description><pubDate>Wed, 18 Feb 2026 17:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/09/data-vault-overview.png&quot; alt=&quot;Data Vault model showing Hubs, Links, and Satellites as interconnected components&quot;&gt;&lt;/p&gt;
&lt;p&gt;Dimensional modeling works well when your source systems are stable and your business questions are predictable. But what happens when sources change constantly, new systems get added every quarter, and regulatory requirements demand a full audit trail of every attribute change?&lt;/p&gt;
&lt;p&gt;Data Vault modeling was designed for exactly this scenario. Created by Dan Linstedt, it separates data into three distinct table types :  Hubs, Links, and Satellites ,  each handling a different concern: identity, relationships, and descriptive context.&lt;/p&gt;
&lt;h2&gt;What Problem Data Vault Solves&lt;/h2&gt;
&lt;p&gt;Traditional dimensional models embed everything about a business entity in one dimension table. A &lt;code&gt;dim_customers&lt;/code&gt; table contains the customer ID, name, address, segment, acquisition channel, and lifetime value. When a new source system provides additional customer attributes, you add columns to &lt;code&gt;dim_customers&lt;/code&gt;. When business rules change how &amp;quot;segment&amp;quot; is calculated, you update the ETL pipeline that populates that table.&lt;/p&gt;
&lt;p&gt;Over time, these dimension tables become fragile. They depend on multiple source systems. A change in one source breaks the ETL. Schema changes require coordinated updates across pipelines, tables, and downstream reports.&lt;/p&gt;
&lt;p&gt;Data Vault solves this by decomposing entities into independent components that can evolve separately.&lt;/p&gt;
&lt;h2&gt;The Three Building Blocks&lt;/h2&gt;
&lt;h3&gt;Hubs: Business Identity&lt;/h3&gt;
&lt;p&gt;A Hub stores unique business keys : the identifiers that define a business entity regardless of which source system provides them.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE hub_customer (
    customer_hash_key BINARY(32),  -- Hash of the business key
    customer_id VARCHAR(50),        -- Natural business key
    load_date TIMESTAMP,
    record_source VARCHAR(100)
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Hubs are immutable. Once a business key is loaded, it never changes. A customer who has &lt;code&gt;customer_id = &apos;C-1042&apos;&lt;/code&gt; always has that key. Hubs answer the question: &lt;em&gt;What business concepts exist?&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;Links: Relationships&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/09/hub-link-relationship.png&quot; alt=&quot;Hubs connected by Link tables representing relationships between business entities&quot;&gt;&lt;/p&gt;
&lt;p&gt;A Link stores relationships between Hubs. Every relationship :  customer-to-order, order-to-product, employee-to-department ,  gets its own Link table.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE link_customer_order (
    link_hash_key BINARY(32),
    customer_hash_key BINARY(32),
    order_hash_key BINARY(32),
    load_date TIMESTAMP,
    record_source VARCHAR(100)
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Links are also immutable. Once a relationship is recorded, it stays. Links support many-to-many relationships by default. They answer the question: &lt;em&gt;How are business concepts related?&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;Satellites: Descriptive Context&lt;/h3&gt;
&lt;p&gt;Satellites store the descriptive attributes of a Hub or Link, along with their change history.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE sat_customer_details (
    customer_hash_key BINARY(32),
    effective_date TIMESTAMP,
    customer_name VARCHAR(200),
    email VARCHAR(200),
    city VARCHAR(100),
    segment VARCHAR(50),
    load_date TIMESTAMP,
    record_source VARCHAR(100)
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Every time an attribute changes, a new Satellite row is inserted. This is equivalent to SCD Type 2 : full history is preserved without modifying existing rows. Different source systems can feed different Satellites for the same Hub, allowing attributes to arrive independently.&lt;/p&gt;
&lt;h2&gt;How a Data Vault Query Works&lt;/h2&gt;
&lt;p&gt;To reconstruct a business entity (like a current customer profile), you join the Hub to its current Satellite rows:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
    h.customer_id,
    s.customer_name,
    s.email,
    s.city,
    s.segment
FROM hub_customer h
JOIN sat_customer_details s ON h.customer_hash_key = s.customer_hash_key
WHERE s.effective_date = (
    SELECT MAX(effective_date)
    FROM sat_customer_details s2
    WHERE s2.customer_hash_key = s.customer_hash_key
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is more complex than querying &lt;code&gt;dim_customers&lt;/code&gt; directly. That complexity is the primary criticism of Data Vault. In practice, teams build a presentation layer :  star schema views on top of the vault ,  for business users and BI tools.&lt;/p&gt;
&lt;p&gt;Platforms like &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt; make this practical. The raw vault tables live in the Bronze layer. Silver-layer views reconstruct business entities by joining Hubs, Links, and Satellites. Gold-layer views present dimensional star schemas for dashboards and AI agents. Users never query the vault tables directly.&lt;/p&gt;
&lt;h2&gt;When Data Vault Fits&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Multiple source systems that change frequently.&lt;/strong&gt; Adding a new source means adding new Satellites : not redesigning existing tables. The Hub and Link structure remains stable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Regulated industries requiring full audit trails.&lt;/strong&gt; Financial services, healthcare, and government often need to prove what data looked like at any point in time. Satellites provide that out of the box.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Large enterprises with parallel development teams.&lt;/strong&gt; Hubs, Links, and Satellites can be loaded independently, enabling parallel ETL development without pipeline conflicts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Long-term data warehouses with decades of history.&lt;/strong&gt; The separation of structure (Hubs, Links) from content (Satellites) makes the vault resilient to business changes over time.&lt;/p&gt;
&lt;h2&gt;When Data Vault Doesn&apos;t Fit&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Small teams or simple source environments.&lt;/strong&gt; If you have five source tables and one BI tool, Data Vault adds complexity without proportional benefit. A star schema is faster to build and easier to maintain.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Direct BI tool access.&lt;/strong&gt; BI tools don&apos;t speak Data Vault natively. You always need a presentation layer on top, which means building two models instead of one.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Speed-to-value projects.&lt;/strong&gt; When the goal is &amp;quot;get a dashboard live this sprint,&amp;quot; Data Vault&apos;s up-front design work slows you down.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;Data Vault&lt;/th&gt;
&lt;th&gt;Dimensional Model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Source flexibility&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit trail&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Optional (SCDs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query simplicity&lt;/td&gt;
&lt;td&gt;Low (needs presentation layer)&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning curve&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adding new sources&lt;/td&gt;
&lt;td&gt;Easy (new satellites)&lt;/td&gt;
&lt;td&gt;Harder (redesign dimensions)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BI tool compatibility&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/09/data-vault-presentation.png&quot; alt=&quot;Presentation layer of star schema views built on top of a Data Vault foundation&quot;&gt;&lt;/p&gt;
&lt;p&gt;If you&apos;re evaluating Data Vault, start by counting your source systems and estimating how often they change schema. If the answer is &amp;quot;more than five sources&amp;quot; and &amp;quot;at least once a quarter,&amp;quot; Data Vault&apos;s separation of concerns will likely save you from painful redesign cycles. If your environment is simpler than that, a well-designed dimensional model will get you to production faster.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How a Self-Documenting Semantic Layer Reduces Data Team Toil</title><link>https://iceberglakehouse.com/posts/2026-02-sl-self-documenting-semantic-layer/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-sl-self-documenting-semantic-layer/</guid><description>
![Self-documenting semantic layer : AI generating descriptions and labels automatically](/assets/images/semantic_layer/09/self-documenting.png)

Ever...</description><pubDate>Wed, 18 Feb 2026 17:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/09/self-documenting.png&quot; alt=&quot;Self-documenting semantic layer : AI generating descriptions and labels automatically&quot;&gt;&lt;/p&gt;
&lt;p&gt;Every data team knows documentation is important. And almost every data team has a backlog of undocumented tables, unlabeled columns, and outdated descriptions that nobody has time to fix. The problem isn&apos;t motivation. It&apos;s that manual documentation doesn&apos;t scale.&lt;/p&gt;
&lt;p&gt;A self-documenting semantic layer changes the equation. Instead of asking humans to describe every column in every table, the platform generates descriptions automatically, suggests governance labels from data patterns, and propagates context through the view chain. Documentation becomes a byproduct of building the semantic layer, not a separate project.&lt;/p&gt;
&lt;h2&gt;The Documentation Problem Nobody Solves&lt;/h2&gt;
&lt;p&gt;Industry surveys consistently find that 70% or more of enterprise data assets are undocumented or poorly documented. The result: analysts spend 30-40% of their time searching for data and trying to understand what it means before they can start analyzing it.&lt;/p&gt;
&lt;p&gt;This isn&apos;t just a productivity problem. Undocumented data is a governance risk. A column named &lt;code&gt;status&lt;/code&gt; with values 0, 1, 2, and 3 could mean anything. An analyst guesses. An AI agent guesses worse. Nobody verifies. The wrong assumptions get baked into dashboards that drive business decisions.&lt;/p&gt;
&lt;p&gt;Data teams respond with documentation sprints. They burn a week writing Wiki pages for their top 50 tables. Two months later, half the descriptions are outdated because schemas have changed. The cycle repeats.&lt;/p&gt;
&lt;h2&gt;What Self-Documenting Actually Means&lt;/h2&gt;
&lt;p&gt;A self-documenting semantic layer generates and maintains documentation with minimal human effort. Three mechanisms work together:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AI-generated descriptions&lt;/strong&gt;: The platform samples data in a table and generates human-readable descriptions for each column and the table itself.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Automated label suggestions&lt;/strong&gt;: The platform analyzes column names, data types, and value patterns to suggest governance labels (PII, Finance, Certified).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Metadata propagation&lt;/strong&gt;: When a Silver view references a Bronze view, column descriptions flow downstream automatically. Documentation written once at the Bronze level appears everywhere the column is used.&lt;/p&gt;
&lt;p&gt;Human oversight is still essential. AI provides a 70% first draft. Data engineers add the domain-specific context that only they know: business rules, edge cases, known data quality issues. The point isn&apos;t to eliminate human documentation. It&apos;s to eliminate the blank page.&lt;/p&gt;
&lt;h2&gt;AI-Generated Descriptions&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/09/ai-doc-generation.png&quot; alt=&quot;AI scanning data tables and generating documentation automatically&quot;&gt;&lt;/p&gt;
&lt;p&gt;Modern semantic layer platforms can sample a table&apos;s data and generate meaningful descriptions automatically.&lt;/p&gt;
&lt;p&gt;Consider a column named &lt;code&gt;cltv&lt;/code&gt; in a table called &lt;code&gt;customers&lt;/code&gt;. The AI samples values (1200.50, 3400.00, 780.25), examines the column name and table context, and generates:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;cltv&lt;/strong&gt;: Customer Lifetime Value in USD. Represents the total revenue attributed to this customer from their first purchase to the current date, excluding refunded transactions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Not every generated description will be this precise. But most are useful enough to replace the current state: an empty description that tells the analyst nothing.&lt;/p&gt;
&lt;p&gt;More examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A column with values &amp;quot;US&amp;quot;, &amp;quot;UK&amp;quot;, &amp;quot;DE&amp;quot; → &amp;quot;ISO 3166 alpha-2 country code for the customer&apos;s billing address&amp;quot;&lt;/li&gt;
&lt;li&gt;A DATE column named &lt;code&gt;created_at&lt;/code&gt; in a &lt;code&gt;subscriptions&lt;/code&gt; table → &amp;quot;Date the subscription was created&amp;quot;&lt;/li&gt;
&lt;li&gt;A FLOAT column named &lt;code&gt;mrr&lt;/code&gt; → &amp;quot;Monthly Recurring Revenue in the account&apos;s base currency&amp;quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Automated Label Suggestions&lt;/h2&gt;
&lt;p&gt;Labels categorize data for governance and discovery. Manually tagging every column in a data warehouse with hundreds of tables is impractical. AI-based label suggestion makes it manageable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Columns containing email-like patterns (text with @ symbols) → suggested label: &lt;strong&gt;PII&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Columns with phone number patterns → suggested label: &lt;strong&gt;PII&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Columns named &lt;code&gt;price&lt;/code&gt;, &lt;code&gt;total&lt;/code&gt;, &lt;code&gt;amount&lt;/code&gt;, &lt;code&gt;revenue&lt;/code&gt; → suggested label: &lt;strong&gt;Finance&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Columns in tables marked &amp;quot;Certified&amp;quot; → suggested label propagated to downstream views&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/5-powerful-dremio-ai-features-you-should-be-using/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&apos;s approach&lt;/a&gt; combines these suggestions with human approval. The AI proposes labels. A data engineer reviews and accepts or rejects. Over time, the catalog fills up with accurate, useful labels without dedicated labeling sprints.&lt;/p&gt;
&lt;h2&gt;Metadata Propagation Through Views&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/09/metadata-propagation.png&quot; alt=&quot;Metadata flowing through Bronze, Silver, and Gold view layers&quot;&gt;&lt;/p&gt;
&lt;p&gt;In a well-designed semantic layer, documentation shouldn&apos;t need to be written more than once. The Bronze-Silver-Gold view architecture creates a natural propagation path:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Bronze layer&lt;/strong&gt;: Document the &lt;code&gt;CustomerID&lt;/code&gt; column as &amp;quot;Unique identifier for the customer, sourced from the CRM system.&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Silver layer&lt;/strong&gt;: A Silver view references &lt;code&gt;CustomerID&lt;/code&gt;. The description propagates automatically. No re-documentation needed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gold layer&lt;/strong&gt;: An aggregated Gold view groups by &lt;code&gt;CustomerID&lt;/code&gt;. The description carries through.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This propagation is especially valuable for join columns, filter columns, and commonly used dimensions that appear in dozens of views. Write the description once at the source, and it follows the column everywhere.&lt;/p&gt;
&lt;h2&gt;How This Reduces Toil&lt;/h2&gt;
&lt;p&gt;The impact on data team productivity is measurable:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Documentation Task&lt;/th&gt;
&lt;th&gt;Manual Approach&lt;/th&gt;
&lt;th&gt;Self-Documenting&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Column descriptions&lt;/td&gt;
&lt;td&gt;Write each by hand&lt;/td&gt;
&lt;td&gt;AI generates draft, human refines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Governance labels&lt;/td&gt;
&lt;td&gt;Manual tagging sprint&lt;/td&gt;
&lt;td&gt;AI suggests from data patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Downstream view docs&lt;/td&gt;
&lt;td&gt;Re-write for each view&lt;/td&gt;
&lt;td&gt;Propagated from upstream&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema change updates&lt;/td&gt;
&lt;td&gt;Manually check and update&lt;/td&gt;
&lt;td&gt;AI re-scans and flags changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New table onboarding&lt;/td&gt;
&lt;td&gt;Create from scratch&lt;/td&gt;
&lt;td&gt;AI generates baseline immediately&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The net effect: documentation coverage goes from 30% (what the team could manage manually) to 80-90% (AI baseline + human refinement). The team spends hours instead of weeks on documentation. And the documentation stays current because the AI can re-scan when schemas change : flagging outdated descriptions instead of waiting for someone to notice.&lt;/p&gt;
&lt;p&gt;For AI agents, this improvement is material. A richer, more accurate semantic layer means the AI generates better SQL, hallucinates less, and requires fewer corrections. Self-documentation isn&apos;t just a productivity feature. It&apos;s an AI accuracy feature.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Pick your most-used table. Open it in your data platform. How many columns have descriptions? How many have governance labels? If the answer is &amp;quot;not many,&amp;quot; calculate how long it would take to document the entire table manually. Then consider a platform that does 70% of that work for you.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Testing Data Pipelines: What to Validate and When</title><link>https://iceberglakehouse.com/posts/2026-02-debp-testing-data-pipelines/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-debp-testing-data-pipelines/</guid><description>
![Data pipeline testing pyramid with schema tests at the base, contract tests in the middle, and regression tests at the top](/assets/images/debp/08/...</description><pubDate>Wed, 18 Feb 2026 16:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/debp/08/testing-pyramid.png&quot; alt=&quot;Data pipeline testing pyramid with schema tests at the base, contract tests in the middle, and regression tests at the top&quot;&gt;&lt;/p&gt;
&lt;p&gt;Ask an application developer how they test their code and they&apos;ll describe unit tests, integration tests, CI/CD pipelines, and coverage metrics. Ask a data engineer the same question and the most common answer is: &amp;quot;we check the dashboard.&amp;quot;&lt;/p&gt;
&lt;p&gt;Data pipelines are software. They have inputs, logic, and outputs. They can have bugs. They can break silently. And unlike application bugs that trigger error pages, data bugs produce numbers that look plausible : until someone makes a business decision based on them.&lt;/p&gt;
&lt;h2&gt;Pipelines Are Software : They Need Tests&lt;/h2&gt;
&lt;p&gt;The bar for data pipeline testing shouldn&apos;t be lower than for application code. If anything, it should be higher. Application bugs are usually visible (broken UI, failed request). Data bugs are invisible (wrong aggregation, missing rows, stale values) and their impact compounds over time.&lt;/p&gt;
&lt;p&gt;Yet most data teams have no automated tests. They rely on manual spot-checks, analyst complaints, and hope. Testing a pipeline means catching problems before they reach consumers, not after.&lt;/p&gt;
&lt;h2&gt;The Testing Pyramid for Data&lt;/h2&gt;
&lt;p&gt;Borrow the testing pyramid from software engineering and adapt it for data:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Base: Schema and contract tests.&lt;/strong&gt; Fast, cheap, run on every pipeline execution. Does the output schema match what consumers expect? Do required columns exist? Are data types correct? These tests catch structural problems (dropped columns, type changes) immediately.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Middle: Data validation tests.&lt;/strong&gt; Check the values in the output. Are primary keys unique? Are required columns non-null? Do amounts, dates, and counts fall within valid ranges? These tests catch quality problems (duplicates, nulls, outliers) before they propagate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Top: Regression and integration tests.&lt;/strong&gt; Compare today&apos;s output to historical patterns. Did the row count change dramatically? Did the total revenue shift by more than 10%? These tests catch subtle logic errors and upstream data changes.&lt;/p&gt;
&lt;p&gt;Run more tests at the base (they&apos;re cheap and fast) and fewer at the top (they&apos;re expensive but comprehensive).&lt;/p&gt;
&lt;h2&gt;Schema and Contract Tests&lt;/h2&gt;
&lt;p&gt;Schema tests are the simplest and most impactful place to start. After every pipeline run, verify:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Column existence.&lt;/strong&gt; Every expected column is present in the output. If a transformation accidentally drops a column, you want to know immediately : not when a downstream query fails.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data types.&lt;/strong&gt; Columns have their expected types. A revenue column that silently became a string will pass a NULL check but break calculations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Not-null constraints.&lt;/strong&gt; Required columns contain no nulls. An order table where &lt;code&gt;customer_id&lt;/code&gt; is null means the join to the customer table will silently lose rows.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Uniqueness.&lt;/strong&gt; Primary key columns have no duplicates. Duplicate order IDs mean double-counted revenue.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Example schema and contract tests
-- Check for unexpected nulls
SELECT COUNT(*) AS null_count
FROM orders
WHERE order_id IS NULL OR customer_id IS NULL;

-- Check for duplicates
SELECT order_id, COUNT(*) AS cnt
FROM orders
GROUP BY order_id
HAVING COUNT(*) &amp;gt; 1;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/08/schema-tests.png&quot; alt=&quot;Schema test examples: column existence, type validation, null checks, uniqueness checks&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Runtime Data Validation&lt;/h2&gt;
&lt;p&gt;Schema tests verify structure. Data validation tests verify content. Run these after every pipeline execution, before marking the job as successful:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Range checks.&lt;/strong&gt; Numeric values fall within expected bounds. An order total of -$500 or $999,999,999 is likely a bug. Define acceptable ranges per column and flag outliers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Referential integrity.&lt;/strong&gt; Foreign keys reference existing records. An order with &lt;code&gt;product_id = 12345&lt;/code&gt; should correspond to a row in the products table. Missing references indicate either missing data or a pipeline timing issue.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Freshness checks.&lt;/strong&gt; The most recent event timestamp is within the expected window. If a daily pipeline&apos;s output contains no events from today, something went wrong : even if the job succeeded.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Volume checks.&lt;/strong&gt; Row counts fall within historical norms. A daily feed that normally produces 50,000 rows but arrives with 500 should trigger an alert. Use percentage thresholds (±20% from the trailing 7-day average) to avoid false positives.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Custom business rules.&lt;/strong&gt; Domain-specific assertions. &amp;quot;Every invoice must have at least one line item.&amp;quot; &amp;quot;No employee should have a start date in the future.&amp;quot; These rules encode business knowledge that generic tests can&apos;t capture.&lt;/p&gt;
&lt;h2&gt;Regression and Anomaly Detection&lt;/h2&gt;
&lt;p&gt;Regression tests compare today&apos;s output to historical baselines:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Aggregate comparison.&lt;/strong&gt; Compare key metrics (total revenue, row count, distinct customer count) against the previous run. Deviations beyond a threshold (e.g., ±15%) may indicate an upstream change, a bug in new transformation logic, or missing source data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Distribution checks.&lt;/strong&gt; Compare the distribution of categorical columns (status values, country codes) against historical norms. A sudden spike in &amp;quot;unknown&amp;quot; status may indicate a schema change in the source.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Trend analysis.&lt;/strong&gt; Track metrics over time. A gradual decline in row count over weeks may indicate a leak that daily checks miss.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/08/regression-testing.png&quot; alt=&quot;Regression testing: comparing aggregates, distributions, and trends over time&quot;&gt;&lt;/p&gt;
&lt;p&gt;Regression tests are more expensive to maintain because they require historical baselines and threshold tuning. Start simple (row count ± 20%) and refine as you learn what normal looks like.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Add three tests to your most critical pipeline today: a uniqueness check on the primary key, a null check on required columns, and a row count comparison against yesterday&apos;s output. Run them after every pipeline execution. These three tests alone will catch the majority of data problems before they reach consumers.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Denormalization: When and Why to Flatten Your Data</title><link>https://iceberglakehouse.com/posts/2026-02-dm-denormalization-when-why/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-dm-denormalization-when-why/</guid><description>
![Normalized model with many interconnected tables vs. denormalized wide flat table](/assets/images/data_modeling/08/denormalization-overview.png)

N...</description><pubDate>Wed, 18 Feb 2026 16:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/08/denormalization-overview.png&quot; alt=&quot;Normalized model with many interconnected tables vs. denormalized wide flat table&quot;&gt;&lt;/p&gt;
&lt;p&gt;Normalization is the first rule taught in database design. Eliminate redundancy. Store each fact once. Use foreign keys. It&apos;s the right rule for transactional systems. And it&apos;s the wrong rule for most analytics workloads.&lt;/p&gt;
&lt;p&gt;Denormalization is the deliberate introduction of redundancy into your data model to reduce joins and speed up queries. Done poorly, it creates a maintenance nightmare. Done well, it turns slow dashboards into fast ones and makes your data accessible to analysts and AI agents who can&apos;t write 12-table joins.&lt;/p&gt;
&lt;h2&gt;What Normalization Gives You (and What It Costs)&lt;/h2&gt;
&lt;p&gt;Normalization (Third Normal Form and beyond) organizes data so that each piece of information exists in exactly one place. A customer&apos;s city lives in the customers table. An order&apos;s product lives in the order_items table joined to the products table.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What normalization gives you:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No update anomalies (change a city in one row, not thousands)&lt;/li&gt;
&lt;li&gt;Smaller storage footprint (no duplicated data)&lt;/li&gt;
&lt;li&gt;Strong data integrity (constraints enforced at the schema level)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;What normalization costs you:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;More joins per query (a report might join 10-15 tables)&lt;/li&gt;
&lt;li&gt;Slower read performance (each join adds latency)&lt;/li&gt;
&lt;li&gt;More complex SQL (longer queries, more error-prone)&lt;/li&gt;
&lt;li&gt;Harder self-service (analysts struggle with multi-join queries)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For an OLTP system processing 10,000 inserts per second, normalization is correct. For an OLAP system answering &amp;quot;revenue by region by quarter,&amp;quot; it&apos;s a performance bottleneck.&lt;/p&gt;
&lt;h2&gt;What Denormalization Actually Means&lt;/h2&gt;
&lt;p&gt;Denormalization takes several forms:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Embedding dimension attributes in fact tables.&lt;/strong&gt; Instead of joining &lt;code&gt;orders → customers&lt;/code&gt; to get the customer name, include &lt;code&gt;customer_name&lt;/code&gt; directly in the orders table.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pre-joining lookup tables.&lt;/strong&gt; Instead of maintaining separate &lt;code&gt;cities&lt;/code&gt;, &lt;code&gt;states&lt;/code&gt;, and &lt;code&gt;countries&lt;/code&gt; tables, flatten them into a single column: &lt;code&gt;customer_city_state_country&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Adding calculated columns.&lt;/strong&gt; Instead of computing &lt;code&gt;quantity × price × (1 - discount)&lt;/code&gt; in every query, store &lt;code&gt;net_revenue&lt;/code&gt; as a pre-computed column.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Creating wide summary tables.&lt;/strong&gt; Instead of joining across 8 tables for a monthly report, create a &lt;code&gt;monthly_summary&lt;/code&gt; table with all needed columns in one place.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/08/denormalization-techniques.png&quot; alt=&quot;Denormalization techniques: embedding, pre-joining, calculating, and flattening into wide tables&quot;&gt;&lt;/p&gt;
&lt;p&gt;The key insight: denormalization trades write-time simplicity for read-time simplicity. Updating a customer&apos;s city now requires updating it in multiple places. But querying revenue by city no longer requires a join.&lt;/p&gt;
&lt;h2&gt;When to Denormalize&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Analytics and reporting workloads.&lt;/strong&gt; If your model primarily serves dashboards, reports, and ad-hoc queries, denormalization reduces query time and complexity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Self-service environments.&lt;/strong&gt; Business users selecting fields in a BI tool get better results from a wide, flat table than from a web of normalized tables they don&apos;t understand.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AI-driven queries.&lt;/strong&gt; When an AI agent generates SQL, fewer tables and fewer joins reduce the chance of wrong join conditions and hallucinated relationships.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Read-heavy, write-light patterns.&lt;/strong&gt; If your data loads once a day (batch ETL) and gets queried thousands of times, optimizing for reads makes sense.&lt;/p&gt;
&lt;h2&gt;When NOT to Denormalize&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;High-frequency transactional writes.&lt;/strong&gt; If your system processes real-time inserts and updates, denormalized redundancy creates update anomalies. A customer moving to a new city means updating hundreds of order rows.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When consistency matters more than speed.&lt;/strong&gt; Financial systems with audit requirements often need the strict integrity that normalization provides.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Small datasets.&lt;/strong&gt; If the query joins 5 tables with 1,000 rows each, denormalization won&apos;t improve performance noticeably. The overhead of redundancy isn&apos;t worth the marginal speed gain.&lt;/p&gt;
&lt;h2&gt;The Tradeoffs&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benefit&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fewer joins per query&lt;/td&gt;
&lt;td&gt;Update anomalies (same data in multiple places)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Faster read performance&lt;/td&gt;
&lt;td&gt;Larger storage footprint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Simpler SQL for analysts&lt;/td&gt;
&lt;td&gt;Pipeline complexity (keeping redundant data in sync)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Better BI tool compatibility&lt;/td&gt;
&lt;td&gt;Risk of inconsistency if pipelines fail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI agents write more accurate SQL&lt;/td&gt;
&lt;td&gt;More effort to maintain data quality&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Virtual Denormalization: The Middle Path&lt;/h2&gt;
&lt;p&gt;There&apos;s a way to get the query benefits of denormalization without the physical redundancy: SQL views.&lt;/p&gt;
&lt;p&gt;A view can join and flatten multiple normalized tables into a single logical table. Consumers query the view as if it&apos;s one wide table : simple SQL, no joins required. But the underlying data stays normalized. Update a customer&apos;s city in the customers table, and the view reflects the change automatically.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW v_orders_enriched AS
SELECT
    o.order_id,
    o.order_date,
    c.customer_name,
    c.city AS customer_city,
    p.product_name,
    p.category AS product_category,
    o.quantity * o.unit_price AS revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Analysts query &lt;code&gt;v_orders_enriched&lt;/code&gt; without knowing the underlying structure. The join logic is defined once and reused by everyone.&lt;/p&gt;
&lt;p&gt;The tradeoff: views execute the joins at query time. For very large datasets, this can be slow. Platforms like &lt;a href=&quot;https://www.dremio.com/blog/5-ways-dremio-reflections-outsmart-traditional-materialized-views/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt; solve this with Reflections , which physically materialize the view&apos;s results in an optimized format, updated automatically. Users still query the logical view, but the engine substitutes the pre-computed Reflection for performance. You get the simplicity of denormalization, the consistency of normalization, and the speed of materialization.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/08/virtual-denormalization.png&quot; alt=&quot;Virtual view acting as a denormalized layer over normalized source tables&quot;&gt;&lt;/p&gt;
&lt;p&gt;Identify your most-queried report or dashboard. Count the joins in the underlying SQL. If there are more than five, create a denormalized view that flattens the data. Compare query performance before and after. If the view is still too slow for your SLA, adding a materialized acceleration layer (like Reflections) closes the gap.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Headless BI: How a Universal Semantic Layer Replaces Tool-Specific Models</title><link>https://iceberglakehouse.com/posts/2026-02-sl-headless-bi-semantic-layer/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-sl-headless-bi-semantic-layer/</guid><description>
![Headless BI : one semantic layer serving all consumers](/assets/images/semantic_layer/08/headless-bi.png)

Your organization uses Tableau for execu...</description><pubDate>Wed, 18 Feb 2026 16:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/08/headless-bi.png&quot; alt=&quot;Headless BI : one semantic layer serving all consumers&quot;&gt;&lt;/p&gt;
&lt;p&gt;Your organization uses Tableau for executive dashboards, Power BI for operational reports, and Python notebooks for data science. Revenue is defined in Tableau&apos;s calculated field, Power BI&apos;s DAX measure, and a SQL query inside a Jupyter notebook. Three tools. Three definitions. None of them match.&lt;/p&gt;
&lt;p&gt;This is what happens when semantic models are locked inside BI tools. Headless BI fixes it by pulling the definitions out.&lt;/p&gt;
&lt;h2&gt;The Problem with Tool-Specific Semantic Models&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/08/tool-lock-in.png&quot; alt=&quot;BI tool lock-in : metrics trapped in isolated silos&quot;&gt;&lt;/p&gt;
&lt;p&gt;Every major BI tool comes with its own modeling layer. Looker has LookML. Tableau has the Data Model. Power BI has DAX and the tabular model. Each one defines metrics, relationships, and calculated fields in a proprietary format.&lt;/p&gt;
&lt;p&gt;This creates three problems:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Definition duplication.&lt;/strong&gt; Every metric must be defined in every tool. Revenue in Tableau. Revenue in Power BI. Revenue in the data science notebook. When the formula changes (say, a new exclusion rule is added), you update it in three places. Or you forget one, and your dashboards disagree.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tool lock-in.&lt;/strong&gt; Your metric definitions are trapped inside the tool&apos;s proprietary format. Switching from Tableau to a different visualization layer means rebuilding every metric from scratch. The data model doesn&apos;t migrate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AI agent exclusion.&lt;/strong&gt; When you add an AI agent to your stack, it can&apos;t access the Looker LookML definitions or the Power BI DAX measures. It has no semantic model to work with. It generates SQL based on raw table schemas and gets the formulas wrong.&lt;/p&gt;
&lt;h2&gt;What Headless BI Means&lt;/h2&gt;
&lt;p&gt;Headless BI is an architecture pattern where metric definitions and business logic are decoupled from the visualization layer. The &amp;quot;head&amp;quot; (the dashboard or chart) is separate from the &amp;quot;body&amp;quot; (the semantic definitions).&lt;/p&gt;
&lt;p&gt;In a headless architecture:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Metrics are defined once in a platform-neutral semantic layer&lt;/li&gt;
&lt;li&gt;Definitions are exposed via standard interfaces: SQL, JDBC, ODBC, Arrow Flight, REST&lt;/li&gt;
&lt;li&gt;Any tool :  Tableau, Power BI, Python, an AI agent, a custom app ,  connects to the same definitions&lt;/li&gt;
&lt;li&gt;Adding a new visualization tool requires zero metric migration&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The semantic layer becomes a shared service. Visualization tools consume it. They don&apos;t own it.&lt;/p&gt;
&lt;h2&gt;Tool-Specific vs. Universal Semantic Layer&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Tool-Specific Model&lt;/th&gt;
&lt;th&gt;Universal Semantic Layer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Where metrics are defined&lt;/td&gt;
&lt;td&gt;Inside each BI tool&lt;/td&gt;
&lt;td&gt;Centralized, tool-independent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Number of Revenue definitions&lt;/td&gt;
&lt;td&gt;One per tool&lt;/td&gt;
&lt;td&gt;One total&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Formula change process&lt;/td&gt;
&lt;td&gt;Update every tool&lt;/td&gt;
&lt;td&gt;Update once, propagates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New tool onboarding&lt;/td&gt;
&lt;td&gt;Rebuild all definitions&lt;/td&gt;
&lt;td&gt;Connect and query&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI agent access&lt;/td&gt;
&lt;td&gt;No (locked in BI format)&lt;/td&gt;
&lt;td&gt;Yes (standard SQL interface)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Portability&lt;/td&gt;
&lt;td&gt;Vendor-locked&lt;/td&gt;
&lt;td&gt;Open and interoperable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;What Composable Analytics Looks Like&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/08/composable-analytics.png&quot; alt=&quot;Composable analytics : modular blocks snapping together&quot;&gt;&lt;/p&gt;
&lt;p&gt;Headless BI is one piece of a broader shift called &lt;strong&gt;composable analytics&lt;/strong&gt;. Instead of buying a monolithic BI platform that bundles data modeling, metric definitions, and visualizations together, you assemble your analytics stack from modular, interchangeable components.&lt;/p&gt;
&lt;p&gt;The semantic layer is the metric module. Choose any visualization tool on top. Choose any data storage underneath. Swap components without rebuilding definitions.&lt;/p&gt;
&lt;p&gt;This modularity matters most for AI. An AI agent becomes a first-class consumer of the semantic layer, alongside dashboards and notebooks. It connects to the same interface, reads the same metric definitions, and gets the same answers. No special integration needed.&lt;/p&gt;
&lt;h2&gt;How This Works in Practice&lt;/h2&gt;
&lt;p&gt;Dremio functions as a universal semantic layer that any tool can consume. The architecture:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Virtual datasets (SQL views)&lt;/strong&gt; define business logic and metric calculations once&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wikis and Labels&lt;/strong&gt; document business context for human and AI consumers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fine-Grained Access Control&lt;/strong&gt; enforces security policies at the query level&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reflections&lt;/strong&gt; optimize performance automatically for any consumer&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Connection options include ODBC, JDBC, Arrow Flight (for columnar high-speed clients), and REST API. A Tableau dashboard connects via ODBC. A Python notebook connects via Arrow Flight. Dremio&apos;s AI Agent &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;reads the Wikis and Labels&lt;/a&gt; to generate accurate SQL from natural language. All three hit the same virtual datasets. All three get the same answers.&lt;/p&gt;
&lt;p&gt;Because the entire semantic layer is built on open standards (Apache Iceberg for data, Apache Polaris for the catalog), the definitions aren&apos;t locked to Dremio&apos;s format. You can inspect, export, and query the same data with any Iceberg-compatible engine.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Count the number of places your organization defines its top metric (probably Revenue or Monthly Active Users). If that number is greater than one, you&apos;re paying a consistency tax every time someone changes the formula. A universal semantic layer eliminates that tax by defining it once and serving it everywhere.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Partition and Organize Data for Performance</title><link>https://iceberglakehouse.com/posts/2026-02-debp-partition-and-organize/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-debp-partition-and-organize/</guid><description>
![Table data split into partitions by date with query scanning only the relevant partition](/assets/images/debp/07/partition-overview.png)

A table w...</description><pubDate>Wed, 18 Feb 2026 15:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/debp/07/partition-overview.png&quot; alt=&quot;Table data split into partitions by date with query scanning only the relevant partition&quot;&gt;&lt;/p&gt;
&lt;p&gt;A table with 500 million rows takes 45 seconds to query. After partitioning it by date, the same query :  filtering on a single day ,  returns in 2 seconds. The SQL didn&apos;t change. The data didn&apos;t change. The only thing that changed was how the data was organized on disk.&lt;/p&gt;
&lt;p&gt;Performance in analytical workloads is almost never about faster hardware. It&apos;s about reading less data.&lt;/p&gt;
&lt;h2&gt;Read Less Data, Run Faster Queries&lt;/h2&gt;
&lt;p&gt;Analytical query engines scan data to answer queries. A full table scan reads every row, every column. But most queries only need a fraction of the data: this week&apos;s transactions, this region&apos;s customers, this product category&apos;s sales.&lt;/p&gt;
&lt;p&gt;Partitioning and data organization let the engine skip irrelevant data entirely:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Partition pruning.&lt;/strong&gt; The engine reads only the partitions that match the query&apos;s WHERE clause.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Column pruning.&lt;/strong&gt; Columnar formats (Parquet, ORC) read only the requested columns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Predicate pushdown.&lt;/strong&gt; Min/max statistics in file metadata let the engine skip files whose value ranges don&apos;t match the filter.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Combined, these techniques can reduce the data scanned from terabytes to megabytes. The fastest query is the one that reads the least data.&lt;/p&gt;
&lt;h2&gt;Partitioning Strategies&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Time-based partitioning.&lt;/strong&gt; Partition by date, hour, or month. This is the most common strategy because most analytical queries filter by time. A daily partition structure means a query for &amp;quot;last week&amp;quot; reads 7 partitions instead of scanning the entire table.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Value-based partitioning.&lt;/strong&gt; Partition by a categorical column: region, source system, customer tier. This works when queries consistently filter on that column. A multi-tenant application might partition by tenant ID so each tenant&apos;s queries touch only their data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hash-based partitioning.&lt;/strong&gt; Distribute data evenly across N buckets using a hash function on a key column. This is useful for join-heavy workloads: two tables hashed on the same join key can be joined partition-to-partition without shuffling data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Composite partitioning.&lt;/strong&gt; Combine strategies: partition by date, then bucket by customer ID within each date. This handles queries that filter on date and join on customer ID.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Choosing the right strategy:&lt;/strong&gt; Look at your most frequent queries. What columns appear in WHERE clauses and JOIN conditions? Those are your partition candidates. If 90% of queries filter by date, partition by date.&lt;/p&gt;
&lt;h2&gt;File-Level Organization&lt;/h2&gt;
&lt;p&gt;Partitioning controls which directory the query engine reads. File-level organization controls how efficiently it reads within that directory.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sorting.&lt;/strong&gt; Sort rows within each file by a frequently filtered column. If queries often filter &lt;code&gt;WHERE status = &apos;active&apos;&lt;/code&gt;, sorting by status clusters active rows together. The engine reads min/max metadata, sees that a file&apos;s status range is only &apos;active&apos;, and skips files that don&apos;t match.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/07/sorted-files.png&quot; alt=&quot;Sorted data within partitions enabling file-level skip based on min/max metadata&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Z-ordering.&lt;/strong&gt; When queries filter on multiple columns, linear sorting optimizes for only one. Z-ordering interleaves the sort order across multiple columns, enabling predicate pushdown on any combination of the Z-ordered columns. It&apos;s especially effective for 2-3 column filter combinations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;File sizing.&lt;/strong&gt; Target file sizes between 128 MB and 1 GB. Files too small (&amp;lt; 10 MB) create metadata overhead and excessive file-open operations. Files too large (&amp;gt; 2 GB) reduce parallelism and waste I/O when only a fraction of the file is needed.&lt;/p&gt;
&lt;h2&gt;Compaction: The Maintenance Task You Can&apos;t Skip&lt;/h2&gt;
&lt;p&gt;Streaming writes and frequent small batch appends create many small files. A partition with 10,000 files of 1 MB each is dramatically slower to query than the same data in 10 files of 1 GB each.&lt;/p&gt;
&lt;p&gt;Compaction merges small files into optimally-sized files. It&apos;s the data equivalent of defragmenting a disk.&lt;/p&gt;
&lt;p&gt;Run compaction:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;After streaming writes accumulate small files&lt;/li&gt;
&lt;li&gt;After many small batch appends&lt;/li&gt;
&lt;li&gt;On a regular schedule (daily or weekly) for active partitions&lt;/li&gt;
&lt;li&gt;Targeted at partitions where file counts exceed a threshold&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Compaction also provides an opportunity to re-sort data within files, clean up deleted records (in formats that use soft deletes like Iceberg and Delta), and update file-level statistics.&lt;/p&gt;
&lt;h2&gt;Common Partitioning Mistakes&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Over-partitioning.&lt;/strong&gt; Partitioning by a high-cardinality column (user ID, transaction ID) creates millions of partitions, each with a few rows. The engine spends more time listing and opening files than reading data. Rule of thumb: keep individual partition sizes above 100 MB.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Under-partitioning.&lt;/strong&gt; A single partition for the entire table means every query scans everything. If your table has billions of rows and no partitions, even simple queries are slow.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Misaligned partitions.&lt;/strong&gt; Partitioning by month when every query filters by day means the engine reads an entire month&apos;s data for a single-day query. Align partition granularity with query granularity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ignoring compaction.&lt;/strong&gt; Streaming into a table without compacting creates the small-file problem. Query performance degrades gradually until someone notices. Schedule compaction as part of pipeline maintenance.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/07/partition-mistakes.png&quot; alt=&quot;Common mistakes: too many partitions, wrong partition key, no compaction&quot;&gt;&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Identify your slowest analytical query. Check the table&apos;s partitioning strategy. If the table has no partitions, add one aligned with the query&apos;s most common WHERE clause. If it&apos;s already partitioned, check file sizes : if the average file is under 10 MB, run compaction. Measure before and after.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Data Modeling for Analytics: Optimize for Queries, Not Transactions</title><link>https://iceberglakehouse.com/posts/2026-02-dm-data-modeling-for-analytics/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-dm-data-modeling-for-analytics/</guid><description>
![OLTP normalized model vs. OLAP denormalized model side by side](/assets/images/data_modeling/07/analytics-data-modeling.png)

The data model that r...</description><pubDate>Wed, 18 Feb 2026 15:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/07/analytics-data-modeling.png&quot; alt=&quot;OLTP normalized model vs. OLAP denormalized model side by side&quot;&gt;&lt;/p&gt;
&lt;p&gt;The data model that runs your production application is almost never the right model for analytics. Transactional systems are designed for fast writes :  inserting orders, updating inventory, processing payments. Analytics systems are designed for fast reads ,  scanning millions of rows, aggregating across dimensions, filtering by date ranges.&lt;/p&gt;
&lt;p&gt;Using a transactional model for analytics is like using a filing cabinet when you need a search engine. The data is there, but finding answers takes too long.&lt;/p&gt;
&lt;h2&gt;Transactions vs. Analytics: Two Different Problems&lt;/h2&gt;
&lt;p&gt;Transactional (OLTP) workloads process many small operations: insert one order, update one account balance, delete one expired session. These models are normalized to Third Normal Form (3NF) or beyond : every piece of data stored once, redundancy eliminated, consistency enforced through constraints.&lt;/p&gt;
&lt;p&gt;Analytical (OLAP) workloads process few large operations: scan all orders for the last year, aggregate revenue by region and product category, calculate year-over-year growth. These models are denormalized : data is pre-joined, attributes are flattened, and the structure is optimized for scans rather than updates.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;OLTP Model&lt;/th&gt;
&lt;th&gt;OLAP Model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Optimization target&lt;/td&gt;
&lt;td&gt;Write speed&lt;/td&gt;
&lt;td&gt;Read speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Normalization&lt;/td&gt;
&lt;td&gt;3NF or higher&lt;/td&gt;
&lt;td&gt;Denormalized&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Table structure&lt;/td&gt;
&lt;td&gt;Narrow and many&lt;/td&gt;
&lt;td&gt;Wide and few&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Joins per query&lt;/td&gt;
&lt;td&gt;Many (10-20)&lt;/td&gt;
&lt;td&gt;Few (3-5)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage format&lt;/td&gt;
&lt;td&gt;Row-oriented&lt;/td&gt;
&lt;td&gt;Columnar&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Typical query&lt;/td&gt;
&lt;td&gt;UPDATE one row&lt;/td&gt;
&lt;td&gt;SUM across millions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Why Normalized Models Slow Down Analytics&lt;/h2&gt;
&lt;p&gt;A normalized 3NF model might have 15 tables involved in answering &amp;quot;What was revenue by product category by month?&amp;quot; The query engine must join orders to order_items to products to categories to dates, applying filters and aggregations across each join.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/07/normalized-vs-denormalized-query.png&quot; alt=&quot;Chain of joins through normalized tables versus one wide scan through a denormalized table&quot;&gt;&lt;/p&gt;
&lt;p&gt;Each join adds latency. Each join also adds a point of failure : wrong join condition, missing foreign key, ambiguous column name. An AI agent generating SQL against a 15-table normalized model has far more opportunities to make a mistake than against a 4-table star schema.&lt;/p&gt;
&lt;p&gt;The fix is not to abandon normalization. Keep your OLTP model normalized for your application. But create a separate analytical model :  denormalized, structured for queries, with pre-built joins and business-friendly column names ,  for reporting and analytics.&lt;/p&gt;
&lt;h2&gt;Designing for Read Performance&lt;/h2&gt;
&lt;p&gt;Analytical data models follow several patterns that optimize for read performance:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Wide tables reduce joins.&lt;/strong&gt; Instead of &lt;code&gt;orders → customers → addresses → cities → states&lt;/code&gt;, create a single &lt;code&gt;fact_orders&lt;/code&gt; view with &lt;code&gt;customer_name&lt;/code&gt;, &lt;code&gt;customer_city&lt;/code&gt;, &lt;code&gt;customer_state&lt;/code&gt; included. Every join you eliminate saves query time and reduces complexity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pre-computed columns reduce repeated calculations.&lt;/strong&gt; If every report calculates &lt;code&gt;quantity * unit_price * (1 - discount)&lt;/code&gt; as &amp;quot;net revenue,&amp;quot; compute it once in the model and expose it as a column. This eliminates repeated formula definitions and ensures consistency.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Consistent naming improves discoverability.&lt;/strong&gt; Use &lt;code&gt;order_date&lt;/code&gt; instead of &lt;code&gt;dt&lt;/code&gt;. Use &lt;code&gt;customer_email&lt;/code&gt; instead of &lt;code&gt;email&lt;/code&gt;. When column names are self-explanatory, analysts find the right data faster, and AI agents generate more accurate SQL.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Date dimensions enable time-based analysis.&lt;/strong&gt; A date dimension with &lt;code&gt;fiscal_quarter&lt;/code&gt;, &lt;code&gt;is_weekend&lt;/code&gt;, &lt;code&gt;is_holiday&lt;/code&gt;, and &lt;code&gt;week_of_year&lt;/code&gt; makes time-based filtering trivial. Without it, every analyst writes a different &lt;code&gt;CASE WHEN MONTH(date) IN (1,2,3) THEN &apos;Q1&apos;&lt;/code&gt; expression.&lt;/p&gt;
&lt;h2&gt;Pre-Aggregation and Summary Tables&lt;/h2&gt;
&lt;p&gt;Not every query needs to scan raw data. For frequently run aggregations, pre-aggregated summary tables reduce query time from minutes to milliseconds.&lt;/p&gt;
&lt;p&gt;Common patterns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Daily summary&lt;/strong&gt;: Total revenue, order count, average order value per day per product category&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monthly snapshot&lt;/strong&gt;: Active customers, churned customers, MRR per segment&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rolling window&lt;/strong&gt;: 7-day and 30-day moving averages for key metrics&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The tradeoff is maintenance. Every summary table needs a refresh pipeline, and stale summaries produce outdated numbers.&lt;/p&gt;
&lt;p&gt;Platforms like &lt;a href=&quot;https://www.dremio.com/blog/5-ways-dremio-reflections-outsmart-traditional-materialized-views/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt; handle this automatically with Reflections : pre-computed aggregations and materializations that the query optimizer uses transparently. Users query the logical views; Dremio substitutes the fastest Reflection without the user knowing. No manual summary table management required.&lt;/p&gt;
&lt;h2&gt;Columnar Storage and Physical Layout&lt;/h2&gt;
&lt;p&gt;Analytics models benefit from columnar storage formats like Parquet:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column pruning&lt;/strong&gt;: Queries that touch 5 of 50 columns only read those 5 columns from disk&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compression&lt;/strong&gt;: Repeated values in a column (category names, status codes) compress efficiently&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vectorized processing&lt;/strong&gt;: Engines like Dremio (built on Apache Arrow) process columnar data in CPU-cache-friendly batches&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Physical layout decisions that matter:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Partition by time&lt;/strong&gt;: Most analytics queries filter by date range. Partitioning by month or day lets the engine skip irrelevant data files entirely.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sort by high-cardinality filters&lt;/strong&gt;: If queries frequently filter by &lt;code&gt;customer_id&lt;/code&gt; or &lt;code&gt;region&lt;/code&gt;, sorting data within partitions enables min/max pruning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compact regularly&lt;/strong&gt;: Small files from streaming inserts slow down scan performance. Compaction rewrites small files into larger, optimized ones.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/07/analytics-architecture.png&quot; alt=&quot;Analytics model with wide tables, pre-aggregations, and columnar storage feeding dashboards&quot;&gt;&lt;/p&gt;
&lt;p&gt;Find your slowest dashboard. Look at the queries behind it. Count the joins, measure the scan size, and check whether the model is normalized 3NF or denormalized for analytics. If it&apos;s still using the transactional model, create an analytical view layer on top : a denormalized star schema with pre-computed columns, clear naming, and a date dimension. The dashboard performance improvement is usually immediate and significant.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Data Virtualization and the Semantic Layer: Query Without Copying</title><link>https://iceberglakehouse.com/posts/2026-02-sl-data-virtualization-semantic-layer/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-sl-data-virtualization-semantic-layer/</guid><description>
![Data virtualization : connecting sources to a unified semantic layer without copying](/assets/images/semantic_layer/07/data-virtualization.png)

Ev...</description><pubDate>Wed, 18 Feb 2026 15:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/07/data-virtualization.png&quot; alt=&quot;Data virtualization : connecting sources to a unified semantic layer without copying&quot;&gt;&lt;/p&gt;
&lt;p&gt;Every data pipeline you build to move data from one system to another costs you three things: time to build it, money to run it, and freshness you lose while waiting for the next sync. Most analytics architectures accept this cost as unavoidable. It isn&apos;t.&lt;/p&gt;
&lt;p&gt;Data virtualization eliminates the movement. A semantic layer adds meaning and governance on top. Together, they give you a complete analytics layer over distributed data without copying a single table.&lt;/p&gt;
&lt;h2&gt;The Data Movement Tax&lt;/h2&gt;
&lt;p&gt;Traditional analytics architecture looks like this: data lives in operational databases, SaaS tools, and cloud storage. To analyze it, you extract it, transform it, and load it into a central warehouse. Every source gets an ETL pipeline. Every pipeline needs monitoring, error handling, and scheduling.&lt;/p&gt;
&lt;p&gt;The result: your analytics are always behind your operational data. The warehouse reflects what happened as of the last sync, not what&apos;s happening now. You pay for storage in both the source and the warehouse. And when you add a new source, you add a new pipeline.&lt;/p&gt;
&lt;p&gt;This model made sense when compute was expensive and storage was local. In a cloud-native world where compute is elastic and storage is cheap, the calculus changes.&lt;/p&gt;
&lt;h2&gt;What Data Virtualization Does&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/07/etl-vs-virtual.png&quot; alt=&quot;ETL pipelines vs. data virtualization : physical movement vs. lightweight connections&quot;&gt;&lt;/p&gt;
&lt;p&gt;Data virtualization lets you query data where it lives. Instead of copying data to a central location, you connect to each source and issue queries directly. A virtualization engine translates your SQL into the source&apos;s native protocol (JDBC for databases, S3 API for object storage, REST for SaaS), retrieves the data, and combines results from multiple sources into a single result set.&lt;/p&gt;
&lt;p&gt;From the user&apos;s perspective, all data appears in one unified namespace. A PostgreSQL production database, an S3 data lake full of Parquet files, and a Snowflake analytics warehouse all look like tables in the same catalog.&lt;/p&gt;
&lt;p&gt;The keyword is &amp;quot;no replication.&amp;quot; The data stays where it is. The queries go to the data, not the other way around.&lt;/p&gt;
&lt;h2&gt;What a Semantic Layer Adds on Top&lt;/h2&gt;
&lt;p&gt;Virtualization solves the access problem. But access without context is dangerous. Raw access to 50 federated sources means 50 sources where analysts can write conflicting metric formulas, join tables incorrectly, and query sensitive columns without authorization.&lt;/p&gt;
&lt;p&gt;A semantic layer added on top of virtualization provides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Metric definitions&lt;/strong&gt;: &amp;quot;Revenue&amp;quot; is calculated the same way regardless of which source the data comes from&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documentation&lt;/strong&gt;: Wikis describe what each federated table and column represent in business terms&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Join paths&lt;/strong&gt;: Pre-defined relationships prevent analysts from guessing how tables connect&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Access policies&lt;/strong&gt;: Row-level security and column masking enforced at the view level, even for sources that have no fine-grained access controls of their own&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The combination is powerful: you get real-time access to all your data (virtualization) with consistent meaning and governance (semantic layer), and without data movement (no ETL).&lt;/p&gt;
&lt;h2&gt;Why They&apos;re Stronger Together&lt;/h2&gt;
&lt;p&gt;Each technology is useful alone. Together, they cover gaps neither can fill individually:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Virtualization Only&lt;/th&gt;
&lt;th&gt;Semantic Layer Only&lt;/th&gt;
&lt;th&gt;Both Together&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Access distributed data&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (limited to centralized data)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business definitions&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Governance enforcement&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zero data movement&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-time access&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Depends on data freshness&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unified namespace&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Virtualization without a semantic layer gives you raw SQL access to everything. Powerful for engineers. Risky for an organization. No metric consistency, no governance, no documentation.&lt;/p&gt;
&lt;p&gt;A semantic layer without virtualization covers only the data that&apos;s been moved to the platform&apos;s native storage. Every source that hasn&apos;t been ETL&apos;d is invisible to the layer. You get great governance over a subset of your data, and no governance over the rest.&lt;/p&gt;
&lt;h2&gt;How It Works in Practice&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/why-agentic-analytics-requires-federation-virtualization-and-the-lakehouse-how-dremio-delivers/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt; is built on this architecture natively. It combines a high-performance virtualization engine (supporting 30+ source types including S3, ADLS, PostgreSQL, MySQL, MongoDB, Snowflake, and Redshift) with a full semantic layer (virtual datasets, Wikis, Labels, Fine-Grained Access Control).&lt;/p&gt;
&lt;p&gt;A practical query flow:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;An analyst queries &lt;code&gt;business.revenue_by_region&lt;/code&gt; : a virtual dataset (view)&lt;/li&gt;
&lt;li&gt;Dremio&apos;s optimizer determines that this view joins data from PostgreSQL (customer records) and S3/Iceberg (order transactions)&lt;/li&gt;
&lt;li&gt;Predicate pushdowns push filter logic to each source (e.g., date range filters applied at the source)&lt;/li&gt;
&lt;li&gt;Results are combined using Apache Arrow&apos;s columnar format (zero serialization overhead)&lt;/li&gt;
&lt;li&gt;Row-level security filters the results based on the analyst&apos;s role&lt;/li&gt;
&lt;li&gt;If a Reflection (pre-computed copy) exists, Dremio substitutes it transparently for faster performance&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The analyst sees one table. Behind it, two sources, one semantic layer, and automatic performance optimization.&lt;/p&gt;
&lt;h2&gt;When to Virtualize vs. When to Materialize&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/07/virtualize-materialize.png&quot; alt=&quot;Virtualize vs. materialize decision framework&quot;&gt;&lt;/p&gt;
&lt;p&gt;Not every query should hit the source directly. The right architecture uses both strategies:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Virtualize when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The data changes frequently and freshness matters&lt;/li&gt;
&lt;li&gt;The dataset is queried infrequently (monthly reports, ad-hoc exploration)&lt;/li&gt;
&lt;li&gt;Compliance requires data to stay in its source system&lt;/li&gt;
&lt;li&gt;You&apos;re evaluating a new source before committing to a pipeline&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Materialize when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Multiple dashboards query the same dataset hundreds of times daily&lt;/li&gt;
&lt;li&gt;Joins across sources are slow because of network latency&lt;/li&gt;
&lt;li&gt;Table-level optimizations (compaction, partitioning, clustering) would improve performance&lt;/li&gt;
&lt;li&gt;AI workloads need scan-heavy access to large datasets&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The practical strategy: start every source as a federated (virtual) connection. Monitor query frequency and performance. When a dataset crosses the line into &amp;quot;queried daily by multiple teams,&amp;quot; materialize it as an Apache Iceberg table. Dremio&apos;s Reflections automate this for the most common query patterns, creating materialized copies that the optimizer uses transparently.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Count your current ETL pipelines. For each one, ask: does the destination system need a physical copy of this data, or does it just need to query it? Every pipeline that exists purely for query access is a candidate for virtualization. Replace the pipeline with a federated connection, add a semantic layer for context, and watch your infrastructure costs drop.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Batch vs. Streaming: Choose the Right Processing Model</title><link>https://iceberglakehouse.com/posts/2026-02-debp-batch-vs-streaming/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-debp-batch-vs-streaming/</guid><description>
![Batch processing in scheduled groups vs streaming in continuous flow](/assets/images/debp/06/batch-vs-streaming.png)

&quot;We need real-time data.&quot; Thi...</description><pubDate>Wed, 18 Feb 2026 14:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/debp/06/batch-vs-streaming.png&quot; alt=&quot;Batch processing in scheduled groups vs streaming in continuous flow&quot;&gt;&lt;/p&gt;
&lt;p&gt;&amp;quot;We need real-time data.&amp;quot; This is one of the most expensive sentences in data engineering : because it&apos;s rarely true, and implementing it when it&apos;s not needed multiplies complexity, cost, and operational burden.&lt;/p&gt;
&lt;p&gt;The question isn&apos;t &amp;quot;should we use streaming?&amp;quot; The question is &amp;quot;how fresh does the data actually need to be, and what are we willing to pay for that freshness?&amp;quot;&lt;/p&gt;
&lt;h2&gt;The Question Isn&apos;t &amp;quot;Real-Time or Not&amp;quot; : It&apos;s &amp;quot;How Fresh?&amp;quot;&lt;/h2&gt;
&lt;p&gt;Freshness requirements exist on a spectrum:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Daily&lt;/strong&gt; (24-hour latency): Fine for financial reporting, historical trend analysis, ML training datasets&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hourly&lt;/strong&gt; (1-hour latency): Adequate for operational dashboards, inventory tracking, marketing attribution&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Near-real-time&lt;/strong&gt; (1-15 minutes): Sufficient for user activity feeds, recommendation updates, alerting&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-time&lt;/strong&gt; (sub-second): Required for fraud detection, stock trading, IoT safety systems&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Most &amp;quot;we need real-time&amp;quot; requests are actually &amp;quot;we need hourly&amp;quot; or &amp;quot;we need 5-minute&amp;quot; requests. Clarifying the actual latency requirement before choosing an architecture prevents overengineering.&lt;/p&gt;
&lt;h2&gt;When Batch Wins&lt;/h2&gt;
&lt;p&gt;Batch processing is the default choice. Choose it unless you have a specific, justified reason to stream.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Simpler failure recovery.&lt;/strong&gt; A batch job fails at 3 AM. You fix the bug, rerun the job, and it reprocesses the same bounded dataset. Recovery is predictable and testable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Easier testing.&lt;/strong&gt; Given input dataset X, the output should be Y. You can version test datasets, run them locally, and assert exact outputs. Streaming test scenarios require simulating time, ordering, and late-arriving events : dramatically harder.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Lower operational cost.&lt;/strong&gt; Batch jobs run on schedule, consume resources during execution, and release them when done. Streaming jobs run continuously, consuming resources 24/7 even during low-volume periods.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Better tooling maturity.&lt;/strong&gt; SQL-based transformations, orchestrators with DAG visualization, version-controlled dbt models : the batch ecosystem is deeper and more mature for most data warehouse workloads.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt; Daily/hourly analytics, data warehouse loading, ML training data, compliance reporting, historical backfills.&lt;/p&gt;
&lt;h2&gt;When Streaming Wins&lt;/h2&gt;
&lt;p&gt;Streaming processing is the right choice when latency is measured in seconds and the cost of stale data is high.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fraud detection.&lt;/strong&gt; You can&apos;t batch-process credit card transactions once an hour. By the time you detect a fraudulent pattern, thousands of dollars are already gone. Fraud detection needs event-by-event evaluation in real time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;IoT and safety systems.&lt;/strong&gt; A temperature sensor in a chemical plant detecting an abnormal reading can&apos;t wait for the next hourly batch. Alerting must happen in seconds.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Real-time personalization.&lt;/strong&gt; Showing a user recommendations based on what they did 30 seconds ago requires streaming user events through a recommendation engine.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Operational systems.&lt;/strong&gt; Inventory management, ride-sharing pricing, and live logistics tracking all need sub-minute data freshness to function correctly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt; Event-driven business logic, sub-second alerting, real-time user-facing features.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/06/latency-spectrum.png&quot; alt=&quot;Spectrum from batch to streaming with example use cases at each latency level&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The Micro-Batch Middle Ground&lt;/h2&gt;
&lt;p&gt;Micro-batch processing runs batch jobs at very short intervals : every 1, 5, or 15 minutes. It captures most of the value of streaming with the simplicity of batch.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Same tools, shorter intervals.&lt;/strong&gt; Your existing batch infrastructure (SQL transformations, orchestrators, testing frameworks) works unchanged. You just schedule runs more frequently.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Most use cases are satisfied.&lt;/strong&gt; An operational dashboard refreshing every 5 minutes feels &amp;quot;real-time&amp;quot; to most business users. Marketing attribution updating every 15 minutes is fresh enough for campaign optimization.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Significantly lower complexity.&lt;/strong&gt; No stream processing framework to learn. No state management. No watermark configuration. No event ordering challenges.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The tradeoff:&lt;/strong&gt; Micro-batch cannot achieve sub-second latency. If you genuinely need event-by-event processing under one second, you need a streaming framework.&lt;/p&gt;
&lt;h2&gt;A Decision Framework&lt;/h2&gt;
&lt;p&gt;Before choosing between batch, micro-batch, and streaming, answer these questions:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Batch&lt;/th&gt;
&lt;th&gt;Micro-batch&lt;/th&gt;
&lt;th&gt;Streaming&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Required latency&lt;/td&gt;
&lt;td&gt;Hours&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;td&gt;Seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost of stale data&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Team streaming expertise&lt;/td&gt;
&lt;td&gt;Not needed&lt;/td&gt;
&lt;td&gt;Not needed&lt;/td&gt;
&lt;td&gt;Required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operational budget&lt;/td&gt;
&lt;td&gt;Lowest&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recovery complexity&lt;/td&gt;
&lt;td&gt;Simple rerun&lt;/td&gt;
&lt;td&gt;Simple rerun&lt;/td&gt;
&lt;td&gt;Complex&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Start with batch.&lt;/strong&gt; If stakeholders say &amp;quot;we need real-time,&amp;quot; ask &amp;quot;what&apos;s the cost of a 15-minute delay?&amp;quot; If the answer is &amp;quot;that&apos;s fine,&amp;quot; micro-batch gives you near-real-time at batch-level complexity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Upgrade to streaming only when justified.&lt;/strong&gt; Sub-second latency requirements, event-driven business logic, and high-volume event processing are legitimate streaming use cases. &amp;quot;I want the dashboard to update faster&amp;quot; is usually not.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/06/decision-framework.png&quot; alt=&quot;Decision framework: start batch, upgrade to micro-batch, stream only when sub-second needed&quot;&gt;&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;List every pipeline in your platform and categorize it by actual (not requested) latency requirement. You&apos;ll likely find that 80% or more of your workloads are well-served by batch or micro-batch. Focus streaming investment on the 20% that genuinely needs it.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Slowly Changing Dimensions: Types 1-3 with Examples</title><link>https://iceberglakehouse.com/posts/2026-02-dm-slowly-changing-dimensions/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-dm-slowly-changing-dimensions/</guid><description>
![Dimension timeline showing attribute values changing across time periods](/assets/images/data_modeling/06/slowly-changing-dimensions.png)

Dimensio...</description><pubDate>Wed, 18 Feb 2026 14:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/06/slowly-changing-dimensions.png&quot; alt=&quot;Dimension timeline showing attribute values changing across time periods&quot;&gt;&lt;/p&gt;
&lt;p&gt;Dimensions change. A customer moves cities. A product gets reclassified. An employee changes departments. How your data model handles these changes determines whether your historical reports are accurate or misleading.&lt;/p&gt;
&lt;p&gt;Slowly Changing Dimensions (SCDs) are design patterns for managing dimension attribute changes over time. The three most common types :  overwrite, track history, and track one change ,  each make a different tradeoff between simplicity and historical accuracy.&lt;/p&gt;
&lt;h2&gt;Why Dimensions Change&lt;/h2&gt;
&lt;p&gt;Dimension tables store descriptive attributes: customer addresses, product categories, employee titles. These attributes don&apos;t stay constant. A customer who was in &amp;quot;New York&amp;quot; last quarter is now in &amp;quot;Chicago.&amp;quot; A product that was in &amp;quot;Accessories&amp;quot; is now in &amp;quot;Electronics.&amp;quot;&lt;/p&gt;
&lt;p&gt;If your fact table recorded sales tied to that customer, do last quarter&apos;s reports show &amp;quot;New York&amp;quot; (where the customer was at the time of the sale) or &amp;quot;Chicago&amp;quot; (where the customer is now)? The answer depends on your SCD type.&lt;/p&gt;
&lt;h2&gt;Type 1: Overwrite the Old Value&lt;/h2&gt;
&lt;p&gt;Type 1 updates the dimension row in place. The old value is gone.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;UPDATE dim_customers
SET city = &apos;Chicago&apos;
WHERE customer_id = 1042;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After this update, every historical fact associated with customer 1042 now appears under &amp;quot;Chicago&amp;quot; : including sales that happened when the customer was in New York.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When to use Type 1:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Correcting errors (fixing a misspelled name)&lt;/li&gt;
&lt;li&gt;When historical accuracy for that attribute doesn&apos;t matter&lt;/li&gt;
&lt;li&gt;When the attribute rarely changes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Tradeoff:&lt;/strong&gt; No history. If someone asks &amp;quot;How much revenue came from New York customers last quarter?&amp;quot; they get the wrong answer because customer 1042 is now labeled Chicago.&lt;/p&gt;
&lt;h2&gt;Type 2: Track Full History&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/06/scd-type-2.png&quot; alt=&quot;SCD Type 2 showing multiple rows for the same entity with effective and expiry dates&quot;&gt;&lt;/p&gt;
&lt;p&gt;Type 2 inserts a new row for each change. The original row is marked as expired, and the new row becomes the current version.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Original row (now expired)
-- customer_key: 1042, city: New York, effective_date: 2023-01-15, expiry_date: 2025-03-01, is_current: FALSE

-- New row
INSERT INTO dim_customers (customer_key, customer_id, city, effective_date, expiry_date, is_current)
VALUES (5001, 1042, &apos;Chicago&apos;, &apos;2025-03-01&apos;, &apos;9999-12-31&apos;, TRUE);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now each fact row references a specific version of the customer dimension. Sales from Q1 2024 reference customer_key 1042 (New York). Sales from Q2 2025 reference customer_key 5001 (Chicago). Historical reports are accurate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When to use Type 2:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When historical accuracy matters (most analytics use cases)&lt;/li&gt;
&lt;li&gt;When you need to analyze trends by attribute value over time&lt;/li&gt;
&lt;li&gt;When regulatory or audit requirements demand change tracking&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Tradeoff:&lt;/strong&gt; The dimension table grows. A customer who changes city three times has three rows. Queries must filter on &lt;code&gt;is_current = TRUE&lt;/code&gt; for current-state analysis, or join on date ranges for point-in-time analysis. This adds complexity to every query.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Surrogate keys are essential.&lt;/strong&gt; The natural business key (customer_id = 1042) appears in multiple rows. A surrogate key (customer_key, auto-incremented) uniquely identifies each version. Fact tables reference the surrogate key, not the natural key.&lt;/p&gt;
&lt;h2&gt;Type 3: Track One Change&lt;/h2&gt;
&lt;p&gt;Type 3 adds a column for the previous value instead of adding a row.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE dim_customers ADD COLUMN previous_city VARCHAR(100);

UPDATE dim_customers
SET previous_city = city, city = &apos;Chicago&apos;
WHERE customer_id = 1042;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The table now has both &lt;code&gt;city = &apos;Chicago&apos;&lt;/code&gt; and &lt;code&gt;previous_city = &apos;New York&apos;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When to use Type 3:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When you need quick access to both the current and immediately prior value&lt;/li&gt;
&lt;li&gt;When only one level of history matters&lt;/li&gt;
&lt;li&gt;When the dimension changes infrequently&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Tradeoff:&lt;/strong&gt; You only track one change deep. If the customer moves again, the previous value is overwritten. Type 3 is rarely used in practice because most use cases require either no history (Type 1) or full history (Type 2).&lt;/p&gt;
&lt;h2&gt;Choosing the Right Type&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;Type 1 (Overwrite)&lt;/th&gt;
&lt;th&gt;Type 2 (New Row)&lt;/th&gt;
&lt;th&gt;Type 3 (New Column)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;History preserved&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Full&lt;/td&gt;
&lt;td&gt;One level&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dimension growth&lt;/td&gt;
&lt;td&gt;No growth&lt;/td&gt;
&lt;td&gt;Grows over time&lt;/td&gt;
&lt;td&gt;No growth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query complexity&lt;/td&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;td&gt;Moderate (date filtering)&lt;/td&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Error corrections&lt;/td&gt;
&lt;td&gt;Trend analysis&lt;/td&gt;
&lt;td&gt;Before/after comparison&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage impact&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Implementation effort&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Most analytics organizations use &lt;strong&gt;Type 2 as the default&lt;/strong&gt; and Type 1 for error corrections. Type 3 is a niche choice for specific before/after reporting needs.&lt;/p&gt;
&lt;p&gt;In a lakehouse environment, Iceberg&apos;s time-travel feature provides an implicit form of historical tracking at the table level. You can query any past snapshot of a table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM dim_customers FOR SYSTEM_TIME AS OF &apos;2024-06-15T00:00:00&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This doesn&apos;t replace SCD Type 2 (which tracks attribute-level changes with effective dates), but it provides a safety net for point-in-time analysis.&lt;/p&gt;
&lt;p&gt;Platforms like &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt; support both approaches. SQL views can present a current-state view (filtering &lt;code&gt;WHERE is_current = TRUE&lt;/code&gt;) or an as-of view (joining on effective dates). Wikis document which SCD type each dimension uses, giving AI agents and analysts the context they need to write correct queries.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/06/scd-decision-guide.png&quot; alt=&quot;Choosing between SCD types based on reporting requirements and complexity tolerance&quot;&gt;&lt;/p&gt;
&lt;p&gt;Audit your dimension tables. For each one, decide: Does historical accuracy matter for this attribute? If yes, implement Type 2. If the attribute changes rarely and history doesn&apos;t matter, Type 1 is sufficient. Document your choice , when the next engineer encounters the dimension, they need to know whether they&apos;re looking at current state or historical versions.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The Role of the Semantic Layer in Data Governance</title><link>https://iceberglakehouse.com/posts/2026-02-sl-semantic-layer-data-governance/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-sl-semantic-layer-data-governance/</guid><description>
![Data governance through a semantic layer : centralized policies and documentation](/assets/images/semantic_layer/06/governance-semantic.png)

Most ...</description><pubDate>Wed, 18 Feb 2026 14:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/06/governance-semantic.png&quot; alt=&quot;Data governance through a semantic layer : centralized policies and documentation&quot;&gt;&lt;/p&gt;
&lt;p&gt;Most organizations have a data governance policy. It lives in a Confluence page. It defines who owns what data, what terms mean, and who should have access. And almost nobody follows it, because it&apos;s not enforced where queries actually run.&lt;/p&gt;
&lt;p&gt;A semantic layer changes that. It moves governance from a document into the query path, where every rule is applied automatically, for every user, through every tool.&lt;/p&gt;
&lt;h2&gt;Governance on Paper vs. Governance in Practice&lt;/h2&gt;
&lt;p&gt;Data governance fails when it depends on people doing the right thing manually. A policy says &amp;quot;Revenue means completed orders minus refunds.&amp;quot; An analyst writes a slightly different formula. A dashboard uses the wrong table. An AI agent invents its own definition. The governance policy exists. Nobody follows it. And the organization makes decisions on inconsistent data.&lt;/p&gt;
&lt;p&gt;The root cause isn&apos;t that people are careless. It&apos;s that governance is separated from the systems people actually use to query data. Enforcement happens in a side channel :  documentation, review processes, audit logs ,  not in the query itself.&lt;/p&gt;
&lt;h2&gt;Centralized Definitions Eliminate Conflicting Metrics&lt;/h2&gt;
&lt;p&gt;A semantic layer solves the definition problem by making the governance policy code.&lt;/p&gt;
&lt;p&gt;Revenue isn&apos;t a paragraph in a wiki. It&apos;s a SQL view:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW business.revenue AS
SELECT
    OrderDate,
    Region,
    SUM(OrderTotal) AS Revenue
FROM silver.orders_enriched
WHERE Status = &apos;completed&apos; AND Refunded = FALSE
GROUP BY OrderDate, Region;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Every dashboard, notebook, and AI agent that needs Revenue queries this view. There&apos;s no alternative formula to use. The semantic layer IS the governance for this metric.&lt;/p&gt;
&lt;p&gt;When the definition changes (say, a new refund category is added), the view is updated once, and every consumer gets the new logic automatically. No rollout. No migration. No &amp;quot;did everyone update their dashboard?&amp;quot;&lt;/p&gt;
&lt;h2&gt;Access Policies Enforced at Query Time&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/06/governance-enforcement.png&quot; alt=&quot;All query paths routing through a single governance enforcement gate&quot;&gt;&lt;/p&gt;
&lt;p&gt;The second governance gap: access control. Most organizations enforce security at the BI tool level. Tableau restricts who sees which dashboard. Power BI applies row-level filters. But if someone opens a SQL client and queries the underlying table directly, those filters don&apos;t apply.&lt;/p&gt;
&lt;p&gt;A semantic layer enforces policies at a lower level. When access control exists in the semantic layer, it applies to every query path:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query Path&lt;/th&gt;
&lt;th&gt;BI-Level Security&lt;/th&gt;
&lt;th&gt;Semantic Layer Security&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dashboard&lt;/td&gt;
&lt;td&gt;Enforced&lt;/td&gt;
&lt;td&gt;Enforced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL notebook&lt;/td&gt;
&lt;td&gt;Not enforced&lt;/td&gt;
&lt;td&gt;Enforced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI agent&lt;/td&gt;
&lt;td&gt;Not enforced&lt;/td&gt;
&lt;td&gt;Enforced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API/programmatic access&lt;/td&gt;
&lt;td&gt;Not enforced&lt;/td&gt;
&lt;td&gt;Enforced&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Dremio implements this through &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Fine-Grained Access Control (FGAC)&lt;/a&gt;: policies defined as UDFs that filter rows and mask columns based on the querying user&apos;s role. These policies are applied at the virtual dataset (view) level. A regional manager queries &lt;code&gt;business.revenue&lt;/code&gt; and sees only their region. A data engineer sees all regions. Same view, same SQL, different results based on identity.&lt;/p&gt;
&lt;p&gt;This approach eliminates the &amp;quot;security gap&amp;quot; that appears when users bypass BI tools. Every route to the data flows through the semantic layer. Every route inherits the policies.&lt;/p&gt;
&lt;h2&gt;Lineage and Accountability Through Views&lt;/h2&gt;
&lt;p&gt;The layered view architecture (Bronze → Silver → Gold) that a semantic layer uses is inherently traceable. Every Gold metric traces back to its Silver business logic, which traces back to the Bronze source mapping, which traces back to raw data.&lt;/p&gt;
&lt;p&gt;This traceability matters for compliance. When an auditor asks &amp;quot;Where does your Revenue number come from?&amp;quot;, you don&apos;t search through dashboards and notebooks. You follow the view chain:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;gold.monthly_revenue_by_region&lt;/code&gt; → references &lt;code&gt;silver.orders_enriched&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;silver.orders_enriched&lt;/code&gt; → joins &lt;code&gt;bronze.orders_raw&lt;/code&gt; with &lt;code&gt;bronze.customers_raw&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bronze.orders_raw&lt;/code&gt; → maps to &lt;code&gt;production.public.orders&lt;/code&gt; in PostgreSQL&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Every step is documented. Every transformation is visible. The lineage isn&apos;t reconstructed after the fact : it&apos;s structural.&lt;/p&gt;
&lt;h2&gt;Documentation as a Governance Tool&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/06/governance-labels.png&quot; alt=&quot;Data governance labels and tags applied to tables for compliance&quot;&gt;&lt;/p&gt;
&lt;p&gt;Governance is also about discoverability. Can someone find the right dataset without messaging five people? Can they tell whether a view is production-ready or experimental?&lt;/p&gt;
&lt;p&gt;Two mechanisms handle this in a semantic layer:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Wikis&lt;/strong&gt; attach human-readable (and AI-readable) descriptions to tables, columns, and views. They explain what data represents, where it comes from, and any caveats. A column named &lt;code&gt;cltv&lt;/code&gt; gets a description: &amp;quot;Customer Lifetime Value, calculated as total revenue from first purchase to current date, excluding refunds.&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Labels&lt;/strong&gt; categorize data for governance workflows. A label like &amp;quot;PII&amp;quot; triggers automatic column masking. A label like &amp;quot;Certified&amp;quot; indicates the view has been reviewed and approved for production use. A label like &amp;quot;Deprecated&amp;quot; warns consumers to migrate to the replacement.&lt;/p&gt;
&lt;p&gt;For organizations with thousands of datasets, manual documentation is impractical. Dremio&apos;s generative AI auto-generates Wiki descriptions by sampling table data and suggests Labels based on column content. This bootstraps documentation to 70% coverage automatically. The data team fills in what the AI misses.&lt;/p&gt;
&lt;h2&gt;Certification and Change Management&lt;/h2&gt;
&lt;p&gt;Not all views are equal. A semantic layer should distinguish between views that are experimental, under review, and production-ready.&lt;/p&gt;
&lt;p&gt;A practical certification workflow:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Draft&lt;/strong&gt;: New view created by an analyst. Not yet reviewed. Labeled &amp;quot;Draft.&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reviewed&lt;/strong&gt;: View reviewed by the data team. Business logic validated. Documentation complete.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Certified&lt;/strong&gt;: View approved for production use. Labeled &amp;quot;Certified.&amp;quot; Available in production dashboards and to AI agents.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each Certified view should have a documented owner : the person accountable for its accuracy and freshness. When business requirements change, the owner updates the view and documentation together. Changes are reviewed before the &amp;quot;Certified&amp;quot; label is reapplied.&lt;/p&gt;
&lt;p&gt;This workflow doesn&apos;t require advanced tooling. Labels, Wikis, and a team agreement on the process are sufficient. What matters is that governance is visible inside the semantic layer, not tracked in a separate system.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Audit your top 10 business metrics. For each one, ask three questions: Is the formula defined in one place? Is access control enforced at the query level (not just the BI tool)? Can you trace the number back to its raw source in under 60 seconds? Every &amp;quot;no&amp;quot; is a governance gap that a semantic layer closes.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Schema Evolution Without Breaking Consumers</title><link>https://iceberglakehouse.com/posts/2026-02-debp-schema-evolution/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-debp-schema-evolution/</guid><description>
![Schema as a contract between producers and consumers with version tracking](/assets/images/debp/05/schema-contract.png)

A source team renames a co...</description><pubDate>Wed, 18 Feb 2026 13:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/debp/05/schema-contract.png&quot; alt=&quot;Schema as a contract between producers and consumers with version tracking&quot;&gt;&lt;/p&gt;
&lt;p&gt;A source team renames a column from &lt;code&gt;user_id&lt;/code&gt; to &lt;code&gt;customer_id&lt;/code&gt;. Twelve hours later, five dashboards show blank values, two ML pipelines fail, and the data engineering team spends the morning tracing a problem that could have been prevented with one rule: treat your schema like an API.&lt;/p&gt;
&lt;p&gt;Schema evolution is the practice of changing data structures without breaking the systems that depend on them. Get it right, and your data platform stays flexible. Get it wrong, and every schema change becomes an emergency.&lt;/p&gt;
&lt;h2&gt;Your Schema Is an API&lt;/h2&gt;
&lt;p&gt;When an application team changes a REST API endpoint, they version it. They deprecate the old version. They give consumers time to migrate. They don&apos;t silently rename fields and hope nobody notices.&lt;/p&gt;
&lt;p&gt;Data schemas deserve the same discipline. Your columns are fields. Your tables are endpoints. Your downstream consumers :  dashboards, ML pipelines, reports, other pipelines ,  are API clients. When you change the schema, you change the contract.&lt;/p&gt;
&lt;p&gt;The difference: API changes are usually intentional and reviewed. Schema changes often happen accidentally : a source system updates its export format, an engineer renames a column for readability, a new data type is introduced. Without guardrails, these changes propagate downstream silently.&lt;/p&gt;
&lt;h2&gt;Safe vs. Breaking Changes&lt;/h2&gt;
&lt;p&gt;Not all schema changes carry the same risk:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Backward-compatible (safe) changes:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Adding a new optional column with a default value&lt;/li&gt;
&lt;li&gt;Widening a data type (INT to BIGINT, FLOAT to DOUBLE)&lt;/li&gt;
&lt;li&gt;Adding documentation or metadata to columns&lt;/li&gt;
&lt;li&gt;Reordering columns (if consumers reference by name, not position)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Breaking changes:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Removing a column that consumers reference&lt;/li&gt;
&lt;li&gt;Renaming a column without maintaining the old name&lt;/li&gt;
&lt;li&gt;Narrowing a data type (BIGINT to INT : values may overflow)&lt;/li&gt;
&lt;li&gt;Changing the semantic meaning of a column (e.g., &lt;code&gt;revenue&lt;/code&gt; from gross to net)&lt;/li&gt;
&lt;li&gt;Changing nullability (nullable to non-nullable breaks inserts with nulls)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The rule: backward-compatible changes can be deployed without coordination. Breaking changes require a migration plan.&lt;/p&gt;
&lt;h2&gt;The Additive-Only Pattern&lt;/h2&gt;
&lt;p&gt;The simplest schema evolution strategy: never remove or rename columns. Only add new ones.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/05/additive-evolution.png&quot; alt=&quot;Additive schema evolution: columns only added, never removed or renamed&quot;&gt;&lt;/p&gt;
&lt;p&gt;When a column needs to be replaced:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Add the new column alongside the old one&lt;/li&gt;
&lt;li&gt;Update producers to populate both columns&lt;/li&gt;
&lt;li&gt;Migrate consumers to the new column one at a time&lt;/li&gt;
&lt;li&gt;Once all consumers have migrated, mark the old column as deprecated&lt;/li&gt;
&lt;li&gt;Remove the old column only after a deprecation period (e.g., 90 days)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This pattern is boring : and that&apos;s the point. Boring is reliable. Adding a column never breaks existing queries. Consumers that don&apos;t need the new column ignore it. Consumers that do need it can adopt it on their own schedule.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tradeoff:&lt;/strong&gt; Table width grows over time. Schemas accumulate deprecated columns. This is an acceptable cost compared to production outages.&lt;/p&gt;
&lt;h2&gt;Schema Versioning and Migration&lt;/h2&gt;
&lt;p&gt;For changes that can&apos;t be additive (fundamental restructuring, data model migrations), use explicit versioning:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Version in the table name.&lt;/strong&gt; &lt;code&gt;customers_v1&lt;/code&gt;, &lt;code&gt;customers_v2&lt;/code&gt; coexist. Consumers migrate from v1 to v2 at their own pace. A view named &lt;code&gt;customers&lt;/code&gt; points to the current version.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Version in metadata.&lt;/strong&gt; Store a schema version field in each record or partition. Consumers check the version and apply the appropriate parsing logic.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Schema registries.&lt;/strong&gt; Centralized systems that store and validate schemas. Producers register their schema. Consumers declare their expected schema. The registry checks compatibility and rejects breaking changes.&lt;/p&gt;
&lt;p&gt;Schema registries enforce rules automatically:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;BACKWARD compatible: new schema can read data written by old schema&lt;/li&gt;
&lt;li&gt;FORWARD compatible: old schema can read data written by new schema&lt;/li&gt;
&lt;li&gt;FULL compatible: both backward and forward compatible&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Contract Enforcement at Pipeline Boundaries&lt;/h2&gt;
&lt;p&gt;Don&apos;t rely on conventions (&amp;quot;we don&apos;t rename columns&amp;quot;). Enforce contracts programmatically at pipeline boundaries:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At ingestion.&lt;/strong&gt; Compare the incoming data schema against the expected schema. If columns are missing, added, or retyped, log the difference and alert. For safe changes, proceed and notify. For breaking changes, halt and quarantine.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At transformation.&lt;/strong&gt; Validate that every column referenced in SQL or transformation logic exists in the input schema. Catch missing-column errors at validation time, not at runtime.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At serving.&lt;/strong&gt; Validate that output schemas match the contracts expected by consumers. If a downstream dashboard expects column &lt;code&gt;revenue&lt;/code&gt;, verify it exists and has the correct type before the pipeline marks the job as successful.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/05/contract-enforcement.png&quot; alt=&quot;Contract enforcement at pipeline boundaries: ingestion, transformation, serving&quot;&gt;&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Document the schema of your five most critical tables: column names, types, nullability, and a one-line description. That&apos;s your version 1 contract. Set up an automated check that compares incoming data against this contract and alerts on any deviation. You&apos;ll catch the next breaking change before it breaks anything.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Dimensional Modeling: Facts, Dimensions, and Grains</title><link>https://iceberglakehouse.com/posts/2026-02-dm-dimensional-modeling/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-dm-dimensional-modeling/</guid><description>
![Dimensional model showing a central fact table connected to surrounding dimension tables](/assets/images/data_modeling/05/dimensional-modeling.png)...</description><pubDate>Wed, 18 Feb 2026 13:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/05/dimensional-modeling.png&quot; alt=&quot;Dimensional model showing a central fact table connected to surrounding dimension tables&quot;&gt;&lt;/p&gt;
&lt;p&gt;Dimensional modeling is the most widely used approach for organizing analytics data. Developed by Ralph Kimball, it structures data into two types of tables: facts (what happened) and dimensions (the context around what happened). The technique optimizes for query speed and business readability, not for storage efficiency or transactional integrity.&lt;/p&gt;
&lt;p&gt;If your goal is to answer business questions quickly and consistently, dimensional modeling is where you start.&lt;/p&gt;
&lt;h2&gt;Facts and Dimensions: The Two Building Blocks&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Fact tables&lt;/strong&gt; store measurable events. Each row represents something that happened: a sale, a click, a shipment, a login. Fact tables are narrow (a few foreign keys and numeric measures) and deep (millions or billions of rows).&lt;/p&gt;
&lt;p&gt;A typical sales fact table might look like:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE fact_sales (
    sale_id BIGINT,
    date_key INT,
    customer_key INT,
    product_key INT,
    store_key INT,
    quantity INT,
    unit_price DECIMAL(10,2),
    total_amount DECIMAL(12,2)
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Dimension tables&lt;/strong&gt; provide context. They describe the &amp;quot;who, what, where, when, and how&amp;quot; behind each fact. Dimension tables are wide (many descriptive columns) and shallow (thousands to millions of rows).&lt;/p&gt;
&lt;p&gt;A customer dimension might include: customer_name, email, signup_date, city, state, country, segment, lifetime_value, acquisition_channel.&lt;/p&gt;
&lt;p&gt;Every analysis query joins a fact table to one or more dimension tables. &amp;quot;Revenue by region&amp;quot; joins the sales fact to the geography dimension. &amp;quot;Revenue by product category&amp;quot; joins the sales fact to the product dimension. The fact table provides the number; the dimensions provide the labels.&lt;/p&gt;
&lt;h2&gt;Declaring the Grain&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/05/grain-declaration.png&quot; alt=&quot;Grain declaration as the foundation : one row per transaction per line item&quot;&gt;&lt;/p&gt;
&lt;p&gt;The grain is the most important decision in dimensional modeling. It declares what one row in your fact table represents.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;quot;One row per order line item&amp;quot; : each product within an order gets its own row&lt;/li&gt;
&lt;li&gt;&amp;quot;One row per daily customer session&amp;quot; : each customer&apos;s daily activity is aggregated into one row&lt;/li&gt;
&lt;li&gt;&amp;quot;One row per monthly account balance&amp;quot; : snapshot taken once per month&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Getting the grain right matters because:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Too coarse: You lose detail. If your grain is &amp;quot;one row per order&amp;quot; you can&apos;t analyze individual line items.&lt;/li&gt;
&lt;li&gt;Too fine: You create an enormous table that&apos;s expensive to query. If your grain is &amp;quot;one row per page view&amp;quot; in a high-traffic application, the table grows by billions of rows per month.&lt;/li&gt;
&lt;li&gt;Inconsistent: If some rows represent individual items and others represent aggregated totals, every calculation produces wrong results.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Declare the grain first. Then identify which dimensions apply at that grain, and which numeric measures belong in the fact table. This order is not optional : skip it, and the model breaks down.&lt;/p&gt;
&lt;h2&gt;Designing Fact Tables&lt;/h2&gt;
&lt;p&gt;Three types of fact tables handle different analytical patterns:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Transaction facts&lt;/strong&gt; record individual events. One row per sale, one row per click. This is the most common type. It supports the most detailed analysis but produces the largest tables.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Periodic snapshot facts&lt;/strong&gt; capture the state at regular intervals. One row per account per month. Useful for balance-tracking, inventory levels, and any measure that accumulates over time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Accumulating snapshot facts&lt;/strong&gt; track the lifecycle of a process. One row per order, with date columns for each milestone (order_placed, payment_received, shipped, delivered). Useful for analyzing process efficiency and bottleneck identification.&lt;/p&gt;
&lt;p&gt;Best practices for fact tables:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Keep facts additive when possible (SUM-able across dimensions)&lt;/li&gt;
&lt;li&gt;Avoid storing text in fact tables , that belongs in dimensions&lt;/li&gt;
&lt;li&gt;Use surrogate keys (integers) for dimension references, not natural keys&lt;/li&gt;
&lt;li&gt;Never mix grains in one fact table&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Designing Dimension Tables&lt;/h2&gt;
&lt;p&gt;Well-designed dimensions follow predictable patterns:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Denormalize.&lt;/strong&gt; Include all descriptive attributes in one table. Product name, category, subcategory, brand, manufacturer, department : all in dim_products. This eliminates joins and makes queries readable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use surrogate keys.&lt;/strong&gt; Assign an integer key (product_key) that acts as the primary key. Keep the natural business key (product_sku) as a regular attribute. Surrogate keys insulate your model from source system key changes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Add audit columns.&lt;/strong&gt; Include effective_date, expiry_date, and is_current flag for tracking changes over time (Slowly Changing Dimensions : covered in a separate article).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Include &amp;quot;junk&amp;quot; dimensions.&lt;/strong&gt; Low-cardinality flags and indicators (is_promotional, is_online, payment_type) can be combined into a single &amp;quot;junk dimension&amp;quot; instead of cluttering the fact table.&lt;/p&gt;
&lt;h2&gt;Conformed Dimensions&lt;/h2&gt;
&lt;p&gt;A conformed dimension is shared across multiple fact tables. The best example is the Date dimension : every fact table references dates, and they should all use the same date dimension to ensure consistent filtering and grouping.&lt;/p&gt;
&lt;p&gt;Other conformed dimensions: Customer, Product, Employee, Geography. When Sales and Support both reference the same dim_customers table, you can analyze customer behavior across both domains without reconciling different customer definitions.&lt;/p&gt;
&lt;p&gt;Conformed dimensions are the connective tissue of a dimensional model. Without them, each fact table exists in isolation.&lt;/p&gt;
&lt;p&gt;Platforms like &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt; support dimensional modeling through virtual datasets. Fact and dimension views live in the Silver layer of a Medallion Architecture. Conformed dimensions are defined once and referenced by multiple fact views. Wikis document what each dimension attribute means, and AI agents use that documentation to generate accurate queries.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/05/conformed-dimensions.png&quot; alt=&quot;Conformed dimensions shared across multiple fact tables in a unified model&quot;&gt;&lt;/p&gt;
&lt;p&gt;Start your dimensional model with one business process : the one your team queries most. Declare the grain. Identify the dimensions. Build the fact table. Then expand: pick the next business process, reuse the conformed dimensions, and add new ones as needed.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Why Your AI Initiatives Fail Without a Semantic Layer</title><link>https://iceberglakehouse.com/posts/2026-02-sl-why-ai-fails-without-semantic-layer/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-sl-why-ai-fails-without-semantic-layer/</guid><description>
![AI with vs without a semantic layer : failure modes and fixes](/assets/images/semantic_layer/05/ai-semantic-layer.png)

Your team builds an AI agen...</description><pubDate>Wed, 18 Feb 2026 13:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/05/ai-semantic-layer.png&quot; alt=&quot;AI with vs without a semantic layer : failure modes and fixes&quot;&gt;&lt;/p&gt;
&lt;p&gt;Your team builds an AI agent. It connects to your data warehouse. A product manager types &amp;quot;What was revenue last quarter?&amp;quot; and gets a number. The number is wrong. Nobody knows it&apos;s wrong until Finance runs the same query manually and gets a different result.&lt;/p&gt;
&lt;p&gt;This happens constantly. And the problem isn&apos;t the AI model. It&apos;s the missing layer between the model and your data.&lt;/p&gt;
&lt;h2&gt;The Promise vs. the Reality&lt;/h2&gt;
&lt;p&gt;Natural language analytics is the most requested feature in every data platform survey. Business users want to ask questions in plain English and get accurate answers. No SQL. No tickets. No waiting.&lt;/p&gt;
&lt;p&gt;The technology exists. Large language models can generate SQL from natural language with impressive accuracy. But accuracy on syntax isn&apos;t accuracy on meaning. An LLM can write grammatically correct SQL that returns the wrong answer because it doesn&apos;t understand your business definitions.&lt;/p&gt;
&lt;p&gt;A semantic layer provides those definitions. Without one, AI analytics is a demonstration that works in a meeting but fails in production.&lt;/p&gt;
&lt;h2&gt;Five Ways AI Goes Wrong Without a Semantic Layer&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/05/ai-hallucination.png&quot; alt=&quot;AI agent confused by raw data : hallucinating metrics and joins&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Metric Hallucination&lt;/h3&gt;
&lt;p&gt;Your LLM decides that Revenue = &lt;code&gt;SUM(amount)&lt;/code&gt; from the &lt;code&gt;transactions&lt;/code&gt; table. But your actual Revenue formula is &lt;code&gt;SUM(order_total) WHERE status = &apos;completed&apos; AND refunded = FALSE&lt;/code&gt; from the &lt;code&gt;orders&lt;/code&gt; table. The AI&apos;s number is plausible. It&apos;s also wrong by 15%. Nobody catches it because it looks reasonable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Canonical metric definitions in virtual datasets. The AI references the view, not its own invented formula.&lt;/p&gt;
&lt;h3&gt;Join Confusion&lt;/h3&gt;
&lt;p&gt;There are three paths from &lt;code&gt;orders&lt;/code&gt; to &lt;code&gt;customers&lt;/code&gt;: via &lt;code&gt;customer_id&lt;/code&gt;, via &lt;code&gt;billing_address_id&lt;/code&gt;, and via &lt;code&gt;shipping_address_id&lt;/code&gt;. For revenue analysis, you want &lt;code&gt;customer_id&lt;/code&gt;. The LLM picks &lt;code&gt;billing_address_id&lt;/code&gt; because it seems logical. The numbers are close enough that the mistake survives review.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Pre-defined join relationships in the semantic model. The AI follows the approved path.&lt;/p&gt;
&lt;h3&gt;Column Misinterpretation&lt;/h3&gt;
&lt;p&gt;A column called &lt;code&gt;date&lt;/code&gt; appears in the &lt;code&gt;orders&lt;/code&gt; table. Is it the order date, ship date, or invoice date? The LLM guesses order date. It&apos;s actually the ship date. Every time-based query is off by 2-5 days.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Wiki descriptions on every column. The semantic layer tells the AI that &lt;code&gt;date&lt;/code&gt; is &lt;code&gt;ShipDate&lt;/code&gt; and &lt;code&gt;OrderDate&lt;/code&gt; is the field to use for time-based revenue analysis.&lt;/p&gt;
&lt;h3&gt;Security Bypass&lt;/h3&gt;
&lt;p&gt;Your BI dashboard applies row-level security so regional managers only see their region&apos;s data. The AI agent queries the raw table directly, bypassing the BI layer. A regional manager asks about &amp;quot;their&amp;quot; revenue and sees the entire company&apos;s numbers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Fine-Grained Access Control enforced at the semantic layer. The AI queries views, not raw tables. Security policies travel with the data regardless of the access path.&lt;/p&gt;
&lt;h3&gt;Inconsistent Results&lt;/h3&gt;
&lt;p&gt;The same question asked twice generates different SQL because the LLM&apos;s output is probabilistic. Monday&apos;s answer: $4.2M. Wednesday&apos;s answer: $4.5M. Both are &amp;quot;correct&amp;quot; given the SQL generated. Neither matches Finance&apos;s number.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Deterministic definitions in the semantic layer. The same question always resolves to the same view, the same formula, the same result.&lt;/p&gt;
&lt;h2&gt;How a Semantic Layer Grounds AI&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/05/ai-with-context.png&quot; alt=&quot;AI agent successfully using a semantic layer to produce accurate results&quot;&gt;&lt;/p&gt;
&lt;p&gt;Each failure maps to a specific semantic layer component:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure Mode&lt;/th&gt;
&lt;th&gt;Semantic Layer Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Metric hallucination&lt;/td&gt;
&lt;td&gt;Virtual datasets with canonical formulas&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Join confusion&lt;/td&gt;
&lt;td&gt;Pre-defined join relationships&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Column misinterpretation&lt;/td&gt;
&lt;td&gt;Wiki descriptions on every field&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security bypass&lt;/td&gt;
&lt;td&gt;Access policies enforced at the view level&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inconsistent results&lt;/td&gt;
&lt;td&gt;Deterministic definitions (same question = same SQL)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This is why platforms that take AI analytics seriously embed the semantic layer directly into the query engine. &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&apos;s approach&lt;/a&gt; combines virtual datasets, Wikis, Labels, and Fine-Grained Access Control into a single layer that both humans and AI agents consume. The AI doesn&apos;t just generate SQL. It consults the semantic layer to understand what the data means, which formulas to use, and what the querying user is allowed to see.&lt;/p&gt;
&lt;h2&gt;What AI-Ready Architecture Looks Like&lt;/h2&gt;
&lt;p&gt;An AI-ready data platform doesn&apos;t just connect an LLM to a database. It puts a structured context layer in between:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Semantic layer&lt;/strong&gt; defines metrics, documents columns, and enforces security&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI agent&lt;/strong&gt; reads the semantic layer to understand business context&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Query engine&lt;/strong&gt; executes the AI-generated SQL with full optimization (caching, reflections, pushdowns)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Results&lt;/strong&gt; are returned in business terms through the same interface humans use&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Without step 1, the AI is just a SQL autocomplete tool with no business understanding. It writes syntactically valid queries that produce semantically wrong answers. The semantic layer is the difference between a toy demo and a production-grade AI analytics system.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;If your AI analytics initiative is producing unreliable results, don&apos;t upgrade the model. Audit the context the model has access to. Can it read your metric definitions? Column descriptions? Security policies? If the answer is no, the fix isn&apos;t a better LLM. It&apos;s a semantic layer.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Idempotent Pipelines: Build Once, Run Safely Forever</title><link>https://iceberglakehouse.com/posts/2026-02-debp-idempotent-pipelines/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-debp-idempotent-pipelines/</guid><description>
![Pipeline running multiple times and converging to the same result](/assets/images/debp/04/idempotent-pipeline.png)

A pipeline runs, processes 100,...</description><pubDate>Wed, 18 Feb 2026 12:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/debp/04/idempotent-pipeline.png&quot; alt=&quot;Pipeline running multiple times and converging to the same result&quot;&gt;&lt;/p&gt;
&lt;p&gt;A pipeline runs, processes 100,000 records, and loads them into the target table. Then it fails on a downstream step. The orchestrator retries the entire job. Now the table has 200,000 records : 100,000 of them duplicates. Revenue reports double. Dashboards misfire. Someone spends the next four hours manually deduplicating records and explaining to stakeholders why the numbers were wrong.&lt;/p&gt;
&lt;p&gt;This is the cost of not building idempotent pipelines.&lt;/p&gt;
&lt;h2&gt;What Idempotency Means for Pipelines&lt;/h2&gt;
&lt;p&gt;An idempotent operation produces the same result no matter how many times you execute it. For data pipelines, that means: running the same job twice :  or ten times ,  leaves the target data in the exact same state as running it once.&lt;/p&gt;
&lt;p&gt;This property matters because retries are inevitable. Orchestrators retry failed tasks. Backfill jobs reprocess historical data. Network glitches cause at-least-once delivery. Engineers manually rerun jobs during debugging. Without idempotency, every one of these events risks data corruption.&lt;/p&gt;
&lt;p&gt;Idempotency is not about preventing retries. It&apos;s about making retries safe.&lt;/p&gt;
&lt;h2&gt;The Partition Overwrite Pattern&lt;/h2&gt;
&lt;p&gt;The simplest and most reliable idempotency pattern for batch pipelines: overwrite the entire partition.&lt;/p&gt;
&lt;p&gt;Instead of appending rows, your pipeline replaces the complete partition for the time period being processed. For a daily pipeline processing January 15th:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Delete existing data for this partition
DELETE FROM target_table WHERE event_date = &apos;2024-01-15&apos;;

-- Insert fresh data for this partition
INSERT INTO target_table
SELECT * FROM staging_table WHERE event_date = &apos;2024-01-15&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the job reruns, it deletes and recreates the same partition : resulting in the same data. Many table formats support INSERT OVERWRITE or REPLACE PARTITION as an atomic operation, which is even safer because it avoids a window where the partition is empty.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; Daily, hourly, or other time-partitioned batch pipelines. This covers the majority of data warehouse loading patterns.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Limitation:&lt;/strong&gt; You need a clear partitioning key. For non-time-series data, partition overwrite may not apply.&lt;/p&gt;
&lt;h2&gt;The Upsert/MERGE Pattern&lt;/h2&gt;
&lt;p&gt;For data that doesn&apos;t partition cleanly :  or for change data capture (CDC) workloads ,  use MERGE (also called upsert):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;MERGE INTO target_table t
USING staging_table s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
  t.status = s.status,
  t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, status, updated_at)
VALUES (s.order_id, s.status, s.updated_at);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/04/merge-pattern.png&quot; alt=&quot;Merge pattern: staging records matched against target by business key, updating or inserting&quot;&gt;&lt;/p&gt;
&lt;p&gt;If the merge runs twice with the same staging data, the result is identical. Existing records update to the same values. New records insert once because they already exist on the second run.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; CDC pipelines, entity-centric data (customers, products, accounts), and slowly changing dimensions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Requirement:&lt;/strong&gt; A reliable business key that uniquely identifies each record. Without one, merges produce inconsistent results.&lt;/p&gt;
&lt;h2&gt;Event Deduplication for Streaming&lt;/h2&gt;
&lt;p&gt;Streaming systems typically guarantee at-least-once delivery, which means the same event can be delivered and processed multiple times. Your pipeline needs to handle this.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Process-level deduplication.&lt;/strong&gt; Maintain a set (in-memory, in a key-value store, or in the target database) of recently processed event IDs. Before processing each event, check if its ID has been seen. Skip duplicates.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Write-level deduplication.&lt;/strong&gt; Use MERGE or conditional INSERT:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO events (event_id, payload, processed_at)
SELECT event_id, payload, NOW()
FROM incoming_events
WHERE event_id NOT IN (SELECT event_id FROM events);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Windowed deduplication.&lt;/strong&gt; For high-volume streams, maintain dedup state only for a window (e.g., last 24 hours). Events outside the window are assumed to be unique : a practical tradeoff between memory usage and dedup completeness.&lt;/p&gt;
&lt;h2&gt;Anti-Patterns That Break Idempotency&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Blind INSERT/APPEND.&lt;/strong&gt; Every retry adds duplicate rows. This is the default behavior in most systems and the most common cause of data inflation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Auto-incrementing surrogate keys.&lt;/strong&gt; If your pipeline generates IDs at processing time (not from the source data), duplicates get different IDs and look like distinct records.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Timestamps as dedup keys.&lt;/strong&gt; Using &lt;code&gt;processed_at&lt;/code&gt; as part of the primary key means the same source record processed at different times produces different target records.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;We&apos;ll dedup later.&amp;quot;&lt;/strong&gt; Deferring deduplication to a cleanup job means every consumer between the load and the cleanup sees dirty data.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/04/idempotency-antipatterns.png&quot; alt=&quot;Anti-patterns: blind append creating duplicates, timestamp-based keys, deferred dedup&quot;&gt;&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Identify your five most frequently retried or backfilled pipelines. Check whether they use INSERT or MERGE. If they use INSERT, switch to partition overwrite or MERGE. Run the pipeline twice intentionally and verify the target table has the same row count both times. That&apos;s your idempotency test.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Data Modeling for the Lakehouse: What Changes</title><link>https://iceberglakehouse.com/posts/2026-02-dm-data-modeling-lakehouse/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-dm-data-modeling-lakehouse/</guid><description>
![Traditional data warehouse model vs. open lakehouse model with flexible schema and views](/assets/images/data_modeling/04/lakehouse-data-modeling.p...</description><pubDate>Wed, 18 Feb 2026 12:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/04/lakehouse-data-modeling.png&quot; alt=&quot;Traditional data warehouse model vs. open lakehouse model with flexible schema and views&quot;&gt;&lt;/p&gt;
&lt;p&gt;Traditional data modeling assumed you controlled the database. You defined schemas up front, enforced foreign keys at write time, and optimized with indexes. The lakehouse changes every one of those assumptions.&lt;/p&gt;
&lt;p&gt;Data lives in open file formats on object storage. Schemas evolve without rewriting data. Queries run through engines that may not enforce relational constraints. The modeling discipline is the same, but the mechanics are different.&lt;/p&gt;
&lt;h2&gt;What&apos;s Different About a Lakehouse&lt;/h2&gt;
&lt;p&gt;A lakehouse stores data as files :  typically Parquet ,  on object storage like S3 or Azure Blob. An open table format like Apache Iceberg adds structure: schema definitions, partition metadata, snapshot history, and transactional guarantees.&lt;/p&gt;
&lt;p&gt;This architecture gives you more flexibility than a traditional RDBMS, but also more responsibility. There are no foreign key constraints enforced at write time. No triggers. No stored procedures. Referential integrity is your problem to solve in pipelines and views, not something the storage engine handles for you.&lt;/p&gt;
&lt;p&gt;The tradeoff is worth it: open formats, engine portability, cheap storage, and the ability to run multiple compute engines (Spark, Dremio, Flink, Trino) against the same data.&lt;/p&gt;
&lt;h2&gt;Schema-on-Read Changes the Rules&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/04/schema-on-read.png&quot; alt=&quot;Schema-on-write rigid table vs. schema-on-read flexible view layer&quot;&gt;&lt;/p&gt;
&lt;p&gt;In a traditional warehouse, you define the schema before writing data (schema-on-write). Every row must conform to the schema or the insert fails. This guarantees consistency but makes changes expensive. Adding a column means an ALTER TABLE. Changing a data type might require rewriting the entire table.&lt;/p&gt;
&lt;p&gt;In a lakehouse, you can also store data first and apply structure at query time (schema-on-read). Iceberg supports schema evolution natively : add columns, rename columns, widen data types, and reorder fields without rewriting underlying files.&lt;/p&gt;
&lt;p&gt;This flexibility changes how you model:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Bronze layer&lt;/strong&gt;: Accept data as-is from sources. Apply minimal typing. Don&apos;t reject records that don&apos;t match a rigid schema.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Silver layer&lt;/strong&gt;: Apply business logic, joins, and type enforcement through SQL views.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gold layer&lt;/strong&gt;: Serve consumption-ready datasets with stable, documented schemas.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The model evolves at the view layer, not the storage layer. This makes iteration faster and migration cheaper.&lt;/p&gt;
&lt;h2&gt;The Medallion Architecture as a Modeling Pattern&lt;/h2&gt;
&lt;p&gt;The Medallion Architecture (Bronze → Silver → Gold) is the most common data modeling pattern in lakehouse environments. Each layer is a set of SQL views or managed tables:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Bronze (Preparation):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Maps raw source data to typed columns&lt;/li&gt;
&lt;li&gt;Renames ambiguous column names&lt;/li&gt;
&lt;li&gt;Applies basic data type casting&lt;/li&gt;
&lt;li&gt;One view per source table&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Silver (Business Logic):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Joins related entities (orders + customers + products)&lt;/li&gt;
&lt;li&gt;Applies business rules (revenue = quantity × price WHERE status = &apos;completed&apos;)&lt;/li&gt;
&lt;li&gt;Filters invalid or duplicate records&lt;/li&gt;
&lt;li&gt;Implements the logical data model&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Gold (Application):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tailored views for specific use cases&lt;/li&gt;
&lt;li&gt;Executive dashboards, Sales reports, AI agent context&lt;/li&gt;
&lt;li&gt;Minimal transformation : mostly selecting from Silver views&lt;/li&gt;
&lt;li&gt;Documented with business-friendly names and descriptions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt;, these layers are implemented as virtual datasets (SQL views) organized in Spaces. Each view is documented with Wikis, tagged with Labels, and governed with Fine-Grained Access Control. The logical model lives in the platform, not in scattered dbt files or tribal knowledge.&lt;/p&gt;
&lt;h2&gt;Physical Modeling for Iceberg Tables&lt;/h2&gt;
&lt;p&gt;When you do create physical Iceberg tables (as opposed to views), the modeling considerations differ from traditional RDBMS:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Partitioning matters more than indexing.&lt;/strong&gt; Iceberg uses partition pruning instead of traditional B-tree indexes. Choose partition columns based on your most common query filters : typically date columns. Iceberg&apos;s hidden partitioning means users don&apos;t need to know the partition scheme to write efficient queries.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sort order affects scan performance.&lt;/strong&gt; Within each partition, Iceberg can sort data by specified columns. Sorting by a frequently filtered column (like &lt;code&gt;customer_id&lt;/code&gt; or &lt;code&gt;region&lt;/code&gt;) enables min/max pruning that skips irrelevant files.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Compaction replaces vacuum.&lt;/strong&gt; Small files accumulate from streaming inserts. Regular compaction rewrites many small files into fewer large files, improving scan performance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Schema evolution is non-destructive.&lt;/strong&gt; Adding a column to an Iceberg table doesn&apos;t rewrite existing files. Old files return &lt;code&gt;null&lt;/code&gt; for the new column. This makes the physical model more adaptable than traditional databases.&lt;/p&gt;
&lt;h2&gt;Challenges to Watch For&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;No referential integrity enforcement.&lt;/strong&gt; The lakehouse won&apos;t stop you from inserting an order with a &lt;code&gt;customer_id&lt;/code&gt; that doesn&apos;t exist in the customers table. Build data quality checks in your pipelines.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Schema drift across sources.&lt;/strong&gt; When sources change their schemas unexpectedly, your Bronze layer must handle it. Design Bronze views to be tolerant of new or missing columns.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Over-reliance on views.&lt;/strong&gt; Views are powerful, but deeply nested views (View D reads from View C reads from View B reads from View A) create performance and debugging challenges. Keep the chain to three levels when possible.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/04/medallion-architecture.png&quot; alt=&quot;Layered view architecture from raw data through business logic to consumption-ready outputs&quot;&gt;&lt;/p&gt;
&lt;p&gt;If you&apos;re moving from a traditional warehouse to a lakehouse, start by recreating your most-used tables as Iceberg tables and your most-used transformations as SQL views. Organize those views into Bronze, Silver, and Gold layers. Measure whether query performance meets your SLAs : and if it doesn&apos;t, add Reflections to optimize the heavy queries without changing the logical model.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Semantic Layer vs. Data Catalog: Complementary, Not Competing</title><link>https://iceberglakehouse.com/posts/2026-02-sl-semantic-layer-vs-data-catalog/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-sl-semantic-layer-vs-data-catalog/</guid><description>
![Data catalog and semantic layer : complementary systems bridged together](/assets/images/semantic_layer/04/catalog-vs-semantic.png)

&quot;We already ha...</description><pubDate>Wed, 18 Feb 2026 12:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/04/catalog-vs-semantic.png&quot; alt=&quot;Data catalog and semantic layer : complementary systems bridged together&quot;&gt;&lt;/p&gt;
&lt;p&gt;&amp;quot;We already have a data catalog, so we don&apos;t need a semantic layer.&amp;quot; This is one of the most common misconceptions in modern data architecture. Catalogs and semantic layers both deal with metadata. They both improve data accessibility. But they solve fundamentally different problems.&lt;/p&gt;
&lt;p&gt;Swapping one for the other leaves a critical gap in your stack.&lt;/p&gt;
&lt;h2&gt;What a Data Catalog Does&lt;/h2&gt;
&lt;p&gt;A data catalog is a searchable inventory of your organization&apos;s data assets. Think of it as a library card system for data. It tells you what data exists, where it lives, who owns it, and how it flows through your systems.&lt;/p&gt;
&lt;p&gt;Key functions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Discovery&lt;/strong&gt;: Find tables, views, files, and dashboards by searching keywords, tags, or owners&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lineage&lt;/strong&gt;: Trace how data moves from source to destination, including every transformation along the way&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Governance metadata&lt;/strong&gt;: Track data quality scores, classification (PII, confidential), and compliance status&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documentation&lt;/strong&gt;: Store descriptions of assets, often crowd-sourced from data producers and consumers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A data catalog is fundamentally a &lt;strong&gt;passive system&lt;/strong&gt;. You search it, browse it, and read from it. It doesn&apos;t change how queries execute or how metrics are calculated. It organizes information &lt;em&gt;about&lt;/em&gt; data.&lt;/p&gt;
&lt;h2&gt;What a Semantic Layer Does&lt;/h2&gt;
&lt;p&gt;A semantic layer defines what data &lt;strong&gt;means&lt;/strong&gt; and how to &lt;strong&gt;use it correctly&lt;/strong&gt;. It&apos;s an active system that sits between your raw data and the tools querying it.&lt;/p&gt;
&lt;p&gt;Key functions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Metric definitions&lt;/strong&gt;: Revenue, Churn Rate, Active Users : calculated one way, everywhere&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Query translation&lt;/strong&gt;: Converts business questions into optimized SQL&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Access enforcement&lt;/strong&gt;: Row-level security and column masking applied at query time&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documentation&lt;/strong&gt;: Wikis and labels attached to views and columns&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A semantic layer &lt;strong&gt;actively participates&lt;/strong&gt; in every query. When a user asks &amp;quot;What was revenue by region?&amp;quot;, the semantic layer translates &amp;quot;revenue&amp;quot; into the correct SQL formula, joins the right tables, applies security filters, and returns the result.&lt;/p&gt;
&lt;h2&gt;Side-by-Side Comparison&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/04/catalog-vs-semantic-action.png&quot; alt=&quot;Data catalog vs. semantic layer in action : search vs. query&quot;&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Data Catalog&lt;/th&gt;
&lt;th&gt;Semantic Layer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Primary question answered&lt;/td&gt;
&lt;td&gt;&amp;quot;What data do we have?&amp;quot;&lt;/td&gt;
&lt;td&gt;&amp;quot;What does this data mean?&amp;quot;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System behavior&lt;/td&gt;
&lt;td&gt;Passive (search &amp;amp; browse)&lt;/td&gt;
&lt;td&gt;Active (query translation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope&lt;/td&gt;
&lt;td&gt;All metadata across assets&lt;/td&gt;
&lt;td&gt;Business definitions, metrics, security&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lineage&lt;/td&gt;
&lt;td&gt;Tracks data flow&lt;/td&gt;
&lt;td&gt;Defines calculation logic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query execution&lt;/td&gt;
&lt;td&gt;Does not execute queries&lt;/td&gt;
&lt;td&gt;Translates and optimizes queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Access control&lt;/td&gt;
&lt;td&gt;Documents policies&lt;/td&gt;
&lt;td&gt;Enforces policies at query time&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The catalog tells you a table called &lt;code&gt;orders&lt;/code&gt; exists in the &lt;code&gt;production&lt;/code&gt; schema. The semantic layer tells you that &amp;quot;Revenue&amp;quot; means &lt;code&gt;SUM(orders.total) WHERE status = &apos;completed&apos;&lt;/code&gt;, joins it to &lt;code&gt;customers&lt;/code&gt; on &lt;code&gt;customer_id&lt;/code&gt;, and filters results based on the querying user&apos;s role.&lt;/p&gt;
&lt;h2&gt;Why You Need Both&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;A catalog without a semantic layer&lt;/strong&gt;: Users find data but don&apos;t know how to use it correctly. They discover the &lt;code&gt;orders&lt;/code&gt; table but write their own revenue formula, which differs from the formula Finance uses. Data is discoverable but inconsistently interpreted.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A semantic layer without a catalog&lt;/strong&gt;: Users get accurate, governed queries for the datasets the semantic layer covers. But they can&apos;t discover datasets outside the layer. New data sources, experimental tables, and raw files remain invisible until someone manually adds views.&lt;/p&gt;
&lt;p&gt;The best architectures integrate both. The catalog handles discovery and lineage across &lt;em&gt;everything&lt;/em&gt;. The semantic layer handles meaning, calculation, and governance for the business-critical datasets that drive decisions.&lt;/p&gt;
&lt;h2&gt;What Integration Looks Like&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/04/catalog-architecture.png&quot; alt=&quot;Catalog and semantic layer combined in an integrated architecture&quot;&gt;&lt;/p&gt;
&lt;p&gt;An integrated system gives you a single interface where data discovery and business context exist side by side. You search the catalog to find a dataset. You see its semantic layer definition :  the metric formulas, documentation, labels, and access policies ,  alongside the catalog metadata (lineage, quality, ownership).&lt;/p&gt;
&lt;p&gt;Dremio achieves this with its &lt;a href=&quot;https://www.dremio.com/blog/5-ways-dremio-delivers-an-apache-iceberg-lakehouse-without-the-headaches/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Open Catalog&lt;/a&gt; (built on Apache Polaris, the open-source Iceberg REST catalog standard) combined with its semantic layer features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Open Catalog&lt;/strong&gt; provides the inventory: tables, views, sources, and their lineage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Virtual datasets&lt;/strong&gt; (SQL views) define business logic and metric calculations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wikis&lt;/strong&gt; document what each dataset and column means&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Labels&lt;/strong&gt; tag data for governance and discoverability (PII, Finance, Certified)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;FGAC&lt;/strong&gt; enforces row/column security at query time&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;AI agents benefit from this integration directly. They use the catalog to navigate available datasets (what tables exist in the &amp;quot;Sales&amp;quot; space?) and the semantic layer to generate accurate queries (what does &amp;quot;Revenue&amp;quot; mean, and who can see which rows?). Remove either piece, and the AI is either blind to available data or confidently generating wrong SQL.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Open your current data catalog and pick a business-critical table. Can you see how its key metric is calculated? Who can access which rows? What the column names mean in business terms? If the catalog only shows you &lt;em&gt;that the table exists&lt;/em&gt;, you&apos;ve identified the gap a semantic layer fills.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Data Quality Is a Pipeline Problem, Not a Dashboard Problem</title><link>https://iceberglakehouse.com/posts/2026-02-debp-data-quality-first/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-debp-data-quality-first/</guid><description>
![Data quality checks enforced at the pipeline validation stage before data reaches consumers](/assets/images/debp/03/data-quality-pipeline.png)

Whe...</description><pubDate>Wed, 18 Feb 2026 11:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/debp/03/data-quality-pipeline.png&quot; alt=&quot;Data quality checks enforced at the pipeline validation stage before data reaches consumers&quot;&gt;&lt;/p&gt;
&lt;p&gt;When an analyst finds null values in a revenue column, the typical response is to add a calculated field in the BI tool: &lt;code&gt;IF revenue IS NULL THEN 0&lt;/code&gt;. That &amp;quot;fix&amp;quot; doesn&apos;t fix anything. It masks a problem at the source : and every downstream consumer has to independently discover and patch the same issue.&lt;/p&gt;
&lt;p&gt;Data quality is a pipeline problem. It should be enforced where data enters your system, not where it exits as a chart.&lt;/p&gt;
&lt;h2&gt;The Dashboard Isn&apos;t Where Quality Gets Fixed&lt;/h2&gt;
&lt;p&gt;Quality problems that surface in dashboards have already propagated through every layer of your stack: raw tables, transformed models, aggregations, caches, and API endpoints. By the time an analyst spots a zero-revenue row, the bad record has been used to train ML models, trigger automated alerts, and populate executive reports.&lt;/p&gt;
&lt;p&gt;Fixing quality at the point of consumption is reactive, fragmented, and unrepeatable. Every team applies different patches. Every new consumer rediscovers the same problems.&lt;/p&gt;
&lt;p&gt;Fixing quality at the point of ingestion is proactive, centralized, and consistent. Every downstream consumer benefits from the same validated data.&lt;/p&gt;
&lt;h2&gt;Six Dimensions of Data Quality&lt;/h2&gt;
&lt;p&gt;Not all quality problems are the same. Categorizing them helps you build targeted checks:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Completeness.&lt;/strong&gt; Are required fields populated? A customer record missing an email address might be acceptable. A transaction record missing an amount is not.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Accuracy.&lt;/strong&gt; Do values reflect reality? An age of 250 is syntactically valid but factually wrong. Accuracy checks require domain knowledge and range validation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Consistency.&lt;/strong&gt; Do the same facts agree across sources? If your CRM says a customer is in Texas and your billing system says California, you have a consistency problem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Timeliness.&lt;/strong&gt; Did the data arrive when expected? A daily feed that arrives 6 hours late might still be correct : but any dashboards refreshed before it arrived showed stale numbers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Uniqueness.&lt;/strong&gt; Are there duplicate records? Double-counted revenue is worse than no revenue. Deduplication on business keys (order ID, event ID) is essential.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Validity.&lt;/strong&gt; Do values conform to expected formats and ranges? Dates in the future, negative quantities, email addresses without @ signs : structural validation catches these before they corrupt downstream logic.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/03/quality-dimensions.png&quot; alt=&quot;Six dimensions: completeness, accuracy, consistency, timeliness, uniqueness, validity&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Enforce Quality Inside the Pipeline&lt;/h2&gt;
&lt;p&gt;Add a validation stage between ingestion and transformation. This stage checks every record against defined quality rules and routes failures to a quarantine table.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Schema validation.&lt;/strong&gt; Check column names, data types, and required vs. optional fields. If the source adds or removes a column, catch it here : not when a transformation SQL query fails.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Range and format checks.&lt;/strong&gt; Ensure numeric values fall within expected ranges (0 ≤ price ≤ 1,000,000). Validate date formats, email patterns, and enum values against allowed lists.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Referential checks.&lt;/strong&gt; Verify that foreign key values exist in their reference tables. An order referencing a non-existent customer ID means either the order is invalid or the customer pipeline is behind.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Volume checks.&lt;/strong&gt; Compare the row count of the incoming batch against historical baselines. A daily feed that usually delivers 50,000 rows but arrives with 500 rows should trigger an alert, not proceed silently.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Freshness checks.&lt;/strong&gt; Validate that event timestamps fall within the expected window. A batch of events all timestamped from three days ago may indicate a delayed replay, not current data.&lt;/p&gt;
&lt;h2&gt;Quarantine, Don&apos;t Drop&lt;/h2&gt;
&lt;p&gt;When a record fails validation, don&apos;t drop it. Route it to a quarantine table with metadata: which check failed, when, and the original record content.&lt;/p&gt;
&lt;p&gt;Dropping bad records silently creates invisible data loss. Your row counts won&apos;t match, your aggregations will undercount, and no one will know why.&lt;/p&gt;
&lt;p&gt;Quarantined records give you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Visibility.&lt;/strong&gt; You know how many records failed and why.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Recovery.&lt;/strong&gt; When the quality rule was too strict (false positive), you can reprocess quarantined records.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Root cause analysis.&lt;/strong&gt; Patterns in quarantine (e.g., all failures from one source) help you fix the actual problem upstream.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accountability.&lt;/strong&gt; You can report quality rates per source, per pipeline, per day.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Track Quality Like You Track Uptime&lt;/h2&gt;
&lt;p&gt;Pipeline monitoring typically covers: did the job run? Did it succeed? How long did it take? Quality monitoring adds: how many records passed validation? What percentage failed? Which checks triggered the most failures?&lt;/p&gt;
&lt;p&gt;Build quality metrics into your monitoring dashboards:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pass/fail ratio&lt;/strong&gt; per pipeline, per day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Failure breakdown&lt;/strong&gt; by quality dimension (completeness, accuracy, etc.)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trend lines&lt;/strong&gt; to catch gradual degradation before it becomes critical&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SLA tracking&lt;/strong&gt; for freshness and completeness targets&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/03/quality-monitoring.png&quot; alt=&quot;Quality monitoring: pass/fail ratios, trend lines, and SLA tracking alongside pipeline metrics&quot;&gt;&lt;/p&gt;
&lt;p&gt;Alert on quality regressions the same way you alert on pipeline failures. A pipeline that runs successfully but produces 30% invalid records is worse than one that fails outright : because it&apos;s silently wrong.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Audit your most important pipeline. Add a validation stage with checks for completeness, uniqueness, and volume. Route failures to a quarantine table. Within a week, you&apos;ll know more about your data quality than any dashboard could tell you.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Star Schema vs. Snowflake Schema: When to Use Each</title><link>https://iceberglakehouse.com/posts/2026-02-dm-star-schema-vs-snowflake/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-dm-star-schema-vs-snowflake/</guid><description>
![Star schema with central fact table surrounded by denormalized dimension tables](/assets/images/data_modeling/03/star-vs-snowflake.png)

Both star ...</description><pubDate>Wed, 18 Feb 2026 11:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/03/star-vs-snowflake.png&quot; alt=&quot;Star schema with central fact table surrounded by denormalized dimension tables&quot;&gt;&lt;/p&gt;
&lt;p&gt;Both star schemas and snowflake schemas are dimensional models. They both organize data into fact tables (measurable events) and dimension tables (context about those events). The difference is how they structure the dimensions.&lt;/p&gt;
&lt;p&gt;That structural difference affects query performance, storage efficiency, SQL complexity, and how easily BI tools and AI agents can interpret your data. Here&apos;s how to choose.&lt;/p&gt;
&lt;h2&gt;The Two Patterns of Dimensional Modeling&lt;/h2&gt;
&lt;p&gt;Dimensional modeling separates data into two types:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fact tables&lt;/strong&gt; store measurable events : a sale, a page view, a shipment, a login. Each row represents one event. Columns include numeric measures (revenue, quantity, duration) and foreign keys pointing to dimension tables.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dimension tables&lt;/strong&gt; provide context for facts , who (customer), what (product), when (date), where (location), how (channel). Dimensions describe the &amp;quot;business words&amp;quot; people use to filter, group, and label their analysis.&lt;/p&gt;
&lt;p&gt;Star and snowflake schemas differ in how they organize those dimension tables.&lt;/p&gt;
&lt;h2&gt;Star Schema: Denormalized Dimensions&lt;/h2&gt;
&lt;p&gt;In a star schema, each dimension is a single, denormalized table. All attributes for a dimension live in one place.&lt;/p&gt;
&lt;p&gt;A product dimension contains the product name, category, subcategory, department, and brand : all in one table. This means some values repeat. Every product in the &amp;quot;Electronics&amp;quot; category stores the string &amp;quot;Electronics&amp;quot; in its row.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fewer joins per query. A typical star schema query joins the fact table to 3-5 dimension tables. That&apos;s it.&lt;/li&gt;
&lt;li&gt;Simpler SQL. Analysts write shorter, more readable queries.&lt;/li&gt;
&lt;li&gt;Faster query performance. Fewer joins means less work for the query engine.&lt;/li&gt;
&lt;li&gt;Better BI tool compatibility. Most BI tools expect star schemas and generate optimal SQL against them.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Tradeoff:&lt;/strong&gt; Data redundancy in dimensions. If the &amp;quot;Electronics&amp;quot; department changes its name, you update it in every row that references it.&lt;/p&gt;
&lt;h2&gt;Snowflake Schema: Normalized Dimensions&lt;/h2&gt;
&lt;p&gt;In a snowflake schema, dimensions are normalized into sub-tables. Instead of one product dimension, you have separate tables for Product, Category, Subcategory, and Department, linked by foreign keys.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/03/snowflake-schema-detail.png&quot; alt=&quot;Snowflake schema with fact table and normalized, branching dimension tables&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Less storage redundancy. Each value stored once. &amp;quot;Electronics&amp;quot; appears in one row of the Department table.&lt;/li&gt;
&lt;li&gt;Single source of truth per attribute. Rename a department in one row instead of thousands.&lt;/li&gt;
&lt;li&gt;Aligns with OLTP normalization practices. Familiar to engineers coming from transactional database backgrounds.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Tradeoff:&lt;/strong&gt; More joins per query. A query that would join 4 tables in a star schema might join 8-12 tables in a snowflake schema. SQL gets longer, more complex, and harder for analysts to write without help.&lt;/p&gt;
&lt;h2&gt;Side-by-Side Comparison&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Star Schema&lt;/th&gt;
&lt;th&gt;Snowflake Schema&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dimension structure&lt;/td&gt;
&lt;td&gt;Denormalized (flat)&lt;/td&gt;
&lt;td&gt;Normalized (branching)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tables per query&lt;/td&gt;
&lt;td&gt;Fewer (4-6 typical)&lt;/td&gt;
&lt;td&gt;More (8-12 typical)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query performance&lt;/td&gt;
&lt;td&gt;Faster&lt;/td&gt;
&lt;td&gt;Slower (more joins)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL complexity&lt;/td&gt;
&lt;td&gt;Simpler&lt;/td&gt;
&lt;td&gt;More complex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage efficiency&lt;/td&gt;
&lt;td&gt;Lower (some redundancy)&lt;/td&gt;
&lt;td&gt;Higher&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BI tool compatibility&lt;/td&gt;
&lt;td&gt;Better&lt;/td&gt;
&lt;td&gt;Harder&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ETL/pipeline complexity&lt;/td&gt;
&lt;td&gt;Simpler loads&lt;/td&gt;
&lt;td&gt;More complex loads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-service friendliness&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Update granularity&lt;/td&gt;
&lt;td&gt;Update many rows&lt;/td&gt;
&lt;td&gt;Update one row&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;When to Choose Which&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Choose a star schema when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your primary workload is analytics and reporting&lt;/li&gt;
&lt;li&gt;Business users run ad-hoc queries or use BI tools&lt;/li&gt;
&lt;li&gt;Query performance matters more than storage costs&lt;/li&gt;
&lt;li&gt;You want AI agents to generate accurate SQL (fewer joins = fewer mistakes)&lt;/li&gt;
&lt;li&gt;Your dimensions are small enough that redundancy is negligible&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Choose a snowflake schema when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dimensions are very large and redundancy has real storage costs&lt;/li&gt;
&lt;li&gt;Regulatory requirements demand a single canonical source per attribute&lt;/li&gt;
&lt;li&gt;Only ETL engineers (not analysts) write queries against the model&lt;/li&gt;
&lt;li&gt;You need strict referential integrity across dimension hierarchies&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Why Star Schema Usually Wins&lt;/h2&gt;
&lt;p&gt;Three changes in modern data platforms have tilted the balance toward star schemas:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Storage is cheap.&lt;/strong&gt; Object storage costs a fraction of a cent per gigabyte per month. The storage savings from normalizing dimensions rarely justify the query complexity cost.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Columnar formats compress redundancy well.&lt;/strong&gt; Parquet and ORC store data in columns. Repeated values like &amp;quot;Electronics&amp;quot; compress to nearly nothing. The physical storage overhead of a denormalized dimension is much smaller than it appears in row-oriented thinking.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AI and self-service need simplicity.&lt;/strong&gt; When an AI agent generates SQL against your data model, fewer tables and fewer joins reduce the chance of hallucinated join paths. When a business analyst builds a report, fewer joins reduce the chance of wrong results.&lt;/p&gt;
&lt;p&gt;Platforms like &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt; make this choice even easier. Virtual datasets let you model star schemas as SQL views without physically copying or denormalizing data. Reflections automatically optimize query performance in the background. You get the simplicity of a star schema with optimized physical performance, regardless of how the underlying data is stored.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/03/star-schema-optimization.png&quot; alt=&quot;Star schema query execution flowing through a query engine with automatic optimization&quot;&gt;&lt;/p&gt;
&lt;p&gt;Take your most-used fact table. Count the joins required to build a complete report. If you&apos;re joining more than five dimension tables, or if dimension tables themselves require sub-joins, consider flattening your dimensions into a star schema. Measure the query performance difference. In most cases, the improvement is significant and the storage increase is negligible.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Semantic Layer vs. Metrics Layer: What&apos;s the Difference?</title><link>https://iceberglakehouse.com/posts/2026-02-sl-semantic-layer-vs-metrics-layer/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-sl-semantic-layer-vs-metrics-layer/</guid><description>
![Semantic layer vs metrics layer : the metrics layer is a subset](/assets/images/semantic_layer/03/semantic-vs-metrics.png)

Both terms appear in ev...</description><pubDate>Wed, 18 Feb 2026 11:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/03/semantic-vs-metrics.png&quot; alt=&quot;Semantic layer vs metrics layer : the metrics layer is a subset&quot;&gt;&lt;/p&gt;
&lt;p&gt;Both terms appear in every modern data architecture diagram. They&apos;re used interchangeably in conference talks, Slack threads, and vendor marketing. And almost nobody defines them precisely.&lt;/p&gt;
&lt;p&gt;Here&apos;s the difference, why it matters, and what it means for how you build your data platform.&lt;/p&gt;
&lt;h2&gt;What a Metrics Layer Does&lt;/h2&gt;
&lt;p&gt;A metrics layer has one job: define how business metrics are calculated and make those definitions available to every tool in your stack.&lt;/p&gt;
&lt;p&gt;Take Revenue. Without a metrics layer, the formula lives in a dashboard filter, a dbt model, a Python notebook, and three different analysts&apos; heads. With a metrics layer, the formula is defined once:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Revenue = SUM(order_total) WHERE status = &apos;completed&apos; AND refunded = FALSE
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Every dashboard, API endpoint, and AI agent that needs &amp;quot;Revenue&amp;quot; pulls from this single definition. Change the formula in one place, and it updates everywhere.&lt;/p&gt;
&lt;p&gt;Metrics layers are typically code-defined. &lt;a href=&quot;https://docs.getdbt.com/docs/build/about-metricflow&quot;&gt;dbt&apos;s semantic layer&lt;/a&gt; uses YAML specifications. Cube.js uses JavaScript schemas. The metric definition includes the calculation, the time dimension, the allowed filters, and the grain.&lt;/p&gt;
&lt;p&gt;This is valuable. But it&apos;s incomplete.&lt;/p&gt;
&lt;h2&gt;What a Semantic Layer Does&lt;/h2&gt;
&lt;p&gt;A semantic layer does everything a metrics layer does, plus more. It covers the full abstraction between raw data and the people (and machines) querying it.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Metrics Layer&lt;/th&gt;
&lt;th&gt;Semantic Layer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Metric definitions (KPI calculations)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Documentation (table/column descriptions)&lt;/td&gt;
&lt;td&gt;Sometimes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Labels and tags (governance, discoverability)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Join relationships (pre-defined paths)&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Access policies (row/column security)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query optimization (caching, pre-aggregation)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Often&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;A metrics layer tells you &lt;em&gt;how to calculate&lt;/em&gt; a number. A semantic layer tells you &lt;em&gt;what the data means&lt;/em&gt;, &lt;em&gt;how to calculate it&lt;/em&gt;, &lt;em&gt;who can see it&lt;/em&gt;, &lt;em&gt;how to join it&lt;/em&gt;, and &lt;em&gt;where it came from&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;The Relationship: Subset, Not Alternative&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/03/metrics-subset.png&quot; alt=&quot;The metrics layer as a subset within the broader semantic layer&quot;&gt;&lt;/p&gt;
&lt;p&gt;A metrics layer is a component of a semantic layer. Not a replacement.&lt;/p&gt;
&lt;p&gt;Think of it like a spreadsheet. The metrics layer is the formulas: revenue calculations, growth rates, ratios. The semantic layer is the entire workbook: formulas, column headers, sheet labels, formatting, and sharing permissions. You can&apos;t have a useful workbook with just formulas. And you can&apos;t have a complete semantic layer without metric definitions.&lt;/p&gt;
&lt;p&gt;The confusion arose because different vendors built different pieces first. dbt built the metrics layer and called it a &amp;quot;semantic layer.&amp;quot; BI tools like Looker built semantic models (LookML) focused on relationships and query patterns. Platforms like &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt; built a full semantic layer that includes views, documentation, governance, and AI context in one integrated system.&lt;/p&gt;
&lt;h2&gt;Why the Distinction Matters&lt;/h2&gt;
&lt;p&gt;If you build a metrics layer but skip the rest of the semantic layer, you leave three gaps:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No documentation means no AI accuracy.&lt;/strong&gt; When an AI agent generates SQL, it needs more than metric formulas. It needs to know what each column represents, which tables to join, and what filters are valid. Metric definitions alone don&apos;t provide that. Wikis, labels, and column descriptions do. Without them, AI agents hallucinate joins and misinterpret fields.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No security means enforcement happens ad hoc.&lt;/strong&gt; A metrics layer doesn&apos;t include row-level security or column masking. Those policies get applied separately in each BI tool, each notebook, each API. One missed policy, and sensitive data leaks to the wrong role.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No join paths means redundant work.&lt;/strong&gt; If the metrics layer defines &amp;quot;Revenue&amp;quot; but doesn&apos;t define how to connect the Orders table to the Customers table, every consumer figures out the join independently. Some get it right. Some don&apos;t. You get conflicting results from a formula that was supposed to be centralized.&lt;/p&gt;
&lt;h2&gt;What This Looks Like in Practice&lt;/h2&gt;
&lt;p&gt;A platform with a full semantic layer, like Dremio, provides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Virtual datasets (SQL views)&lt;/strong&gt; that define business logic across federated sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wikis&lt;/strong&gt; that document tables and columns in human- and AI-readable format&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Labels&lt;/strong&gt; that tag data for governance (PII, Finance, Certified)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fine-Grained Access Control&lt;/strong&gt; that enforces row/column security at the view level&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reflections&lt;/strong&gt; that automatically optimize performance for the most-queried views&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI-generated metadata&lt;/strong&gt; that auto-populates descriptions and label suggestions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Compare that to a standalone metrics layer, which gives you metric definitions and (sometimes) basic documentation. The metrics layer is the engine. The semantic layer is the complete vehicle.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/03/when-to-choose.png&quot; alt=&quot;Choosing between a metrics layer and a full semantic layer&quot;&gt;&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;If you already have a metrics layer, audit what&apos;s missing. Do your metric definitions include documentation? Labels? Security policies? Join paths? If not, you have a piece of the semantic layer, not the whole thing.&lt;/p&gt;
&lt;p&gt;Completing the picture means either extending your metrics layer with those capabilities, or adopting a platform that provides them natively.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Design Reliable Data Pipelines</title><link>https://iceberglakehouse.com/posts/2026-02-debp-design-data-pipelines/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-debp-design-data-pipelines/</guid><description>
![Data pipeline architecture with four layers flowing from ingestion through staging, transformation, and serving](/assets/images/debp/02/pipeline-ar...</description><pubDate>Wed, 18 Feb 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/debp/02/pipeline-architecture.png&quot; alt=&quot;Data pipeline architecture with four layers flowing from ingestion through staging, transformation, and serving&quot;&gt;&lt;/p&gt;
&lt;p&gt;Most pipeline failures aren&apos;t caused by bad code. They&apos;re caused by no architecture. A script that reads from an API, transforms JSON, and writes to a database works fine on day one. On day ninety it fails at 3 AM because the API changed its response format, and the only way to recover is to rerun the entire pipeline from scratch : hoping that reprocessing three months of data doesn&apos;t create duplicates.&lt;/p&gt;
&lt;p&gt;Reliable pipelines are designed, not debugged into existence.&lt;/p&gt;
&lt;h2&gt;Reliability Is a Design Property, Not a Bug-Fix&lt;/h2&gt;
&lt;p&gt;You don&apos;t make a pipeline reliable by adding try-catch blocks after it breaks. You make it reliable by building reliability into the architecture from the start. That means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Resumability.&lt;/strong&gt; After a failure, you restart from where it stopped, not from the beginning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Idempotency.&lt;/strong&gt; Running the same job twice produces the same result.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Observability.&lt;/strong&gt; You know what the pipeline processed, how long it took, and where it is right now.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Isolation.&lt;/strong&gt; One failing stage doesn&apos;t cascade into unrelated stages.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These properties don&apos;t come from choosing the right framework. They come from how you structure the pipeline.&lt;/p&gt;
&lt;h2&gt;The Four Architecture Layers&lt;/h2&gt;
&lt;p&gt;Every well-designed pipeline has four distinct layers, even if they run in the same job:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ingestion.&lt;/strong&gt; Pull raw data from sources and land it unchanged. Don&apos;t transform here. Don&apos;t filter. Don&apos;t join. Store the raw data exactly as it arrived, with metadata (timestamp, source, batch ID). This gives you a replayable audit trail.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Staging.&lt;/strong&gt; Validate the raw data. Check for schema compliance, null values in required fields, duplicate records, and data type mismatches. Records that fail validation go to a quarantine table or dead-letter queue : they don&apos;t silently disappear.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Transformation.&lt;/strong&gt; Apply business logic: joins, aggregations, calculations, enrichments. This is where raw events become metrics, where customer records merge across sources, where timestamps convert to business periods. Keep business logic in one layer, not spread across ingestion and loading scripts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Serving.&lt;/strong&gt; Organize the transformed data for consumers. Analysts need star schemas. ML models need feature tables. APIs need denormalized lookups. The serving layer shapes data for its audience without changing the underlying transformation logic.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/02/four-layers.png&quot; alt=&quot;Stages: ingest raw data, validate in staging, apply business logic, serve to consumers&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Build a DAG, Not a Script&lt;/h2&gt;
&lt;p&gt;A script runs steps in order: step 1, step 2, step 3. If step 2 fails, you rerun from step 1. If step 3 needs a new input, you rewrite the script.&lt;/p&gt;
&lt;p&gt;A directed acyclic graph (DAG) models dependencies explicitly. Step 3 depends on step 2 and step 4. Step 2 and step 4 can run in parallel. If step 2 fails, you rerun step 2 : not steps 1, 4, or 3.&lt;/p&gt;
&lt;p&gt;DAG-based thinking gives you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Parallelism.&lt;/strong&gt; Independent stages run concurrently, cutting wall-clock time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Targeted retries.&lt;/strong&gt; Failed stages retry alone, not the entire workflow.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Clear dependencies.&lt;/strong&gt; You can see exactly what feeds into a given output.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Incremental development.&lt;/strong&gt; Add new stages without touching existing ones.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Even if your orchestrator doesn&apos;t enforce DAGs, design your pipeline as one. Document which stages depend on which outputs. Make each stage read from a defined input location and write to a defined output location.&lt;/p&gt;
&lt;h2&gt;Dependency Management&lt;/h2&gt;
&lt;p&gt;Implicit dependencies are the most common source of pipeline fragility. &amp;quot;This pipeline assumes table X exists because another pipeline created it&amp;quot; is an implicit dependency. When the other pipeline is delayed, skipped, or renamed, your pipeline breaks.&lt;/p&gt;
&lt;p&gt;Make dependencies explicit:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Declare data dependencies.&lt;/strong&gt; If stage B reads the output of stage A, model that relationship in your orchestration. Don&apos;t rely on timing (&amp;quot;A usually finishes by 6 AM&amp;quot;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use sensors or triggers.&lt;/strong&gt; Wait for data to arrive before starting a stage. Check for a file, a partition, or a row count : don&apos;t check the clock.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Version your interfaces.&lt;/strong&gt; When a producer changes its output schema, consumers should detect the change before they process stale or malformed data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Document ownership.&lt;/strong&gt; Every dataset should have an owner. When you depend on someone else&apos;s table, you should know who to contact when it changes.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Failure Handling Patterns&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Retry with backoff.&lt;/strong&gt; Most transient failures (network timeouts, API throttling, lock contention) resolve themselves. Retry 3-5 times with exponential backoff (e.g., 1s, 5s, 25s) before marking a stage as failed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dead-letter queues.&lt;/strong&gt; Records that cannot be processed (corrupt payloads, unexpected schemas, values out of range) go to a quarantine area. Log why they failed. Review them periodically. Don&apos;t drop them silently.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Circuit breakers.&lt;/strong&gt; If a downstream system returns errors consistently, stop sending requests after N failures. Resume with a health check. This prevents cascading failures and buffer exhaustion.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Checkpointing.&lt;/strong&gt; After processing each batch or partition, record what was completed. On failure, resume from the last checkpoint. This is the difference between a 5-minute recovery and a 5-hour reprocessing job.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/02/failure-patterns.png&quot; alt=&quot;Failure handling: retry, dead-letter queue, circuit breaker, checkpoint&quot;&gt;&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Map your current pipelines against the four architecture layers. Identify which layers are missing or mixed together. The most common gap: ingestion and transformation are in the same script, making it impossible to replay raw data or isolate failures. Separate them, and reliability follows.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Conceptual, Logical, and Physical Data Models Explained</title><link>https://iceberglakehouse.com/posts/2026-02-dm-types-of-data-models/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-dm-types-of-data-models/</guid><description>
![Three layers of data modeling from business concepts to database implementation](/assets/images/data_modeling/02/types-of-data-models.png)

Most da...</description><pubDate>Wed, 18 Feb 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/02/types-of-data-models.png&quot; alt=&quot;Three layers of data modeling from business concepts to database implementation&quot;&gt;&lt;/p&gt;
&lt;p&gt;Most data teams jump straight from a stakeholder request to creating database tables. They skip the planning steps that prevent misalignment, redundancy, and rework. The result: tables that make sense to the engineer who built them but confuse everyone else.&lt;/p&gt;
&lt;p&gt;Data modeling addresses this by working at three levels of abstraction. Each level answers a different question, for a different audience, at a different stage of the design process.&lt;/p&gt;
&lt;h2&gt;Why Three Levels Exist&lt;/h2&gt;
&lt;p&gt;A single data model can&apos;t serve every purpose. Business stakeholders need to see what data the system captures and how concepts relate. Data architects need to define precise structures, data types, and rules. Database engineers need to optimize storage and performance for a specific platform.&lt;/p&gt;
&lt;p&gt;Trying to capture all of this in one diagram creates a document that&apos;s too abstract for engineers and too technical for the business. Three levels solve this by separating concerns.&lt;/p&gt;
&lt;h2&gt;The Conceptual Data Model&lt;/h2&gt;
&lt;p&gt;A conceptual data model defines the big picture. It identifies the major entities your system needs to track and the relationships between them.&lt;/p&gt;
&lt;p&gt;For an e-commerce platform, a conceptual model might look like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Customer&lt;/strong&gt; places &lt;strong&gt;Order&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Order&lt;/strong&gt; contains &lt;strong&gt;Line Item&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Line Item&lt;/strong&gt; references &lt;strong&gt;Product&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Product&lt;/strong&gt; belongs to &lt;strong&gt;Category&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are no column names, no data types, no keys. The conceptual model exists to answer one question: &amp;quot;Do we agree on what data the system needs?&amp;quot;&lt;/p&gt;
&lt;p&gt;This model is created collaboratively with business stakeholders. Its value is alignment. When the finance team says &amp;quot;customer&amp;quot; and the marketing team says &amp;quot;customer,&amp;quot; the conceptual model ensures they mean the same thing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Skip this level&lt;/strong&gt;, and you build a database that captures the wrong entities or misses key relationships. Fixing structural errors after the database is in production costs 10x more than catching them at conception.&lt;/p&gt;
&lt;h2&gt;The Logical Data Model&lt;/h2&gt;
&lt;p&gt;The logical model adds precision to the conceptual model. It defines:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Attributes&lt;/strong&gt; for each entity (customer_id, customer_name, email, signup_date)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data types&lt;/strong&gt; (INTEGER, VARCHAR(255), DATE)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Primary keys&lt;/strong&gt; (customer_id uniquely identifies each customer)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Foreign keys&lt;/strong&gt; (order.customer_id references customer.customer_id)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Normalization rules&lt;/strong&gt; (eliminate redundancy up to Third Normal Form)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The logical model is intentionally DBMS-independent. It works whether you implement it in PostgreSQL, MySQL, Snowflake, or Apache Iceberg tables. This separation matters because it lets you evaluate the design on its own merits before committing to a specific technology.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/02/logical-model-detail.png&quot; alt=&quot;Logical model showing entities with attributes, data types, and relationship keys&quot;&gt;&lt;/p&gt;
&lt;p&gt;Normalization is the primary discipline at this level. The logical model eliminates data redundancy by splitting entities into their most atomic forms. A customer&apos;s address doesn&apos;t live in the orders table : it lives in its own table, referenced by a foreign key.&lt;/p&gt;
&lt;h2&gt;The Physical Data Model&lt;/h2&gt;
&lt;p&gt;The physical model translates the logical model into the exact implementation for a specific database engine. This is where theoretical design meets operational reality.&lt;/p&gt;
&lt;p&gt;A physical model specifies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Table names and column definitions (&lt;code&gt;customers&lt;/code&gt;, &lt;code&gt;orders&lt;/code&gt;, &lt;code&gt;line_items&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Data types specific to the DBMS (&lt;code&gt;BIGINT&lt;/code&gt; vs. &lt;code&gt;INTEGER&lt;/code&gt;, &lt;code&gt;TIMESTAMP_TZ&lt;/code&gt; vs. &lt;code&gt;TIMESTAMP&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Indexes for query performance (B-tree on &lt;code&gt;customer_id&lt;/code&gt;, hash on &lt;code&gt;email&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Partitioning strategies (partition &lt;code&gt;orders&lt;/code&gt; by &lt;code&gt;order_date&lt;/code&gt; using monthly ranges)&lt;/li&gt;
&lt;li&gt;Compression and file format choices (Parquet with Snappy compression for Iceberg)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The physical model is where performance tuning happens. You might denormalize at this level :  joining the customer name into the orders table to avoid an expensive join at query time ,  even though the logical model keeps them separate.&lt;/p&gt;
&lt;p&gt;In a lakehouse architecture, the physical model also includes Iceberg table properties: partition specs (time-based or value-based), sort orders for query optimization, and file format settings.&lt;/p&gt;
&lt;h2&gt;How the Three Levels Connect&lt;/h2&gt;
&lt;p&gt;Each level feeds the next:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Conceptual&lt;/th&gt;
&lt;th&gt;Logical&lt;/th&gt;
&lt;th&gt;Physical&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Abstraction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audience&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Business stakeholders&lt;/td&gt;
&lt;td&gt;Data architects&lt;/td&gt;
&lt;td&gt;Database engineers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Entities&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Named&lt;/td&gt;
&lt;td&gt;Defined with attributes&lt;/td&gt;
&lt;td&gt;Tables with typed columns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Relationships&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Named&lt;/td&gt;
&lt;td&gt;With cardinality and keys&lt;/td&gt;
&lt;td&gt;Foreign key constraints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data types&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Generic (INTEGER, VARCHAR)&lt;/td&gt;
&lt;td&gt;DBMS-specific (BIGINT, TEXT)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Normalization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not applicable&lt;/td&gt;
&lt;td&gt;Applied (3NF)&lt;/td&gt;
&lt;td&gt;May denormalize for performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not considered&lt;/td&gt;
&lt;td&gt;Not considered&lt;/td&gt;
&lt;td&gt;Indexes, partitions, caching&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In platforms like &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt;, you can implement all three levels using virtual datasets organized in a Medallion Architecture. Bronze views represent the physical layer (raw data mapped to typed columns). Silver views represent the logical layer (joins, business keys, normalized relationships). Gold views represent the conceptual layer (business entities ready for consumption, documented with Wikis and tagged with Labels).&lt;/p&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Skipping the conceptual model.&lt;/strong&gt; Engineers jump to table creation and miss requirement gaps that surface months later when a stakeholder asks &amp;quot;Why don&apos;t we track X?&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Building logical models tied to a DBMS.&lt;/strong&gt; If your logical model includes PostgreSQL-specific syntax, it&apos;s a physical model disguised as a logical one. This makes migration and evaluation harder.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Over-normalizing for analytics.&lt;/strong&gt; Third Normal Form is correct for transactional systems. But analytics workloads benefit from wider, flatter tables that reduce join counts. Know when to denormalize.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Under-documenting all levels.&lt;/strong&gt; A model without documentation is a puzzle. Column names like &lt;code&gt;c_id&lt;/code&gt;, &lt;code&gt;dt&lt;/code&gt;, and &lt;code&gt;amt&lt;/code&gt; save keystrokes and cost hours of confusion.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/02/data-model-stakeholders.png&quot; alt=&quot;Data models feeding into AI, dashboards, and governance systems&quot;&gt;&lt;/p&gt;
&lt;p&gt;Audit your current data platform against all three levels. Can you show a business stakeholder what entities your system tracks (conceptual)? Can you show an architect the precise attributes and relationships (logical)? Can you explain why the tables are partitioned and indexed the way they are (physical)?&lt;/p&gt;
&lt;p&gt;If any of those questions draws a blank, you have a gap worth filling.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Build a Semantic Layer: A Step-by-Step Guide</title><link>https://iceberglakehouse.com/posts/2026-02-sl-how-to-build-semantic-layer/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-sl-how-to-build-semantic-layer/</guid><description>
![Building a semantic layer : Bronze, Silver, and Gold tiers](/assets/images/semantic_layer/02/build-semantic-layer.png)

Most teams start building a...</description><pubDate>Wed, 18 Feb 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/02/build-semantic-layer.png&quot; alt=&quot;Building a semantic layer : Bronze, Silver, and Gold tiers&quot;&gt;&lt;/p&gt;
&lt;p&gt;Most teams start building a semantic layer the wrong way: they open their BI tool, create a few calculated fields, and call it done. Six months later, three dashboards define &amp;quot;churn&amp;quot; differently, nobody trusts the numbers, and the data team is debugging metric discrepancies instead of building new features.&lt;/p&gt;
&lt;p&gt;A well-built semantic layer prevents all of that. Here&apos;s how to do it right.&lt;/p&gt;
&lt;h2&gt;Start With Metrics, Not Data Models&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/02/metric-alignment.png&quot; alt=&quot;Stakeholders aligning on unified metric definitions&quot;&gt;&lt;/p&gt;
&lt;p&gt;Before writing a single line of SQL, sit down with stakeholders from Sales, Finance, Marketing, and Product. Agree on the top 5-10 business metrics your organization uses to make decisions.&lt;/p&gt;
&lt;p&gt;For each metric, document:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The calculation&lt;/strong&gt;: Revenue = SUM(order_total) WHERE status = &apos;completed&apos; AND refunded = FALSE&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The owner&lt;/strong&gt;: Who is accountable for this definition?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The grain&lt;/strong&gt;: Daily? Monthly? Per customer?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The refresh cadence&lt;/strong&gt;: Real-time? Daily batch? Weekly?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This exercise is harder than it sounds. You will discover that &amp;quot;Monthly Active Users&amp;quot; has three competing definitions. That&apos;s the point. The semantic layer can&apos;t resolve disagreements that haven&apos;t been surfaced yet.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;: A metric glossary. This becomes the source document for everything you build next.&lt;/p&gt;
&lt;h2&gt;Map Your Data Sources&lt;/h2&gt;
&lt;p&gt;Inventory every system that feeds into your analytics:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source Type&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Access Pattern&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Transactional databases&lt;/td&gt;
&lt;td&gt;PostgreSQL, MySQL, SQL Server&lt;/td&gt;
&lt;td&gt;Federated query (read-only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud data lakes&lt;/td&gt;
&lt;td&gt;S3 (Parquet/Iceberg), Azure Data Lake&lt;/td&gt;
&lt;td&gt;Direct scan or catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SaaS platforms&lt;/td&gt;
&lt;td&gt;Salesforce, HubSpot, Stripe&lt;/td&gt;
&lt;td&gt;API extraction or replication&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spreadsheets&lt;/td&gt;
&lt;td&gt;Google Sheets, Excel&lt;/td&gt;
&lt;td&gt;One-time import or scheduled sync&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Not all sources need to be replicated into a central store. Federation lets you query data where it lives without the cost and complexity of ETL pipelines. Platforms like &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt; connect to dozens of sources and present them in a single namespace, so your semantic layer can span everything without data movement.&lt;/p&gt;
&lt;h2&gt;Design the Three-Layer View Structure&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/02/three-layer-arch.png&quot; alt=&quot;Bronze, Silver, and Gold data layers in the Medallion Architecture&quot;&gt;&lt;/p&gt;
&lt;p&gt;The most effective semantic layer architecture uses three layers of SQL views, commonly called the Medallion Architecture.&lt;/p&gt;
&lt;h3&gt;Bronze Layer (Preparation)&lt;/h3&gt;
&lt;p&gt;Create one view per raw source table. Apply no business logic. Just make the data human-readable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Rename cryptic columns: &lt;code&gt;col_7&lt;/code&gt; → &lt;code&gt;OrderDate&lt;/code&gt;, &lt;code&gt;cust_id&lt;/code&gt; → &lt;code&gt;CustomerID&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Cast types to standard formats: strings to dates, integers to decimals&lt;/li&gt;
&lt;li&gt;Normalize timestamps to UTC&lt;/li&gt;
&lt;li&gt;Avoid using SQL reserved words as column names (&lt;code&gt;Timestamp&lt;/code&gt;, &lt;code&gt;Date&lt;/code&gt;, &lt;code&gt;Role&lt;/code&gt; will force double-quoting in every downstream query. Use &lt;code&gt;EventTimestamp&lt;/code&gt;, &lt;code&gt;TransactionDate&lt;/code&gt;, &lt;code&gt;UserRole&lt;/code&gt; instead.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Bronze views should be boring. Their only job is to make raw data safe to work with.&lt;/p&gt;
&lt;h3&gt;Silver Layer (Business Logic)&lt;/h3&gt;
&lt;p&gt;This is where your metric glossary becomes code. Silver views join Bronze views, deduplicate records, filter invalid data, and apply business rules.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW silver.orders_enriched AS
SELECT
    o.OrderID,
    o.OrderDate,
    o.Total AS OrderTotal,
    c.Region,
    c.Segment
FROM bronze.orders_raw o
JOIN bronze.customers_raw c ON o.CustomerID = c.CustomerID
WHERE o.Total &amp;gt; 0 AND o.Status = &apos;completed&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each Silver view encodes exactly one business concept. &amp;quot;Revenue&amp;quot; is defined in one place. Every dashboard, notebook, and AI agent that needs revenue queries this view. No exceptions.&lt;/p&gt;
&lt;h3&gt;Gold Layer (Application)&lt;/h3&gt;
&lt;p&gt;Gold views are pre-aggregated for specific consumers. A BI dashboard gets &lt;code&gt;monthly_revenue_by_region&lt;/code&gt;. An AI agent gets &lt;code&gt;customer_360_summary&lt;/code&gt;. A finance report gets &lt;code&gt;quarterly_financial_summary&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Gold views don&apos;t add new business logic. They aggregate and reshape Silver views for performance and usability.&lt;/p&gt;
&lt;h2&gt;Document Everything : or Let AI Help&lt;/h2&gt;
&lt;p&gt;An undocumented semantic layer is a semantic layer nobody uses. Every table and every column should have a description that explains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What the data represents&lt;/li&gt;
&lt;li&gt;Where it comes from&lt;/li&gt;
&lt;li&gt;Any known limitations or caveats&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is tedious work. Modern platforms accelerate it with AI. Dremio&apos;s generative AI, for example, can auto-generate Wiki descriptions by sampling table data, and suggest Labels (tags like &amp;quot;PII,&amp;quot; &amp;quot;Finance,&amp;quot; &amp;quot;Certified&amp;quot;) for governance and discoverability. The AI provides a 70% first draft. Your data team fills in the domain-specific context.&lt;/p&gt;
&lt;p&gt;This documentation serves two audiences: human analysts browsing the catalog, and AI agents that need context to generate accurate SQL. Both benefit from rich, accurate descriptions.&lt;/p&gt;
&lt;h2&gt;Enforce Access Policies at the Layer&lt;/h2&gt;
&lt;p&gt;Security should be embedded in the semantic layer, not applied after the fact in each tool. Two patterns:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Row-Level Security&lt;/strong&gt;: Filter what data a user can see based on their role. A regional manager sees only their region&apos;s data. The SQL view applies the filter automatically.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Column Masking&lt;/strong&gt;: Mask sensitive columns (SSN, email, salary) for roles that don&apos;t need them. Analysts see &lt;code&gt;****@email.com&lt;/code&gt;. Data engineers see the full value.&lt;/p&gt;
&lt;p&gt;The advantage of enforcing policies at the semantic layer: every downstream query inherits the rules, whether the query comes from a dashboard, a Python notebook, or an AI agent. No gaps.&lt;/p&gt;
&lt;h2&gt;Start Small, Then Expand&lt;/h2&gt;
&lt;p&gt;Don&apos;t try to model your entire data landscape on day one. Start with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;3-5 core metrics from your glossary&lt;/li&gt;
&lt;li&gt;The 2-3 source systems those metrics depend on&lt;/li&gt;
&lt;li&gt;One Bronze → Silver → Gold pipeline per metric&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Validate by running the same question across two different tools (a BI dashboard and a SQL notebook, for example). If both return the same number, the semantic layer is working. If they don&apos;t, fix the Silver view definition before adding more.&lt;/p&gt;
&lt;p&gt;Once the first metrics are stable, expand incrementally. Add new sources, new Silver views, new Gold views. Each addition is low-risk because the layered structure isolates changes.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Pick the metric your organization argues about the most. Define it explicitly in a Silver view. Test it against the current dashboards. If the numbers match, you&apos;ve validated the approach. If they don&apos;t, you&apos;ve just found the inconsistency that&apos;s been silently costing your organization trust.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Think Like a Data Engineer</title><link>https://iceberglakehouse.com/posts/2026-02-debp-think-like-data-engineer/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-debp-think-like-data-engineer/</guid><description>
![Data flowing through a system of interconnected pipeline stages from sources to consumers](/assets/images/debp/01/data-engineer-mindset.png)

The m...</description><pubDate>Wed, 18 Feb 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/debp/01/data-engineer-mindset.png&quot; alt=&quot;Data flowing through a system of interconnected pipeline stages from sources to consumers&quot;&gt;&lt;/p&gt;
&lt;p&gt;The median lifespan of a popular data tool is about three years. The tool you master today may be deprecated or replaced by the time your next project ships. What doesn&apos;t change are the principles underneath: how data flows, how systems fail, how contracts between producers and consumers work, and how to decompose messy requirements into clean, maintainable pipelines.&lt;/p&gt;
&lt;p&gt;Thinking like a data engineer means solving problems at the systems level, not the tool level. It means asking &amp;quot;what could go wrong?&amp;quot; before asking &amp;quot;what framework should I use?&amp;quot;&lt;/p&gt;
&lt;h2&gt;Tools Change : Principles Don&apos;t&lt;/h2&gt;
&lt;p&gt;Every year brings a new orchestrator, a new streaming framework, a new columnar format. Teams that build their expertise around a specific tool struggle when the landscape shifts. Teams that build expertise around principles :  idempotency, schema contracts, data quality at the source, composable stages ,  adopt new tools without starting over.&lt;/p&gt;
&lt;p&gt;The question is never &amp;quot;How do I do this in Tool X?&amp;quot; The question is &amp;quot;What problem am I solving, and what properties does the solution need to have?&amp;quot; Once you answer that, the tool choice becomes a constraint-matching exercise.&lt;/p&gt;
&lt;h2&gt;The Five Questions Framework&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/01/five-questions-framework.png&quot; alt=&quot;Five-question framework: Sources, Destinations, Transformations, Failure Modes, Monitoring&quot;&gt;&lt;/p&gt;
&lt;p&gt;Before designing any pipeline, answer five questions:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. What data exists?&lt;/strong&gt; Identify every source: databases, APIs, event streams, files. Note the format (JSON, CSV, Parquet, Avro), volume (rows per day), freshness (real-time, hourly, daily), and reliability (does this source go down?).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Where does it need to go?&lt;/strong&gt; Identify every consumer: dashboards, ML models, downstream systems, analysts. Note what format they need, how fresh the data must be, and what SLAs they expect.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. What transformations are needed?&lt;/strong&gt; Map the gap between source shape and consumer shape. This includes cleaning (nulls, duplicates, encoding), enriching (joining lookup data), and aggregating (daily summaries, running totals).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4. What can go wrong?&lt;/strong&gt; List failure modes: late data, schema changes in the source, duplicate events, null values in required fields, API rate limits, network partitions, out-of-order events. For each failure mode, define the expected behavior : skip, retry, alert, or quarantine.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5. How will you know if it&apos;s working?&lt;/strong&gt; Define observability: row counts in vs. row counts out, freshness checks, schema validation, anomaly detection. If you can&apos;t answer this question before building the pipeline, you&apos;ll be debugging in production.&lt;/p&gt;
&lt;h2&gt;Think in Systems, Not Scripts&lt;/h2&gt;
&lt;p&gt;A script processes data from A to B. A system handles what happens when A is late, B is down, the data shape changes, the volume doubles, and the on-call engineer needs to understand what happened at 3 AM.&lt;/p&gt;
&lt;p&gt;Thinking in systems means:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Composability.&lt;/strong&gt; Break pipelines into discrete stages that can be developed, tested, and monitored independently. An ingestion stage should not also handle transformation and loading. When a stage fails, you restart that stage, not the entire pipeline.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Contracts.&lt;/strong&gt; Define what each stage produces: column names, data types, value ranges, freshness guarantees. When a producer changes its output, the contract violation is caught immediately : not three stages downstream when a dashboard shows wrong numbers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;State management.&lt;/strong&gt; Track what has been processed. Know where to resume after a failure. Avoid reprocessing data unnecessarily by maintaining checkpoints, watermarks, or change data capture (CDC) positions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Isolation.&lt;/strong&gt; One failing pipeline should not take down others. Shared resources (connection pools, compute clusters, storage) need limits per-pipeline to prevent noisy-neighbor problems.&lt;/p&gt;
&lt;h2&gt;Design for Failure First&lt;/h2&gt;
&lt;p&gt;The default assumption should be: every component will fail. Networks drop. APIs return errors. Source schemas change without warning. Storage fills up. The pipeline that handles none of these cases works in development and breaks in production.&lt;/p&gt;
&lt;p&gt;Practical failure-first design:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Retry with backoff.&lt;/strong&gt; Transient errors (network timeouts, API rate limits) often resolve themselves. Retry with exponential backoff before alerting.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dead-letter queues.&lt;/strong&gt; Records that can&apos;t be processed (malformed, unexpected schema) go to a separate queue for inspection : not dropped silently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Idempotent writes.&lt;/strong&gt; Running a pipeline job twice should produce the same end-state. Use upserts, deduplication, or transaction-based writes instead of blind appends.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Circuit breakers.&lt;/strong&gt; If a downstream system is unresponsive, stop sending data after N failures instead of filling up buffers and crashing.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Anti-Patterns That Signal Inexperience&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Choosing the tool before understanding the problem.&lt;/strong&gt; &amp;quot;We should use Kafka&amp;quot; is not a good starting point. &amp;quot;We need sub-second event delivery with at-least-once guarantees&amp;quot; is. The tool choice follows from the requirements, not the other way around.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Monolithic pipelines.&lt;/strong&gt; One script that reads from a database, cleans data, joins three tables, aggregates, and writes to a warehouse. When any step fails, the entire pipeline fails. When any step needs a change, the entire pipeline needs retesting.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No error handling.&lt;/strong&gt; &lt;code&gt;try: process() except: pass&lt;/code&gt; is not error handling. Every expected failure mode should have an explicit response: retry, skip and log, alert, or halt.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No monitoring.&lt;/strong&gt; If the only way you learn about a pipeline failure is when an analyst asks &amp;quot;why is the dashboard empty?&amp;quot;, your observability is broken.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/01/anti-patterns.png&quot; alt=&quot;Anti-patterns: monolithic pipeline, no monitoring, tool-first thinking&quot;&gt;&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Pick your most critical pipeline. Walk through the Five Questions Framework. Can you answer all five clearly and completely? If not, the gaps are your immediate priorities. Write down the answers, share them with your team, and use them as the specification for your next refactor.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>What Is Data Modeling? A Complete Guide</title><link>https://iceberglakehouse.com/posts/2026-02-dm-what-is-data-modeling/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-dm-what-is-data-modeling/</guid><description>
![Data entities connected by relationship lines forming a structured data model](/assets/images/data_modeling/01/data-modeling-overview.png)

Every d...</description><pubDate>Wed, 18 Feb 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/01/data-modeling-overview.png&quot; alt=&quot;Data entities connected by relationship lines forming a structured data model&quot;&gt;&lt;/p&gt;
&lt;p&gt;Every database, data warehouse, and data lakehouse starts with the same question: how should this data be organized? Data modeling answers that question by creating a structured blueprint of your data : what it contains, how it relates, and what it means.&lt;/p&gt;
&lt;p&gt;A data model is not a diagram you draw once and forget. It&apos;s a living definition of your business logic, encoded in the structure of your tables, columns, and relationships. Get it right, and every downstream consumer :  dashboards, reports, AI agents, applications ,  works from the same shared understanding. Get it wrong, and you spend months untangling conflicting definitions of &amp;quot;customer,&amp;quot; &amp;quot;revenue,&amp;quot; and &amp;quot;active user.&amp;quot;&lt;/p&gt;
&lt;h2&gt;What Data Modeling Actually Means&lt;/h2&gt;
&lt;p&gt;Data modeling is the process of defining entities, attributes, and relationships for a dataset. Entities represent real-world objects or concepts (Customers, Orders, Products). Attributes describe those entities (customer name, order date, product price). Relationships define how entities connect (a customer &lt;em&gt;places&lt;/em&gt; an order, an order &lt;em&gt;contains&lt;/em&gt; products).&lt;/p&gt;
&lt;p&gt;The goal is to create a representation precise enough that a database can store the data reliably, and clear enough that a human :  or an AI agent ,  can understand what the data means.&lt;/p&gt;
&lt;p&gt;Think of it as an architectural blueprint. You wouldn&apos;t build a house without one, and you shouldn&apos;t build a data platform without a data model.&lt;/p&gt;
&lt;h2&gt;The Three Levels of Data Modeling&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/01/three-levels-data-model.png&quot; alt=&quot;Conceptual, logical, and physical data models as three layers of increasing detail&quot;&gt;&lt;/p&gt;
&lt;p&gt;Data models operate at three levels of abstraction, each serving a different audience:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Audience&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Contains&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Conceptual&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Business stakeholders&lt;/td&gt;
&lt;td&gt;Define &lt;em&gt;what&lt;/em&gt; data is needed&lt;/td&gt;
&lt;td&gt;Entities, relationships, business rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Logical&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data architects&lt;/td&gt;
&lt;td&gt;Define &lt;em&gt;how&lt;/em&gt; data is structured&lt;/td&gt;
&lt;td&gt;Attributes, data types, normalization rules, keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Physical&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Database engineers&lt;/td&gt;
&lt;td&gt;Define &lt;em&gt;where and how&lt;/em&gt; data is stored&lt;/td&gt;
&lt;td&gt;Tables, columns, indexes, partitions, constraints&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Conceptual models&lt;/strong&gt; capture business requirements without technical details. A conceptual model might say &amp;quot;Customers place Orders, and Orders contain Products.&amp;quot; It doesn&apos;t specify column types or index strategies. Its job is to align business stakeholders and technical teams on what data the system needs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Logical models&lt;/strong&gt; add precision. They define attributes (customer_id, customer_name, email), assign data types (INTEGER, VARCHAR, TIMESTAMP), and specify normalization rules. A logical model is independent of any specific database engine : it works whether you implement it in PostgreSQL, Snowflake, or Apache Iceberg.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Physical models&lt;/strong&gt; are implementation-specific. They define table names, column types for a specific DBMS, primary and foreign keys, indexes for query performance, and partitioning strategies. This is where theoretical design meets operational reality : storage formats, compression codecs, and file organization all matter here.&lt;/p&gt;
&lt;h2&gt;Common Data Modeling Techniques&lt;/h2&gt;
&lt;p&gt;Several techniques exist for organizing data. Each fits different use cases:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Entity-Relationship (ER) Modeling&lt;/strong&gt; is the most widely used technique for transactional systems. It maps entities, attributes, and their relationships using formal diagrams. Most OLTP databases :  the systems that power applications ,  start with an ER model.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dimensional Modeling&lt;/strong&gt; organizes data into facts (measurable events like sales transactions) and dimensions (context like date, product, and customer). Star schemas and snowflake schemas are the two primary patterns. This technique dominates data warehousing and analytics.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data Vault Modeling&lt;/strong&gt; separates data into Hubs (business keys), Links (relationships), and Satellites (descriptive attributes with history). It&apos;s designed for environments where sources change frequently and full audit history matters.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Graph Modeling&lt;/strong&gt; represents data as nodes (entities) and edges (relationships). It&apos;s useful when the relationships between data points are as important as the data itself : social networks, recommendation engines, fraud detection.&lt;/p&gt;
&lt;h2&gt;Why Data Modeling Matters More Than Ever&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/01/data-model-downstream.png&quot; alt=&quot;Data model feeding into dashboards, AI agents, and governance systems&quot;&gt;&lt;/p&gt;
&lt;p&gt;Three trends have made data modeling more critical, not less:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AI needs structure to be accurate.&lt;/strong&gt; When an AI agent generates SQL, it relies on well-defined tables, clear column names, and documented relationships. A poorly modeled dataset forces the agent to guess which table contains &amp;quot;revenue&amp;quot; and which join path connects &amp;quot;customers&amp;quot; to &amp;quot;orders.&amp;quot; Those guesses create hallucinated queries that return wrong numbers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Self-service analytics depends on understandable data.&lt;/strong&gt; Business users exploring data in a BI tool can only self-serve if the data model is intuitive. When tables are named &lt;code&gt;stg_src_cust_v2_final&lt;/code&gt; with columns like &lt;code&gt;c1&lt;/code&gt;, &lt;code&gt;c2&lt;/code&gt;, &lt;code&gt;c3&lt;/code&gt;, even experienced analysts give up and file a ticket instead.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Compliance requires traceable definitions.&lt;/strong&gt; Regulations like GDPR and CCPA demand that organizations know what personal data they store, where it flows, and who can access it. A well-documented data model provides that traceability. Without one, compliance audits turn into archaeology projects.&lt;/p&gt;
&lt;p&gt;Platforms like &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt; address this by letting you implement data models as virtual datasets (SQL views) organized in a Medallion Architecture : Bronze for raw data preparation, Silver for business logic and joins, Gold for application-specific outputs. The model exists as a logical layer without requiring physical data copies, and Wikis, Labels, and Fine-Grained Access Control add documentation and governance directly to the model.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Pick your five most-queried tables. For each one, answer three questions: What does each column mean? How does this table connect to other tables? Who is allowed to see which rows? If you can&apos;t answer all three confidently, your data model has gaps.&lt;/p&gt;
&lt;p&gt;Filling those gaps means defining clear entities, documenting attributes, and specifying relationships : the core of data modeling.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>What Is a Semantic Layer? A Complete Guide</title><link>https://iceberglakehouse.com/posts/2026-02-sl-what-is-a-semantic-layer/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-sl-what-is-a-semantic-layer/</guid><description>
![Semantic layer concept : translating raw data into business terms](/assets/images/semantic_layer/01/semantic-layer-concept.png)

Ask three teams in...</description><pubDate>Wed, 18 Feb 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/01/semantic-layer-concept.png&quot; alt=&quot;Semantic layer concept : translating raw data into business terms&quot;&gt;&lt;/p&gt;
&lt;p&gt;Ask three teams in your company how they calculate &amp;quot;revenue&amp;quot; and you&apos;ll get three answers. Sales counts bookings. Finance counts recognized revenue. Marketing counts pipeline value. All three call it &amp;quot;revenue.&amp;quot; All three get different numbers. Nobody knows which one is right.&lt;/p&gt;
&lt;p&gt;This is the problem a semantic layer solves.&lt;/p&gt;
&lt;h2&gt;What a Semantic Layer Actually Is&lt;/h2&gt;
&lt;p&gt;A semantic layer is a logical abstraction between your raw data and the people (or AI agents) querying it. It maps technical database objects :  tables, columns, join paths ,  to business-friendly terms like &amp;quot;Revenue,&amp;quot; &amp;quot;Active Customer,&amp;quot; or &amp;quot;Churn Rate.&amp;quot;&lt;/p&gt;
&lt;p&gt;It&apos;s not a database. It doesn&apos;t store data. It&apos;s a layer of definitions, calculations, and context that ensures every query against your data produces consistent results, regardless of which tool or person runs it.&lt;/p&gt;
&lt;p&gt;The concept isn&apos;t new. Business Objects introduced &amp;quot;universes&amp;quot; in the 1990s : metadata models that let users drag and drop business concepts instead of writing SQL. What&apos;s changed is scope. Modern semantic layers are universal (not tied to one BI tool), AI-aware (they provide context to language models), and governance-integrated (they enforce access policies alongside definitions).&lt;/p&gt;
&lt;h2&gt;What a Semantic Layer Contains&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/01/sl-components.png&quot; alt=&quot;Five key components of a semantic layer connected to a central hub&quot;&gt;&lt;/p&gt;
&lt;p&gt;A complete semantic layer includes six components:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Virtual datasets (Views)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SQL-defined business logic applied once and reused everywhere&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metric definitions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Canonical calculations for KPIs (e.g., MRR = SUM of active subscription revenue)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Documentation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Human- and machine-readable descriptions of tables, columns, and relationships&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Labels and tags&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Categorization for governance (PII, Finance) and discovery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Join relationships&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pre-defined join paths so users don&apos;t need to know foreign keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Access policies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Row-level security and column masking enforced at the layer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The key insight: these components serve both human analysts and AI agents. When an AI generates SQL from a natural language question, it consults this same layer to understand what &amp;quot;revenue&amp;quot; means, which tables to join, and which columns to filter.&lt;/p&gt;
&lt;h2&gt;How It Works in Practice&lt;/h2&gt;
&lt;p&gt;Here&apos;s what happens when someone queries data through a semantic layer:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A user (or AI agent) asks: &amp;quot;What was revenue by region last quarter?&amp;quot;&lt;/li&gt;
&lt;li&gt;The semantic layer translates:
&lt;ul&gt;
&lt;li&gt;&amp;quot;Revenue&amp;quot; → &lt;code&gt;SUM(orders.total) WHERE orders.status = &apos;completed&apos;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&amp;quot;Region&amp;quot; → &lt;code&gt;customers.region&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&amp;quot;Last quarter&amp;quot; → &lt;code&gt;WHERE order_date BETWEEN &apos;2025-10-01&apos; AND &apos;2025-12-31&apos;&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The query engine generates optimized SQL against the underlying data sources&lt;/li&gt;
&lt;li&gt;Results are returned using business terms, not raw column names&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The user never writes SQL. The AI never guesses at column names. The metric definition is applied identically whether the query runs in a dashboard, a Python notebook, or a chat interface.&lt;/p&gt;
&lt;h2&gt;Why It Matters Now More Than Ever&lt;/h2&gt;
&lt;p&gt;Three trends are making semantic layers essential, not optional.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AI agents need business context.&lt;/strong&gt; Large language models generating SQL will hallucinate column names, use incorrect aggregation logic, and join tables wrong unless they have explicit definitions to work from. A semantic layer provides that grounding. This is why platforms like &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio embed a semantic layer directly into the query engine&lt;/a&gt; : it&apos;s the context that makes the AI accurate instead of confidently wrong.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Self-service analytics demands accessibility.&lt;/strong&gt; Business users want to query data without filing a ticket. Exposing raw database schemas to non-technical users creates more problems than it solves. A semantic layer presents data in terms people already understand.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Governance requires centralized definitions.&lt;/strong&gt; GDPR, CCPA, and industry regulations require organizations to know what data they have, who can access it, and how it&apos;s used. A semantic layer centralizes these definitions and enforces access policies in one place instead of across dozens of tools.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/01/without-vs-with.png&quot; alt=&quot;Without vs. with a semantic layer : from metric chaos to alignment&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Common Misconceptions&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;It&apos;s just a data catalog.&amp;quot;&lt;/strong&gt; A data catalog is an inventory : it tells you what data exists. A semantic layer defines what data &lt;em&gt;means&lt;/em&gt; and how to calculate it. You need both. They&apos;re complementary, not interchangeable. (See: Semantic Layer vs. Data Catalog)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;It&apos;s just a BI tool feature.&amp;quot;&lt;/strong&gt; Some BI tools include semantic models (Looker&apos;s LookML, Power BI&apos;s datasets). But these are tool-specific. If your organization uses three BI tools, you maintain three separate semantic models. A universal semantic layer defines metrics once and serves them to every tool.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;It adds a performance penalty.&amp;quot;&lt;/strong&gt; Modern semantic layers don&apos;t just translate queries : they optimize them. Dremio, for example, uses &lt;a href=&quot;https://www.dremio.com/blog/5-ways-dremio-reflections-outsmart-traditional-materialized-views/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Reflections&lt;/a&gt; (pre-computed, physically optimized data copies) to accelerate queries that pass through its semantic layer. The result is often faster than querying raw tables directly.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Pick your organization&apos;s five most important metrics. Ask two different teams how each one is calculated. If the answers don&apos;t match, that&apos;s your signal. You don&apos;t have a semantic layer problem : you have a trust problem, and a semantic layer is how you fix it.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A 2026 Introduction to Apache Iceberg</title><link>https://iceberglakehouse.com/posts/2026-02-intro-to-Apache-Iceberg/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-intro-to-Apache-Iceberg/</guid><description>
Apache Iceberg is an open-source table format for large analytic datasets. It defines how data files stored on object storage (S3, ADLS, GCS) are org...</description><pubDate>Fri, 13 Feb 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Apache Iceberg is an open-source table format for large analytic datasets. It defines how data files stored on object storage (S3, ADLS, GCS) are organized into a logical table with a schema, partition layout, and consistent point-in-time snapshots. If you&apos;ve heard the term &amp;quot;data lakehouse,&amp;quot; Iceberg is the layer that makes it possible by bringing warehouse-grade reliability to data lake storage.&lt;/p&gt;
&lt;p&gt;This post covers what Iceberg is, how its metadata works under the hood, what changed across specification versions 1 through 3, what&apos;s being proposed for v4, and how to get started using Iceberg tables with &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio&lt;/a&gt; in about ten minutes.&lt;/p&gt;
&lt;h2&gt;Where Iceberg Came From&lt;/h2&gt;
&lt;p&gt;Before Iceberg, most data lake tables used the Hive table format. Hive tracks data by directory paths: one directory per partition, with files inside. That works fine for small tables, but it breaks down at scale. Listing files across thousands of partition directories takes minutes. Schema changes require careful coordination. There&apos;s no isolation between readers and writers, so concurrent queries can return inconsistent results.&lt;/p&gt;
&lt;p&gt;Netflix hit all of these problems in production around 2017. Ryan Blue and Dan Weeks designed Iceberg to solve them by tracking individual files instead of directories, using file-level metadata instead of a central metastore, and requiring atomic commits for every change. Netflix open-sourced the project, and it entered the Apache Incubator in 2018. By May 2020, Iceberg graduated to an Apache Top-Level Project. Today it&apos;s the de facto open table format, adopted by AWS, Google, Snowflake, Databricks, Dremio, Cloudera, and dozens of other vendors.&lt;/p&gt;
&lt;h2&gt;How Iceberg&apos;s Metadata Works&lt;/h2&gt;
&lt;p&gt;Iceberg replaces directory listings with a tree of metadata files. Each layer in the tree stores progressively finer details about the table&apos;s contents.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nw96gvuaqp4f6gi5leks.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Catalog Pointer:&lt;/strong&gt; The catalog (Polaris, Glue, Nessie, or any REST catalog implementation) stores a single pointer to the current metadata file. This is the entry point.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Metadata File (JSON):&lt;/strong&gt; Contains the current schema, partition specs, sort orders, snapshot list, and table properties. Every write creates a new metadata file and atomically swaps the catalog pointer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Manifest List (Avro):&lt;/strong&gt; One per snapshot. Lists all manifest files belonging to that snapshot, along with partition-level summary stats. Query engines use these stats to skip entire manifests that can&apos;t match a query&apos;s filter.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Manifest Files (Avro):&lt;/strong&gt; Each manifest tracks a set of data files and stores per-file statistics: file path, partition tuple, record count, and column-level min, max, and null counts. These stats enable file-level pruning during scan planning.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data Files (Parquet/ORC/Avro):&lt;/strong&gt; The actual rows, stored in columnar format. Iceberg itself is format-agnostic, though Parquet is the most common choice.&lt;/p&gt;
&lt;p&gt;This structure means scan planning is O(1) in metadata lookups rather than O(n) in partition directories. That&apos;s the core architectural advantage.&lt;/p&gt;
&lt;h2&gt;Spec Versions: V1 Through V3 (and V4 Proposals)&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/drqplx888rtwbfyo9czk.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Version 1: Analytic Tables (2017–2020)&lt;/h3&gt;
&lt;p&gt;V1 established the fundamentals: immutable data files, snapshot-based tracking, manifest-level file stats, hidden partitioning, and schema evolution via unique column IDs. Operations were limited to appends and full-partition overwrites.&lt;/p&gt;
&lt;h3&gt;Version 2: Row-Level Deletes (~2022)&lt;/h3&gt;
&lt;p&gt;V2 added delete files that encode which rows to remove from existing data files. Position delete files list specific (file, row-number) pairs. Equality delete files specify column values that identify deleted rows. This made UPDATE, DELETE, and MERGE possible without rewriting entire data files. V2 also introduced sequence numbers for ordering concurrent writes and resolving commit conflicts through optimistic concurrency.&lt;/p&gt;
&lt;h3&gt;Version 3: Extended Capabilities (May 2025)&lt;/h3&gt;
&lt;p&gt;V3 brought several major additions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Deletion Vectors:&lt;/strong&gt; Binary bitmaps stored in Puffin files that replace position deletes. More compact in storage and faster to apply during reads. At most one deletion vector per data file.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row Lineage:&lt;/strong&gt; Per-snapshot tracking of row-level identity (&lt;code&gt;first-row-id&lt;/code&gt;, &lt;code&gt;added-rows&lt;/code&gt;). This enables efficient change data capture (CDC) pipelines directly on Iceberg tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;New Data Types:&lt;/strong&gt; &lt;code&gt;variant&lt;/code&gt; for semi-structured data, &lt;code&gt;geometry&lt;/code&gt; and &lt;code&gt;geography&lt;/code&gt; for geospatial workloads, and nanosecond-precision timestamps (&lt;code&gt;timestamp_ns&lt;/code&gt;, &lt;code&gt;timestamptz_ns&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Default Values:&lt;/strong&gt; Columns can specify &lt;code&gt;write-default&lt;/code&gt; and &lt;code&gt;initial-default&lt;/code&gt; values, making schema evolution smoother.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-Argument Transforms:&lt;/strong&gt; Partition and sort transforms can accept multiple input columns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Table Encryption Keys:&lt;/strong&gt; Built-in support for encrypting data at rest.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Version 4: Active Proposals (2025–2026)&lt;/h3&gt;
&lt;p&gt;The community is actively discussing several changes for a future v4 spec:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Single-file commits&lt;/strong&gt; would consolidate all metadata changes into one file per commit, reducing I/O overhead for high-write workloads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parquet for metadata&lt;/strong&gt; would replace Avro-encoded metadata files with Parquet, enabling columnar reads of metadata (only load the fields you need).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Relative path support&lt;/strong&gt; would store file references relative to the table root, simplifying table migration and replication without metadata rewrites.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved column statistics&lt;/strong&gt; would add more granular stats for better query planning and change detection.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Key Features Worth Knowing&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ACID Transactions:&lt;/strong&gt; Every commit is atomic with serializable isolation. Readers never see partial writes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema Evolution:&lt;/strong&gt; Add, drop, rename, or reorder columns safely. Iceberg uses unique field IDs, so renaming a column doesn&apos;t break older data files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition Evolution:&lt;/strong&gt; Change your partitioning strategy without rewriting existing data. Old and new partition layouts coexist. Queries filter on data values, not partition columns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hidden Partitioning:&lt;/strong&gt; Users query raw values (&lt;code&gt;WHERE order_date = &apos;2025-06-15&apos;&lt;/code&gt;). Iceberg applies transforms (&lt;code&gt;month&lt;/code&gt;, &lt;code&gt;day&lt;/code&gt;, &lt;code&gt;bucket&lt;/code&gt;, &lt;code&gt;truncate&lt;/code&gt;) automatically. No synthetic partition columns in the schema.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time Travel:&lt;/strong&gt; Query any previous snapshot by ID or timestamp. Roll back to a known-good state in one command.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Branching and Tagging:&lt;/strong&gt; Named references to specific snapshots, useful for write-audit-publish workflows and staging environments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-Engine Access:&lt;/strong&gt; The same Iceberg table is readable and writable from Spark, Flink, Trino, Dremio, DuckDB, Snowflake, BigQuery, Presto, and others.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;The Value of the REST Catalog Spec&lt;/h2&gt;
&lt;p&gt;Iceberg&apos;s REST Catalog Specification defines an HTTP API for table management. Any engine that speaks HTTP can create, list, read, and commit to Iceberg tables without importing a Java SDK. That&apos;s significant because it makes catalog access language-agnostic (Python, Rust, Go, JavaScript) and cloud-agnostic (AWS, GCP, Azure). It also enables server-side features like credential vending (short-lived storage tokens per request), commit deconfliction, and multi-table transactions.&lt;/p&gt;
&lt;p&gt;Several projects implement the REST Catalog spec: &lt;a href=&quot;https://polaris.apache.org/&quot;&gt;Apache Polaris&lt;/a&gt;, Project Nessie, Unity Catalog, AWS Glue (via adapter), and Snowflake Open Catalog. This means you can pick a catalog implementation without locking in your query engines. Every engine points at the same REST endpoint.&lt;/p&gt;
&lt;h2&gt;Getting Started: Apache Iceberg on Dremio&lt;/h2&gt;
&lt;p&gt;You can get hands-on with Iceberg tables right now using Dremio Cloud. Here&apos;s the quick path:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Sign up at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/strong&gt; You&apos;ll get a free 30-day trial. Dremio creates a lakehouse project and an Open Catalog (powered by Apache Polaris) automatically at signup.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Create an Iceberg table and insert data:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE FOLDER IF NOT EXISTS db;
CREATE FOLDER IF NOT EXISTS db.schema;

CREATE TABLE db.schema.sales (
  order_id INT,
  customer_name VARCHAR,
  product VARCHAR,
  quantity INT,
  order_date DATE,
  total_amount DECIMAL(10,2)
) PARTITION BY (MONTH(order_date));

INSERT INTO db.schema.sales VALUES
  (1, &apos;Alice Chen&apos;, &apos;Widget A&apos;, 10, &apos;2025-01-15&apos;, 150.00),
  (2, &apos;Bob Smith&apos;, &apos;Widget B&apos;, 5, &apos;2025-01-20&apos;, 75.00),
  (3, &apos;Carol Davis&apos;, &apos;Widget A&apos;, 8, &apos;2025-02-10&apos;, 120.00);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice the &lt;code&gt;PARTITION BY (MONTH(order_date))&lt;/code&gt;. That&apos;s hidden partitioning in action. You query &lt;code&gt;order_date&lt;/code&gt; directly; Iceberg handles the partitioning.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Query the metadata tables.&lt;/strong&gt; Dremio exposes Iceberg&apos;s metadata through &lt;code&gt;TABLE()&lt;/code&gt; functions. These let you inspect the internal state of your table without touching the raw metadata files:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- View all snapshots (who committed what and when)
SELECT * FROM TABLE(table_snapshot(&apos;db.schema.sales&apos;));

-- View commit history
SELECT * FROM TABLE(table_history(&apos;db.schema.sales&apos;));

-- View manifest file details
SELECT * FROM TABLE(table_manifests(&apos;db.schema.sales&apos;));

-- View partition statistics
SELECT * FROM TABLE(table_partitions(&apos;db.schema.sales&apos;));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;table_snapshot&lt;/code&gt; query shows each snapshot ID, timestamp, and the operation that created it (append, overwrite, delete). The &lt;code&gt;table_manifests&lt;/code&gt; query reveals how many data files and delete files exist in each manifest. Run these after each INSERT or DELETE to see how Iceberg tracks changes internally.&lt;/p&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;This post covers the essentials, but Iceberg&apos;s spec and ecosystem run deep. If you want the full picture, three books cover the subject end to end:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html&quot;&gt;&lt;strong&gt;Apache Iceberg: The Definitive Guide&lt;/strong&gt;&lt;/a&gt; (O&apos;Reilly) by Tomer Shiran, Jason Hughes, and Alex Merced. Free download from Dremio.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-the-definitive-guide-reg.html&quot;&gt;&lt;strong&gt;Apache Polaris: The Definitive Guide&lt;/strong&gt;&lt;/a&gt; (O&apos;Reilly) by Alex Merced, Andrew Madson, and Tomer Shiran. Free download from Dremio.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.manning.com/books/architecting-an-apache-iceberg-lakehouse&quot;&gt;&lt;strong&gt;Architecting an Apache Iceberg Lakehouse&lt;/strong&gt;&lt;/a&gt; (Manning) by Alex Merced. A hands-on guide to designing modular lakehouse architectures with Spark, Flink, Dremio, and Polaris.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Between these three resources and a free Dremio Cloud trial, you&apos;ll have everything you need to build on Apache Iceberg in production.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://developer.dremio.com&quot;&gt;Dremio Developer Community&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Join the Dremio Developer Community Slack Community to learn more about Apache Iceberg, Data Lakehouses and Agentic Analytics.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>RAG Isn’t a Modeling Problem. It’s a Data Engineering Problem.</title><link>https://iceberglakehouse.com/posts/2026-01-rag-isnt-the-problem/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-01-rag-isnt-the-problem/</guid><description>
**Get Data Lakehouse Books:**

- [Apache Iceberg: The Definitive Guide](https://drmevn.fyi/tableformatblog)
- [Apache Polaris: The Defintive Guide](h...</description><pubDate>Tue, 20 Jan 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Defintive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.puppygraph.com/ebooks/apache-iceberg-digest-vol-1&quot;&gt;The Apache Iceberg Digest: Vol. 1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Community:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Roll&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://osscommunity.com&quot;&gt;OSS Community Listings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://developer.dremio.com&quot;&gt;Dremio Lakehouse Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Retrieval-augmented generation looks deceptively simple.&lt;br&gt;
Embed documents.&lt;br&gt;
Store vectors.&lt;br&gt;
Retrieve context.&lt;br&gt;
Ask an LLM to answer questions.&lt;/p&gt;
&lt;p&gt;Early demos reinforce this illusion. A small corpus. Clean documents. Few users. Results look impressive. Many teams conclude that success depends on choosing the right model or the best vector database.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/bchA2fV.png&quot; alt=&quot;Rag is not so easy&quot;&gt;&lt;/p&gt;
&lt;p&gt;That assumption breaks down fast.&lt;/p&gt;
&lt;p&gt;Once RAG systems move into real enterprise environments, progress stalls. Accuracy plateaus. Latency spikes. Answers lose trust. Security teams raise alarms. Engineering teams realize the bottleneck is not the model.&lt;/p&gt;
&lt;p&gt;It is the data.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/5QA6zcM.png&quot; alt=&quot;Bottlenecks&quot;&gt;&lt;/p&gt;
&lt;p&gt;Most organizations do not suffer from a lack of embeddings. They suffer from fragmented data, unclear definitions, inconsistent permissions, and legacy systems never designed for AI access. RAG exposes these weaknesses immediately. It does not hide them.&lt;/p&gt;
&lt;p&gt;This is why RAG is turning into a data engineering problem first, and a modeling problem second.&lt;/p&gt;
&lt;h2&gt;Where RAG Systems Actually Break Down&lt;/h2&gt;
&lt;p&gt;Enterprise data is messy by default. It lives across warehouses, lakes, SaaS tools, document systems, and operational databases. Each source uses different schemas, naming conventions, and access rules. RAG systems must unify all of it before retrieval even begins.&lt;/p&gt;
&lt;p&gt;Data quality issues amplify the problem. Duplicate documents inflate embeddings. Stale records surface outdated answers. Inconsistent metadata makes relevance scoring unreliable. The model retrieves content correctly, but the content itself is wrong.&lt;/p&gt;
&lt;p&gt;Governance is the most underestimated failure point. Many RAG pipelines ignore permissions or apply them too late. This creates two bad outcomes. Either the system leaks sensitive data, or engineers restrict access so aggressively that answers become incomplete. Both outcomes erode trust.&lt;/p&gt;
&lt;p&gt;Semantic ambiguity adds another layer of friction. Business terms rarely mean one thing. “Revenue,” “active customer,” or “churn” vary by team and context. Vector similarity cannot resolve these differences. Without shared definitions, RAG systems retrieve text, not meaning.&lt;/p&gt;
&lt;p&gt;These failures have nothing to do with LLM quality. They stem from weak data foundations.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/QCug4pf.png&quot; alt=&quot;The Real Data Problem&quot;&gt;&lt;/p&gt;
&lt;p&gt;As a result, teams over-engineer retrieval layers while under-investing in context. They tune indexes. Swap vector databases. Adjust chunk sizes. The core issues remain.&lt;/p&gt;
&lt;p&gt;RAG systems succeed when they start with governed, well-defined, and accessible data. When they do not, no amount of modeling innovation compensates for the gap.&lt;/p&gt;
&lt;h2&gt;Are Vector Databases Over-Engineered for Most Teams?&lt;/h2&gt;
&lt;p&gt;Vector databases became the default RAG component for a simple reason. They solved a real problem early. Fast similarity search over high-dimensional embeddings was hard to do well. Purpose-built systems filled that gap.&lt;/p&gt;
&lt;p&gt;The problem is that the industry quickly treated them as mandatory infrastructure.&lt;/p&gt;
&lt;p&gt;For many enterprise use cases, that assumption does not hold. Most RAG workloads do not start at billion-scale embeddings. They start with thousands or tens of thousands of documents. At that scale, established systems like Postgres with pgvector or search engines with vector support perform well enough.&lt;/p&gt;
&lt;p&gt;These platforms already exist in most organizations. They are governed. They are monitored. They are understood by operations teams. Adding vector search to them is often cheaper and faster than introducing a new system.&lt;/p&gt;
&lt;p&gt;Specialized vector databases still have a role. At large scale, with strict latency requirements and high concurrency, optimized ANN indexes and distributed architectures matter. The tipping point is real. It just arrives later than vendors suggest.&lt;/p&gt;
&lt;p&gt;The mistake is not using vector databases. The mistake is leading with them.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/7y5H7hZ.png&quot; alt=&quot;The mistake is not using vector databases. The mistake is leading with them.&quot;&gt;&lt;/p&gt;
&lt;p&gt;When teams optimize the vector layer first, they ignore higher-impact problems. Data duplication. Permission enforcement. Metadata consistency. Hybrid retrieval logic. These issues dominate cost and complexity long before vector search performance does.&lt;/p&gt;
&lt;h2&gt;Hybrid Search Is the Norm, Not the Exception&lt;/h2&gt;
&lt;p&gt;Vector search alone is rarely sufficient. Keyword search alone is rarely sufficient. Production RAG systems need both.&lt;/p&gt;
&lt;p&gt;Keywords provide precision. Vectors provide semantic recall. Together, they outperform either approach in isolation. This pattern shows up consistently across enterprise deployments.&lt;/p&gt;
&lt;p&gt;Despite advances in embedding models, keyword search is not becoming obsolete. Embeddings still struggle with exact matches, rare identifiers, and domain-specific language. They also struggle when the query intent is narrow and literal.&lt;/p&gt;
&lt;p&gt;As a result, teams maintain two indexes. One lexical. One vector. They fuse results during retrieval or re-ranking. This adds operational cost, but it improves answer quality.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/KpDU6Wu.png&quot; alt=&quot;Hybrid retrieval should be an assumption, not an optimization.&quot;&gt;&lt;/p&gt;
&lt;p&gt;Some hope that better models will eliminate this complexity. That is unlikely in the near term. Language is both semantic and symbolic. Search systems must reflect that reality.&lt;/p&gt;
&lt;p&gt;The practical takeaway is simple. Hybrid retrieval should be an assumption, not an optimization. Architectures that treat vector search as a drop-in replacement for text search fail under real workloads.&lt;/p&gt;
&lt;h2&gt;Latency Changes Every Design Decision&lt;/h2&gt;
&lt;p&gt;Real-time RAG systems operate under tight latency budgets. Users expect responses in seconds, not tens of seconds. Retrieval time competes directly with model inference time.&lt;/p&gt;
&lt;p&gt;To stay within budget, teams make trade-offs. They cache results. Use approximate search. Reduce embedding size. Retrieve fewer documents. Choose smaller or faster models.&lt;/p&gt;
&lt;p&gt;Each choice sacrifices something. Recall. Freshness. Completeness.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jw9h3okkfm764ipq4kwy.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;The best systems compensate by pushing intelligence closer to the data. Precomputed results. Materialized views. Semantic caching. These techniques reduce work at query time and stabilize performance.&lt;/p&gt;
&lt;p&gt;Once again, the bottleneck is not the model. It is the architecture around the data.&lt;/p&gt;
&lt;h2&gt;The Missing Layer: Semantic Context&lt;/h2&gt;
&lt;p&gt;Most RAG architectures treat embeddings as context. That is a mistake.&lt;/p&gt;
&lt;p&gt;Embeddings capture similarity, not meaning. They do not encode business logic, metric definitions, or governance rules. They do not understand which tables represent the same concept, or which fields are authoritative.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4ax0vnqagp3ys56umjkj.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;This is where many systems quietly fail. AI agents retrieve text fragments without understanding how those fragments relate. Answers may be syntactically correct but semantically wrong.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7g3ewdoxn51kz0hcadce.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;A semantic layer changes this dynamic. It provides shared definitions, governed access, and a consistent abstraction over raw data. Instead of retrieving arbitrary documents, AI agents retrieve &lt;em&gt;meaningful concepts&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dlu6sjseno7sy7hxpqhi.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;This reduces ambiguity. It improves trust. It lowers the cognitive load on both users and models.&lt;/p&gt;
&lt;p&gt;More importantly, it shifts RAG from document search to reasoning over data.&lt;/p&gt;
&lt;h2&gt;From RAG Pipelines to Agentic Architectures&lt;/h2&gt;
&lt;p&gt;As systems evolve, retrieval alone is not enough. AI agents need to ask follow-up questions, call tools, execute queries, and reason across steps.&lt;/p&gt;
&lt;p&gt;This requires structured access to data, not just text chunks. It also requires standard interfaces so agents can operate across clients and environments.&lt;/p&gt;
&lt;p&gt;Open protocols like MCP reflect this shift. They decouple AI agents from specific tools and allow shared context to be reused across applications. This moves RAG closer to a platform capability than a one-off pipeline.&lt;/p&gt;
&lt;p&gt;In this world, the value is not in where vectors live. The value is in how context is defined, governed, and exposed.&lt;/p&gt;
&lt;h2&gt;Conclusion: Stop Optimizing the Wrong Layer&lt;/h2&gt;
&lt;p&gt;RAG failures rarely come from weak models. They come from weak data foundations.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hz8ll1791x9z8q6eguw5.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;Enterprises over-invest in vector infrastructure while under-investing in semantics, governance, and architectural coherence. The result is expensive systems that scale poorly and fail to earn trust.&lt;/p&gt;
&lt;p&gt;The most resilient approaches treat RAG as a data platform problem. They start with open storage, shared definitions, hybrid retrieval, and performance optimizations that benefit every workload.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xjcm3d7ov49ltujhaj9o.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;This is where lakehouse-native architectures stand out. Platforms like Dremio focus on unifying data access, enforcing semantics, and accelerating queries across sources without duplication. When AI agents are layered on top of that foundation, retrieval becomes simpler, safer, and faster by default.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ojzh2yl2i5ttb4rgdanp.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;As models continue to improve, data problems will remain. Teams that solve for context, not just embeddings, will be the ones that scale AI beyond demos and into durable systems.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Practical Guide to AI-Assisted Coding Tools</title><link>https://iceberglakehouse.com/posts/2026-01-a-practical-guide-to-ai-assisted-coding-tools/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-01-a-practical-guide-to-ai-assisted-coding-tools/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2026-01-A-Practical-Guide-to-AI-Assisted...</description><pubDate>Thu, 15 Jan 2026 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2026-01-A-Practical-Guide-to-AI-Assisted-Coding/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.puppygraph.com/ebooks/apache-iceberg-digest-vol-1&quot;&gt;The Apache Iceberg Digest: Vol. 1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Community:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Roll&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://osscommunity.com&quot;&gt;OSS Community Listings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://developer.dremio.com&quot;&gt;Dremio Lakehouse Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;AI-assisted coding is no longer a novelty. It is becoming a core part of how software gets built.&lt;/p&gt;
&lt;p&gt;For years, these tools were easy to describe. They were autocomplete engines. They helped you write boilerplate faster and saved a few keystrokes. Useful, but limited.&lt;/p&gt;
&lt;p&gt;That changed quickly.&lt;/p&gt;
&lt;p&gt;Over the last two years, large language models gained larger context windows, stronger reasoning, and the ability to use tools. At the same time, AI assistants moved closer to the developer workflow. They gained access to repositories, terminals, build systems, tests, and browsers. What emerged was not just better autocomplete, but something closer to a collaborator.&lt;/p&gt;
&lt;p&gt;Today, “AI coding tools” covers a wide range of products. Some live in the terminal and act as autonomous agents. Others are AI-native editors built around chat and planning. Many integrate directly into existing IDEs and quietly assist as you type. Each category solves different problems and comes with different tradeoffs.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/phtd3xiud0aue8np37bz.png&quot; alt=&quot;This creates confusion for developers trying to make sense of the space.&quot;&gt;&lt;/p&gt;
&lt;p&gt;This creates confusion for developers trying to make sense of the space. Should you use a CLI agent or an IDE plugin? When does an AI-first editor make sense? How much autonomy is helpful before it becomes risky? And how do pricing, privacy, and workflow fit into the decision?&lt;/p&gt;
&lt;p&gt;This blog is a practical guide to that landscape. We will categorize the major types of AI-assisted coding tools, compare how they work, and explain when each approach makes sense. The goal is not to crown a single “best” tool, but to give you a clear mental model for choosing the right one for your work.&lt;/p&gt;
&lt;h2&gt;The Core Taxonomy of AI Coding Tools&lt;/h2&gt;
&lt;p&gt;Before comparing individual products, it helps to understand how these tools differ at a structural level. Most confusion in this space comes from treating all AI coding tools as the same thing. They are not.&lt;/p&gt;
&lt;p&gt;There are three dimensions that matter most: how you interact with the tool, where it runs, and how much autonomy it has.&lt;/p&gt;
&lt;h3&gt;Interaction Model&lt;/h3&gt;
&lt;p&gt;Some tools are designed to assist while you type. These focus on inline suggestions and small edits. You stay in control at all times, and the AI reacts to your actions.&lt;/p&gt;
&lt;p&gt;Others are chat-driven. You describe what you want in natural language, and the tool responds with explanations, code snippets, or suggested changes. These are useful for learning, debugging, and reasoning about unfamiliar code.&lt;/p&gt;
&lt;p&gt;The newest category is agent-based. These tools accept a goal, break it into steps, and execute those steps across files and tools. They plan, act, and revise, often with minimal input once started.&lt;/p&gt;
&lt;h3&gt;Execution Surface&lt;/h3&gt;
&lt;p&gt;Where a tool lives shapes how powerful it can be.&lt;/p&gt;
&lt;p&gt;Terminal-based tools operate directly on your filesystem and development tools. They can run tests, modify many files, and integrate naturally with scripting and automation workflows.&lt;/p&gt;
&lt;p&gt;IDE-native editors are built around AI as a first-class concept. They blend editing, chat, execution, and preview into a single environment designed for iterative work with an assistant.&lt;/p&gt;
&lt;p&gt;IDE plugins integrate into existing editors. They trade raw power for familiarity and low friction. You get help without changing how you work.&lt;/p&gt;
&lt;p&gt;Browser-based tools prioritize accessibility and collaboration but are usually more constrained in what they can access or modify.&lt;/p&gt;
&lt;h3&gt;Autonomy Spectrum&lt;/h3&gt;
&lt;p&gt;Not all AI tools act independently.&lt;/p&gt;
&lt;p&gt;Some only suggest. You decide what to accept.&lt;/p&gt;
&lt;p&gt;Some perform tasks but wait for confirmation before each step.&lt;/p&gt;
&lt;p&gt;Others operate with high autonomy. They plan multi-step changes, run commands, and verify results before handing control back to you.&lt;/p&gt;
&lt;p&gt;More autonomy can mean more leverage. It also means more responsibility. Understanding where a tool sits on this spectrum is critical for using it safely and effectively.&lt;/p&gt;
&lt;p&gt;With these dimensions in mind, the rest of the landscape becomes much easier to navigate. Each tool is a different point in this design space, optimized for different types of work and different levels of trust.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/03qpl1zwmcggbe62ikac.png&quot; alt=&quot;Each tool is a different point in this design space, optimized for different types of work and different levels of trust.&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Terminal-Based AI Coding Agents&lt;/h2&gt;
&lt;p&gt;Terminal-based AI coding agents are the most powerful and, at times, the most intimidating tools in this space. They live where your code actually runs. That gives them capabilities that IDE plugins cannot match.&lt;/p&gt;
&lt;p&gt;Instead of suggesting code, these tools operate directly on your project. They can read files, modify directories, run tests, execute build commands, and interact with version control. In practice, this means they behave less like autocomplete and more like junior engineers following instructions.&lt;/p&gt;
&lt;h3&gt;Why Terminal Agents Exist&lt;/h3&gt;
&lt;p&gt;The terminal is already the control plane for software development. It is where builds run, tests fail, migrations execute, and deployments start. By placing AI here, these tools gain first-class access to the real workflow rather than a simulated one.&lt;/p&gt;
&lt;p&gt;This makes them well-suited for tasks that span many files or steps. Examples include refactoring large codebases, fixing failing test suites, scaffolding new services, or migrating configurations. These are jobs that are slow and error-prone when done manually.&lt;/p&gt;
&lt;h3&gt;Representative Tools&lt;/h3&gt;
&lt;p&gt;Tools in this category include Claude Code, Gemini CLI, OpenCode, and Qodo CLI. While they differ in implementation, they share common traits.&lt;/p&gt;
&lt;p&gt;They accept high-level goals instead of line-level instructions. They reason about the repository as a whole. They can chain actions together without repeated prompting. Many of them support approval checkpoints so you can review actions before execution.&lt;/p&gt;
&lt;p&gt;Some focus on being general-purpose agents. Others emphasize customization, allowing teams to define their own agents for reviews, testing, or compliance checks.&lt;/p&gt;
&lt;h3&gt;Strengths and Tradeoffs&lt;/h3&gt;
&lt;p&gt;The strength of terminal agents is leverage. A single prompt can replace dozens of manual steps. They are especially effective for backend, infrastructure, and data engineering work, where tasks are procedural and tool-driven.&lt;/p&gt;
&lt;p&gt;The tradeoff is risk. These tools can change many files quickly. They can run commands that alter state. Used carelessly, they can introduce subtle bugs or destructive changes.&lt;/p&gt;
&lt;p&gt;Best practice is to treat terminal agents as powerful automation tools. Keep them scoped. Review diffs. Use version control aggressively. Start with low autonomy and increase it only when trust is earned.&lt;/p&gt;
&lt;p&gt;Terminal-based agents are not for every developer or every task. But when used well, they represent one of the biggest productivity jumps in modern software development.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y678648xd3lxhzb5kq0v.png&quot; alt=&quot;Terminal-based agents are not for every developer or every task.&quot;&gt;&lt;/p&gt;
&lt;h2&gt;AI-Native IDEs and Editors&lt;/h2&gt;
&lt;p&gt;AI-native IDEs are built around the assumption that an assistant is always present. Instead of adding AI as a feature, these tools redesign the editor itself to make planning, execution, and iteration flow through the model.&lt;/p&gt;
&lt;p&gt;This changes how development feels. You do not switch between typing code and asking for help. The conversation and the code evolve together.&lt;/p&gt;
&lt;h3&gt;What Makes an IDE AI-Native&lt;/h3&gt;
&lt;p&gt;In an AI-native IDE, the assistant has persistent awareness of the project. It understands file structure, dependencies, and recent changes without being reminded each time.&lt;/p&gt;
&lt;p&gt;These editors usually combine several capabilities in one place. You can ask the assistant to design a feature, generate code across files, run the application, and inspect the results. Some can open a browser, preview a UI, or analyze logs as part of the same workflow.&lt;/p&gt;
&lt;p&gt;Another defining trait is planning. The assistant often explains what it is going to do before doing it. This makes complex changes easier to reason about and review.&lt;/p&gt;
&lt;h3&gt;Representative Tools&lt;/h3&gt;
&lt;p&gt;Examples in this category include Cursor, Windsurf, Antigravity, and Zed.&lt;/p&gt;
&lt;p&gt;Cursor extends the familiar VS Code experience with deep repository understanding and large-scale refactoring capabilities. Windsurf emphasizes agent-driven workflows that keep developers in flow. Antigravity pushes further into full agent autonomy, allowing models to plan, build, and verify changes using integrated tools. Zed focuses on speed, collaboration, and predictive editing, blending performance with AI assistance.&lt;/p&gt;
&lt;p&gt;While their design philosophies differ, all of them treat AI as a core part of the editing experience rather than an add-on.&lt;/p&gt;
&lt;h3&gt;When an AI-Native IDE Makes Sense&lt;/h3&gt;
&lt;p&gt;These tools shine when you are building features end to end. They work well for rapid prototyping, greenfield projects, and iterative product development.&lt;/p&gt;
&lt;p&gt;They are also a good fit for solo developers or small teams, where context switching is expensive and speed matters more than strict process. For some developers, they can replace multiple tools with a single environment.&lt;/p&gt;
&lt;p&gt;The downside is commitment. Adopting an AI-native IDE often means changing editors or workflows. For teams with established tooling or strict policies, that may be a barrier.&lt;/p&gt;
&lt;p&gt;When the fit is right, though, AI-native IDEs offer a glimpse of what development looks like when the assistant is not a helper, but a constant collaborator.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w9d3kizpil6o489pm73q.png&quot; alt=&quot;When the fit is right, though, AI-native IDEs offer a glimpse of what development looks like when the assistant is not a helper, but a constant collaborator.&quot;&gt;&lt;/p&gt;
&lt;h2&gt;AI Assistants Embedded in Traditional IDEs&lt;/h2&gt;
&lt;p&gt;Not every developer wants to change editors or rethink their workflow. For many teams, the most practical entry point into AI-assisted coding is through tools that integrate directly into existing IDEs.&lt;/p&gt;
&lt;p&gt;These assistants focus on augmentation rather than replacement. They enhance familiar environments with AI capabilities while preserving established habits, shortcuts, and extensions.&lt;/p&gt;
&lt;h3&gt;The Copilot Model&lt;/h3&gt;
&lt;p&gt;This category is defined by inline assistance. The AI observes the code you are writing and offers suggestions in real time. You remain in control, accepting or rejecting changes as you go.&lt;/p&gt;
&lt;p&gt;Most tools in this group also include a chat interface. This allows you to ask questions about your code, request explanations, generate tests, or debug errors without leaving the editor. The interaction is conversational, but the execution remains manual.&lt;/p&gt;
&lt;p&gt;The emphasis is on incremental gains. These tools aim to make each coding session smoother rather than automate entire tasks.&lt;/p&gt;
&lt;h3&gt;Representative Tools&lt;/h3&gt;
&lt;p&gt;GitHub Copilot is the most well-known example. Others include Amazon CodeWhisperer and Amazon Q Developer, JetBrains AI Assistant, Tabnine, and Replit Ghostwriter.&lt;/p&gt;
&lt;p&gt;These tools support a wide range of IDEs such as VS Code, JetBrains products, and browser-based environments. They tend to work across many programming languages and frameworks, making them broadly applicable.&lt;/p&gt;
&lt;p&gt;Some lean toward individual productivity. Others emphasize enterprise features like policy enforcement, auditability, and security scanning.&lt;/p&gt;
&lt;h3&gt;Strengths and Limitations&lt;/h3&gt;
&lt;p&gt;The biggest strength of IDE-embedded assistants is low friction. Developers can adopt them with minimal change and see immediate benefits. They are well suited for day-to-day coding, learning new APIs, and reducing repetitive work.&lt;/p&gt;
&lt;p&gt;Their limitation is scope. They usually do not plan or execute multi-step changes on their own. They lack direct access to the terminal and external tools, which limits their autonomy.&lt;/p&gt;
&lt;p&gt;For many teams, this is a feature, not a flaw. Embedded assistants provide a safe, predictable way to bring AI into the development process without surrendering control.&lt;/p&gt;
&lt;p&gt;They are often the right choice when consistency, governance, and gradual adoption matter more than maximum automation.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kzndqmqrsziyr76gbn78.png&quot; alt=&quot;Embedded assistants provide a safe, predictable way to bring AI into the development process without surrendering control.&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a0176p2xrqupus1utebh.png&quot; alt=&quot;Comparison of Approaches&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Pricing Models and Economic Tradeoffs&lt;/h2&gt;
&lt;p&gt;AI-assisted coding tools vary widely in their pricing. Understanding these models is important, because cost often scales with autonomy, context size, and usage intensity.&lt;/p&gt;
&lt;p&gt;What looks inexpensive at first can become costly at scale. What looks expensive may replace significant engineering time.&lt;/p&gt;
&lt;h3&gt;Common Pricing Patterns&lt;/h3&gt;
&lt;p&gt;One common approach is free or freemium access for individuals. These tiers usually offer limited usage, smaller context windows, or restricted agent capabilities. They are designed to encourage experimentation and personal use.&lt;/p&gt;
&lt;p&gt;Another model is flat monthly subscriptions per developer. This is common for IDE plugins and AI-native editors. In exchange for a predictable cost, you get higher usage limits, access to stronger models, and better performance.&lt;/p&gt;
&lt;p&gt;Agentic tools often introduce credit-based pricing. Each task or action consumes credits based on model usage, context size, and tool execution. This aligns cost with work performed but requires more monitoring.&lt;/p&gt;
&lt;p&gt;Enterprise plans layer on governance features. These include audit logs, centralized billing, access controls, and private deployments. Pricing here reflects not just usage, but risk reduction and compliance.&lt;/p&gt;
&lt;h3&gt;Cost vs Capability Tradeoffs&lt;/h3&gt;
&lt;p&gt;More powerful tools cost more because they do more. Large context windows, multi-file reasoning, and autonomous execution all increase compute usage.&lt;/p&gt;
&lt;p&gt;Autocomplete-focused tools are usually the cheapest. Agent-based systems are the most expensive, especially when used heavily.&lt;/p&gt;
&lt;p&gt;Another factor is model flexibility. Tools that allow you to bring your own API keys shift costs directly to the underlying model provider. This can be cheaper or more expensive depending on how you use them.&lt;/p&gt;
&lt;p&gt;The right question is not “which tool is cheapest,” but “which tool replaces the most manual effort for my work.”&lt;/p&gt;
&lt;h3&gt;Individual vs Team Economics&lt;/h3&gt;
&lt;p&gt;For individuals, free tiers and modest subscriptions often deliver outsized value. Even small time savings justify the cost.&lt;/p&gt;
&lt;p&gt;For teams, the equation changes. A tool that saves minutes per developer per day may justify its cost. One that automates entire workflows may justify much more, but only if guardrails are in place.&lt;/p&gt;
&lt;p&gt;Understanding pricing early helps avoid mismatches between expectations, usage, and budget. AI tools are productivity multipliers, but only when their costs align with how they are used.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g3yd90xw964xkly6ybd1.png&quot; alt=&quot;Understanding pricing early helps avoid mismatches between expectations, usage, and budget.&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Workflow Patterns Enabled by AI Coding Tools&lt;/h2&gt;
&lt;p&gt;The real impact of AI-assisted coding is not in individual features, but in how workflows change. Once these tools are part of daily work, the structure of development itself begins to shift.&lt;/p&gt;
&lt;p&gt;Instead of writing everything by hand, developers increasingly describe intent, review outcomes, and refine results.&lt;/p&gt;
&lt;h3&gt;Common AI-Driven Workflows&lt;/h3&gt;
&lt;p&gt;One of the most common patterns is assisted implementation. Developers sketch function signatures or write descriptive comments, then let the AI fill in the logic. This is especially effective for boilerplate, data transformations, and repetitive patterns.&lt;/p&gt;
&lt;p&gt;Debugging is another strong use case. AI tools can explain error messages, trace logic across files, and suggest fixes based on context. This reduces time spent searching documentation or past issues.&lt;/p&gt;
&lt;p&gt;Test and documentation generation have also become routine. Many teams now generate unit tests, integration tests, and API docs as part of normal development, not as an afterthought.&lt;/p&gt;
&lt;h3&gt;Agentic Workflows&lt;/h3&gt;
&lt;p&gt;Agentic tools enable workflows that were previously impractical.&lt;/p&gt;
&lt;p&gt;A single prompt can scaffold a new service, refactor an entire module, or migrate configurations across environments. The agent plans the steps, applies changes, and verifies results before returning control.&lt;/p&gt;
&lt;p&gt;These workflows work best when tasks are well-scoped and repeatable. Infrastructure changes, dependency upgrades, and large-scale refactors are strong candidates.&lt;/p&gt;
&lt;p&gt;The key is oversight. Developers define the goal and constraints, then review the agent’s output carefully. Agentic workflows reward clarity and discipline.&lt;/p&gt;
&lt;h3&gt;Shifting the Role of the Developer&lt;/h3&gt;
&lt;p&gt;As AI takes on more mechanical work, the developer’s role shifts toward design, review, and decision-making.&lt;/p&gt;
&lt;p&gt;Time moves away from syntax and toward intent. Understanding systems and tradeoffs becomes more valuable than memorizing APIs.&lt;/p&gt;
&lt;p&gt;Teams that adapt their workflows intentionally see the biggest gains. Those that treat AI as a novelty often see uneven results.&lt;/p&gt;
&lt;p&gt;AI does not remove the need for good engineering practices. It amplifies them.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7an09rcgy0wv0fjt6n9q.png&quot; alt=&quot;AI does not remove the need for good engineering practices. It amplifies them.&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Skills Developers Need in the AI Coding Era&lt;/h2&gt;
&lt;p&gt;AI-assisted coding changes what it means to be effective as a developer. The most valuable skills are shifting away from speed of typing and toward clarity of thinking.&lt;/p&gt;
&lt;p&gt;Using these tools well is not about tricks. It is about communication, judgment, and system-level understanding.&lt;/p&gt;
&lt;h3&gt;Prompting as Specification&lt;/h3&gt;
&lt;p&gt;Prompting is best understood as writing specifications in natural language.&lt;/p&gt;
&lt;p&gt;Clear prompts describe intent, constraints, and context. Vague prompts produce vague results. The best outcomes come from treating the AI like a teammate who needs good requirements.&lt;/p&gt;
&lt;p&gt;Effective developers iterate. They refine prompts based on output, correct assumptions, and narrow scope. This feedback loop is fast, but it still requires attention.&lt;/p&gt;
&lt;h3&gt;Review and Verification&lt;/h3&gt;
&lt;p&gt;AI-generated code must be reviewed like any other contribution.&lt;/p&gt;
&lt;p&gt;Developers need to read diffs carefully, understand the logic, and verify behavior with tests. Blind trust leads to subtle bugs and security issues.&lt;/p&gt;
&lt;p&gt;Knowing how to ask the AI to explain its choices is a useful verification technique. If the explanation does not make sense, the code likely does not either.&lt;/p&gt;
&lt;h3&gt;System Thinking and Constraints&lt;/h3&gt;
&lt;p&gt;AI tools are strongest when they understand the system they are working in.&lt;/p&gt;
&lt;p&gt;Developers who can explain architecture, performance constraints, and operational requirements get better results. This includes knowing what not to automate.&lt;/p&gt;
&lt;p&gt;The more autonomy a tool has, the more important boundaries become. Skilled developers define those boundaries clearly.&lt;/p&gt;
&lt;p&gt;In the AI coding era, judgment matters more than ever. The tools move fast. It is the developer’s responsibility to steer them well.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7njfidnjgz1mht7dym4q.png&quot; alt=&quot;The more autonomy a tool has, the more important boundaries become. Skilled developers define those boundaries clearly.&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Security, Privacy, and Governance Considerations&lt;/h2&gt;
&lt;p&gt;As AI coding tools gain access to repositories, terminals, and infrastructure, security and governance move from secondary concerns to first-order design questions.&lt;/p&gt;
&lt;p&gt;The risks are not hypothetical. These tools can read proprietary code, modify critical systems, and generate output that looks correct but is not.&lt;/p&gt;
&lt;h3&gt;Code and Data Exposure&lt;/h3&gt;
&lt;p&gt;Most AI tools rely on remote models. This means code or prompts may leave your local environment.&lt;/p&gt;
&lt;p&gt;Developers and teams must understand what data is sent, how long it is retained, and whether it is used for training. Some tools explicitly guarantee no training on customer code. Others allow opt-outs or require enterprise agreements.&lt;/p&gt;
&lt;p&gt;For sensitive environments, tools that support local models or on-prem deployment reduce exposure. This often comes at the cost of convenience or model quality.&lt;/p&gt;
&lt;h3&gt;Autonomy and Guardrails&lt;/h3&gt;
&lt;p&gt;Agentic tools increase risk by design. They can execute commands, modify configurations, and affect production systems.&lt;/p&gt;
&lt;p&gt;Guardrails are essential. These include confirmation prompts, restricted permissions, read-only modes, and sandboxed environments. Version control is a non-negotiable safety net.&lt;/p&gt;
&lt;p&gt;The goal is not to eliminate autonomy, but to scope it carefully.&lt;/p&gt;
&lt;h3&gt;Organizational Governance&lt;/h3&gt;
&lt;p&gt;For teams, governance features matter as much as raw capability.&lt;/p&gt;
&lt;p&gt;Audit logs, access controls, usage monitoring, and policy enforcement help organizations understand how AI tools are being used. They also help prevent accidental misuse.&lt;/p&gt;
&lt;p&gt;Clear guidelines reduce risk. Teams should define which tools are allowed, what data they can access, and what level of autonomy is acceptable.&lt;/p&gt;
&lt;p&gt;AI-assisted coding can be safe and effective. It requires intentional design, not blind adoption.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5n7d6ntsqu6ybadaqjda.png&quot; alt=&quot;AI-assisted coding can be safe and effective. It requires intentional design, not blind adoption.&quot;&gt;&lt;/p&gt;
&lt;h2&gt;How to Choose the Right Tool for You&lt;/h2&gt;
&lt;p&gt;With so many options, choosing an AI coding tool can feel overwhelming. The key is to match the tool to your role, environment, and tolerance for change.&lt;/p&gt;
&lt;p&gt;There is no universal best choice. There is only what fits your work.&lt;/p&gt;
&lt;h3&gt;By Role&lt;/h3&gt;
&lt;p&gt;Solo developers often benefit from AI-native IDEs or terminal agents. These tools reduce context switching and accelerate end-to-end work. They are well suited for prototyping, side projects, and greenfield development.&lt;/p&gt;
&lt;p&gt;Backend and platform engineers often gain the most from terminal-based agents. These tools align naturally with scripting, automation, and infrastructure tasks.&lt;/p&gt;
&lt;p&gt;Frontend and product-focused developers may prefer AI-native editors or IDE plugins that emphasize iteration, previews, and refactoring.&lt;/p&gt;
&lt;p&gt;Teams working in large codebases often start with IDE-embedded assistants. These tools improve productivity without disrupting existing processes.&lt;/p&gt;
&lt;h3&gt;By Environment&lt;/h3&gt;
&lt;p&gt;Startups and small teams can afford to experiment. Speed and leverage matter more than strict controls, making agentic tools attractive.&lt;/p&gt;
&lt;p&gt;Enterprises prioritize predictability and governance. Tools with clear data policies, audit logs, and controlled autonomy are easier to adopt.&lt;/p&gt;
&lt;p&gt;Highly regulated environments may require on-prem models or strict data isolation. This narrows the field but reduces risk.&lt;/p&gt;
&lt;h3&gt;By Autonomy and Trust&lt;/h3&gt;
&lt;p&gt;If you are new to AI-assisted coding, start with tools that suggest rather than act. Build intuition and confidence before increasing autonomy.&lt;/p&gt;
&lt;p&gt;As trust grows, introduce agents for well-scoped tasks. Avoid full autonomy in critical systems until guardrails are proven.&lt;/p&gt;
&lt;p&gt;The best choice is one that fits your current needs and can evolve with your workflow. AI tools are not static. Your adoption strategy should not be either.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e1ys1lpbs4chio7eyscv.png&quot; alt=&quot;The best choice is one that fits your current needs and can evolve with your workflow.&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The Future of AI-Assisted Coding&lt;/h2&gt;
&lt;p&gt;AI-assisted coding is still early, but the direction is clear. These tools are moving from helpers to participants in the development process.&lt;/p&gt;
&lt;p&gt;The distinction between editor, assistant, and agent is already starting to blur.&lt;/p&gt;
&lt;h3&gt;Convergence of Tools&lt;/h3&gt;
&lt;p&gt;IDE plugins are gaining agentic capabilities. Terminal agents are adding richer interfaces. AI-native IDEs are absorbing features from both.&lt;/p&gt;
&lt;p&gt;Over time, the market will likely converge around flexible systems that can operate at different levels of autonomy depending on context. One tool may act as an autocomplete engine in one moment and an autonomous agent in the next.&lt;/p&gt;
&lt;h3&gt;Interoperability and Protocols&lt;/h3&gt;
&lt;p&gt;As AI tools grow more capable, interoperability becomes essential.&lt;/p&gt;
&lt;p&gt;Standards for tool access, context sharing, and action execution are emerging. These allow models to interact with editors, terminals, and external systems in consistent ways.&lt;/p&gt;
&lt;p&gt;This reduces lock-in and makes it easier to mix tools, models, and workflows.&lt;/p&gt;
&lt;h3&gt;AI as a First-Class Team Member&lt;/h3&gt;
&lt;p&gt;The long-term shift is conceptual.&lt;/p&gt;
&lt;p&gt;AI tools are evolving from passive assistants into collaborators that can plan work, execute tasks, and verify results. This does not remove the need for human developers. It changes where their effort is spent.&lt;/p&gt;
&lt;p&gt;Design, judgment, and accountability remain human responsibilities. Execution increasingly becomes shared.&lt;/p&gt;
&lt;p&gt;The future of software development is not fully automated. It is more leveraged, more intentional, and more collaborative.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3rbb00ghgypc0swt90o4.png&quot; alt=&quot;The future of software development is not fully automated. It is more leveraged, more intentional, and more collaborative.&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Sample Prompts to Get Started&lt;/h2&gt;
&lt;p&gt;One of the hardest parts of using AI coding tools for the first time is knowing what to ask. The prompts below are designed to be simple, low-risk, and useful across most tools, whether you are using a terminal agent, an AI-native IDE, or an IDE plugin.&lt;/p&gt;
&lt;p&gt;Each prompt focuses on building or modifying something small while helping you learn how the tool behaves.&lt;/p&gt;
&lt;h3&gt;Prompt 1: Create a Simple Project Skeleton&lt;/h3&gt;
&lt;p&gt;Use this to test repo awareness and file creation.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Create a simple Python project for a command-line tool.&lt;/p&gt;
&lt;p&gt;It should include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A README&lt;/li&gt;
&lt;li&gt;A main entry file&lt;/li&gt;
&lt;li&gt;A basic argument parser&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do not add extra features.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This prompt helps you see how the tool structures files and how much initiative it takes.&lt;/p&gt;
&lt;h3&gt;Prompt 2: Implement a Small Feature From a Description&lt;/h3&gt;
&lt;p&gt;Use this to test code generation quality.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Add a function that reads a CSV file and prints the top 5 rows.&lt;/p&gt;
&lt;p&gt;Assume the file path is passed as a command-line argument.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This works well in IDE plugins and editors. Review the code carefully and run it.&lt;/p&gt;
&lt;h3&gt;Prompt 3: Explain Existing Code&lt;/h3&gt;
&lt;p&gt;Use this to test understanding and explanation.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Explain what this function does and identify any edge cases.&lt;/p&gt;
&lt;p&gt;Keep the explanation concise.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is useful for learning unfamiliar code and validating AI understanding.&lt;/p&gt;
&lt;h3&gt;Prompt 4: Generate Tests&lt;/h3&gt;
&lt;p&gt;Use this to test correctness and coverage.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Write unit tests for this function.&lt;/p&gt;
&lt;p&gt;Use the existing testing framework.&lt;/p&gt;
&lt;p&gt;Cover normal cases and one edge case.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This helps establish a review habit and reinforces test-driven thinking.&lt;/p&gt;
&lt;h3&gt;Prompt 5: Refactor for Clarity&lt;/h3&gt;
&lt;p&gt;Use this to test refactoring behavior.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Refactor this code to improve readability.&lt;/p&gt;
&lt;p&gt;Do not change behavior.&lt;/p&gt;
&lt;p&gt;Keep the logic explicit.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Compare the diff to ensure intent is preserved.&lt;/p&gt;
&lt;h3&gt;Prompt 6: Simple Agentic Task (Terminal or AI-Native IDE)&lt;/h3&gt;
&lt;p&gt;Use this to test safe autonomy.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Add basic logging to this application.&lt;/p&gt;
&lt;p&gt;Use the existing logging library.&lt;/p&gt;
&lt;p&gt;Show me the changes before committing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This prompt checks whether the agent plans steps and respects boundaries.&lt;/p&gt;
&lt;h3&gt;Prompt 7: Debug a Failure&lt;/h3&gt;
&lt;p&gt;Use this to test reasoning.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This test is failing.&lt;/p&gt;
&lt;p&gt;Explain why, then propose a fix.&lt;/p&gt;
&lt;p&gt;Do not apply the fix yet.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Only apply changes after reviewing the explanation.&lt;/p&gt;
&lt;h3&gt;How to Use These Prompts Safely&lt;/h3&gt;
&lt;p&gt;Start small. Run tools in a clean project or branch. Review every change.&lt;/p&gt;
&lt;p&gt;Pay attention to how the tool interprets ambiguity. If results are surprising, refine the prompt rather than forcing acceptance.&lt;/p&gt;
&lt;p&gt;Good prompts are clear, scoped, and explicit about constraints. Treat them like lightweight specifications.&lt;/p&gt;
&lt;p&gt;These examples are not about speed. They are about learning how the tool thinks before trusting it with more responsibility.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;AI-assisted coding is no longer a single category of tools. It is an ecosystem with distinct approaches, tradeoffs, and philosophies.&lt;/p&gt;
&lt;p&gt;Terminal agents offer raw power and automation. AI-native IDEs rethink how development flows. IDE-embedded assistants provide steady gains with minimal disruption. Each has a place, and each serves different kinds of work.&lt;/p&gt;
&lt;p&gt;The most important takeaway is intentionality. The value of these tools depends less on which one you choose and more on how you use it. Clear goals, strong review practices, and appropriate guardrails matter more than novelty.&lt;/p&gt;
&lt;p&gt;AI does not replace good engineering. It rewards it.&lt;/p&gt;
&lt;p&gt;Developers who understand their systems, communicate intent clearly, and exercise judgment will see the greatest benefit. Those who treat AI as a shortcut risk confusion and fragility.&lt;/p&gt;
&lt;p&gt;The opportunity is significant. Used well, AI-assisted coding can reduce toil, accelerate learning, and free time for higher-level thinking. The tools are ready. The challenge now is&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Building Pangolin - My Holiday Break, an AI IDE, and a Lakehouse Catalog for the Curious</title><link>https://iceberglakehouse.com/posts/2026-01-the-story-of-pangolin-catalog/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-01-the-story-of-pangolin-catalog/</guid><description>
**Get Data Lakehouse Books:**

- [Apache Iceberg: The Definitive Guide](https://drmevn.fyi/tableformatblog)
- [Apache Polaris: The Defintive Guide](h...</description><pubDate>Thu, 15 Jan 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Defintive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.puppygraph.com/ebooks/apache-iceberg-digest-vol-1&quot;&gt;The Apache Iceberg Digest: Vol. 1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Community:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Roll&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://osscommunity.com&quot;&gt;OSS Community Listings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://developer.dremio.com&quot;&gt;Dremio Lakehouse Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;1. Introduction: A Holiday, an Agent, and an Idea&lt;/h2&gt;
&lt;p&gt;In December 2025, Google released something that changed how I code - &lt;strong&gt;Antigravity IDE&lt;/strong&gt;. It wasn’t just another brilliant editor. It came packed with AI agents that could write code, test it, refactor it, and even debug alongside you. Naturally, I had to try it out.&lt;/p&gt;
&lt;p&gt;I didn’t jump right into building a big project. Instead, I used it to make some tooling for &lt;a href=&quot;https://www.dremio.com&quot;&gt;Dremio&lt;/a&gt; and &lt;a href=&quot;https://iceberg.apache.org&quot;&gt;Apache Iceberg&lt;/a&gt;, both technologies I work with frequently. That experience set the foundation for something bigger: &lt;a href=&quot;https://pangolincatalog.org&quot;&gt;&lt;strong&gt;Pangolin&lt;/strong&gt;&lt;/a&gt;, an open-source, feature-rich lakehouse catalog.&lt;/p&gt;
&lt;p&gt;This blog tells the story of how Pangolin came to be. It’s not a pitch for production use. It’s a working concept, a glimpse into what’s possible.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/ZnXcL8e.png&quot; alt=&quot;The Pangolin Journey Begins&quot;&gt;&lt;/p&gt;
&lt;h2&gt;2. First Steps: Learning to Trust the Agent&lt;/h2&gt;
&lt;p&gt;Before Pangolin, I started small. I needed to understand how to work with the Antigravity coding agent in a way that felt predictable and collaborative. So I created four tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-cloud-dremioframe&quot;&gt;&lt;strong&gt;dremioframe&lt;/strong&gt;&lt;/a&gt;: A DataFrame-style API for building Dremio SQL queries in Python.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/AlexMercedCoder/iceframe&quot;&gt;&lt;strong&gt;iceframe&lt;/strong&gt;&lt;/a&gt;: A similar API, but for building Iceberg-compatible queries using local compute.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-python-cli/blob/main/readme.md&quot;&gt;&lt;strong&gt;dremio-cli&lt;/strong&gt;&lt;/a&gt;: A command-line tool for interacting with Dremio.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/AlexMercedCoder/iceberg-cli&quot;&gt;&lt;strong&gt;iceberg-cli&lt;/strong&gt;&lt;/a&gt;: A CLI that filled in the gaps left by &lt;code&gt;pyiceberg&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These tools weren’t just functional; they became learning tools. I practiced writing clear prompts, specifying inputs and outputs, and most importantly, asking the agent to generate and refine unit, live, and regression tests. I also got better at pushing back when something didn’t work.&lt;/p&gt;
&lt;p&gt;Once I felt confident in that workflow, writing specs, prompting the agent, challenging assumptions, and getting results, I was ready to build something bigger.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/CcYEQjR.png&quot; alt=&quot;Learning to Work with Google&apos;s Antigravity&quot;&gt;&lt;/p&gt;
&lt;h2&gt;3. Rethinking the Lakehouse Catalog&lt;/h2&gt;
&lt;p&gt;Catalogs are central to the Iceberg ecosystem. They’re how engines discover, manage, and track tables. But most catalogs out there either focus on infrastructure or metadata - not both.&lt;/p&gt;
&lt;p&gt;Some great projects inspired Pangolin:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://projectnessie.org&quot;&gt;&lt;strong&gt;Project Nessie&lt;/strong&gt;&lt;/a&gt;: Created at Dremio, Nessie brought Git-like versioning to data catalogs. It’s a brilliant idea that still powers tools like &lt;a href=&quot;https://www.bauplanlabs.com&quot;&gt;Bauplan&lt;/a&gt;. But Nessie doesn’t support features like multi-tenancy or catalog federation.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://polaris.apache.org&quot;&gt;&lt;strong&gt;Apache Polaris&lt;/strong&gt;&lt;/a&gt;: Polaris, co-created by Dremio and Snowflake and now an Apache Incubator project, is well on its way to becoming the open standard. It supports RBAC, catalog federation, generic assets, and upcoming table services that proxy metadata processing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Business metadata platforms (DataHub, Atlan, Collibra, etc.)&lt;/strong&gt;: These tools focus on discovery and access workflows, and some now support Iceberg. But they bolt onto a catalog - they don’t start as one.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That got me thinking: &lt;em&gt;What if a single open source catalog could do it all?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Pangolin became my experiment to find out.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/vAYnnV5.png&quot; alt=&quot;What if a data lakehouse catalog had it all?&quot;&gt;&lt;/p&gt;
&lt;h2&gt;4. Feature List: The Dream Catalog&lt;/h2&gt;
&lt;p&gt;Before writing a single line of code, I wrote down everything I wanted this catalog to do, the features I admired in other tools, the gaps I noticed, and a few experiments I just wanted to try.&lt;/p&gt;
&lt;p&gt;Here’s what ended up on the list:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Catalog versioning&lt;/strong&gt;, with support for branching and merging, but scoped-branches don&apos;t have to affect all tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Catalog federation&lt;/strong&gt;, so one catalog can reference others.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generic asset support&lt;/strong&gt;, to register Delta tables, CSV datasets, or even external databases alongside Iceberg tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Business metadata&lt;/strong&gt;, including access requests and grant workflows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-tenancy&lt;/strong&gt;, so each team can work in its own isolated space.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RBAC and TBAC (tag-based access control)&lt;/strong&gt;, to control access based on roles and tags.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No-auth mode&lt;/strong&gt;, to make it easy to spin up and test locally.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Credential vending&lt;/strong&gt;, with built-in support for AWS, Azure, GCP, and S3-compatible systems.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pluggable backends&lt;/strong&gt;, starting with PostgreSQL and MongoDB for metadata persistence.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That’s a lot. But I didn’t set out to build a polished product - I just wanted to see if it was possible.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/4hxzR01.png&quot; alt=&quot;The features I wanted for Pangolin Catalog&quot;&gt;&lt;/p&gt;
&lt;h2&gt;5. Choosing the Stack: Why Rust, Python, and Svelte&lt;/h2&gt;
&lt;p&gt;With the feature list in hand, the next decision was the tech stack. I know Python and JavaScript like the back of my hand, which would’ve made it easy to move fast. But I wanted something that would scale better - and maybe be a little less error-prone.&lt;/p&gt;
&lt;p&gt;I considered three languages for the backend: &lt;strong&gt;Java&lt;/strong&gt;, &lt;strong&gt;Go&lt;/strong&gt;, and &lt;strong&gt;Rust&lt;/strong&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Java is the standard in the data world. But writing clean, scalable Java means understanding the JVM inside and out. I know it - but not enough to move quickly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Go is simple and efficient. Rust is strict and safe. Between the two, I picked &lt;strong&gt;Rust&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Rust’s compiler errors are frustrating at first but turn into a superpower. The strong typing and detailed feedback also pair well with AI agents; errors are easier to reason about and fix through prompting.&lt;/p&gt;
&lt;p&gt;For the rest of the stack:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Rust&lt;/strong&gt; powers the backend and CLI.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python&lt;/strong&gt; powers the SDK and scripting layer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Svelte&lt;/strong&gt; powers the UI - lightweight and reactive, but more complex than I expected once the feature count grew.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All in, I ended up with a full stack that balanced experimentation and real-world usability. The only problem was... building it all over a holiday break.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/oiv9XeG.png&quot; alt=&quot;The Tech Chosen to Build Pangolin Catalog&quot;&gt;&lt;/p&gt;
&lt;h2&gt;6. Building It: 100 Hours, Three Interfaces, and a Lot of Feedback Loops&lt;/h2&gt;
&lt;p&gt;Once I committed to the stack, the pace picked up fast. I spent roughly 100 hours on Pangolin, which ended up taking most of my holiday break. The backend came together first, followed by the Rust-based CLI and then the Python SDK.&lt;/p&gt;
&lt;p&gt;The backend covered all the core ideas: catalogs, tenants, assets, access rules, and credential vending. Rust helped here. The compiler forced clarity. Each time something felt vague, the type system pushed back until the design made sense.&lt;/p&gt;
&lt;p&gt;The Python SDK turned out better than I expected. It didn’t just wrap the API. It made some features practical. Generic assets are a good example. Through the SDK, those assets became usable for sharing database connections, Delta tables, CSV datasets, and other non-Iceberg data without much friction.&lt;/p&gt;
&lt;p&gt;The hardest part was the UI.&lt;/p&gt;
&lt;p&gt;With so many features, state management became tricky fast. I used Antigravity’s browser agent early on, and it helped catch basic issues. Once the UI grew more complex, manual testing worked better. I spent a lot of time clicking through edge cases, capturing network requests, reading console errors, and feeding that context back to the agent. It was slower, but it worked.&lt;/p&gt;
&lt;p&gt;By the end, Pangolin had three real interfaces: a Rust CLI, a Python SDK, and a Svelte UI. All of them worked against the same API and feature set.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/AwXP27Y.png&quot; alt=&quot;100 hours developing Pangolin Catlaog&quot;&gt;&lt;/p&gt;
&lt;h2&gt;7. What Pangolin Is - and What It Isn’t&lt;/h2&gt;
&lt;p&gt;Pangolin exists. You can run it. You can click around, create catalogs, register assets, request access, and vend credentials across clouds.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/yWDNUcd.png&quot; alt=&quot;Pangolin Catalog Exists&quot;&gt;&lt;/p&gt;
&lt;p&gt;That said, I don’t see Pangolin as a production catalog. I don’t plan to invest heavily beyond bug fixes and minor improvements. For a truly open, production-ready lakehouse catalog, Apache Polaris is still the best option today. If you want a managed path, platforms like Dremio Catalog, which build on Polaris, handle the complex parts for you.&lt;/p&gt;
&lt;p&gt;Pangolin serves a different purpose. It’s a proof of concept. It shows what can happen when a community-oriented project tries to bring versioning, federation, governance, business metadata, and access workflows together in one place.&lt;/p&gt;
&lt;p&gt;If you’re a lakehouse nerd like me, Pangolin might be fun to explore. If it sparks ideas or nudges other projects to co-locate these features sooner, then it did its job.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.alexmerced.com/data&quot;&gt;Make sure to follow me on linkedin and substack&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/81PKZJp.png&quot; alt=&quot;Pangolin Catalog is a Question Made Real&quot;&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>What Are Recursive Language Models?</title><link>https://iceberglakehouse.com/posts/2026-01-recursive-langauge-models/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-01-recursive-langauge-models/</guid><description>
**Get Data Lakehouse Books:**

- [Apache Iceberg: The Definitive Guide](https://drmevn.fyi/tableformatblog)
- [Apache Polaris: The Defintive Guide](h...</description><pubDate>Sat, 10 Jan 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Defintive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.puppygraph.com/ebooks/apache-iceberg-digest-vol-1&quot;&gt;The Apache Iceberg Digest: Vol. 1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Community:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Roll&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://osscommunity.com&quot;&gt;OSS Community Listings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://developer.dremio.com&quot;&gt;Dremio Lakehouse Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Recursive Language Models (RLMs) are language models that call themselves.&lt;/p&gt;
&lt;p&gt;That sounds strange at first - but the idea is simple. Instead of answering a question in one go, an RLM breaks the task into smaller parts, then asks itself those sub-questions. It builds the answer step by step, using structured function calls along the way.&lt;/p&gt;
&lt;p&gt;This is different from how standard LLMs work. A typical model tries to predict the full response directly from a prompt. If the task has multiple steps, it has to manage them all in a single stream of text. That can work for short tasks, but it often falls apart when the model needs to remember intermediate results or reuse the same logic multiple times.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xqzdf2stxn61a1jhjcd6.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;RLMs don’t try to do everything at once. They write and execute structured calls: like &lt;code&gt;CALL(&amp;quot;question&amp;quot;, args)&lt;/code&gt;, inside their own output. The system sees this call, pauses the main response, evaluates the subtask, then inserts the result and continues. It’s a recursive loop: the model is both the planner and the executor.&lt;/p&gt;
&lt;p&gt;This gives RLMs a kind of dynamic memory and control flow. They can stop, plan, re-enter themselves with new input, and combine results. That’s what makes them powerful - and fundamentally different from the static prompting methods most models use today.&lt;/p&gt;
&lt;h2&gt;What Problem Do RLMs Solve?&lt;/h2&gt;
&lt;p&gt;Language models are good at sounding smart. But when the task involves multiple steps, especially ones that depend on each other, standard models often fail.&lt;/p&gt;
&lt;p&gt;Why? Because they generate everything in a straight line.&lt;/p&gt;
&lt;p&gt;If you ask a regular LLM to solve a logic puzzle, it has to juggle the entire solution in one pass. There’s no mechanism to stop, break the task apart, and reuse parts of its own reasoning. It has no structure - just one long stream of text.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nqyodm8imk6zrniirz3h.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;Prompt engineering helps, but only up to a point. You can ask the model to “think step by step” or “show your work,” and that can improve results. But these tricks don’t change how the model actually runs. It still generates everything in one session, with no built-in way to modularize or reuse logic.&lt;/p&gt;
&lt;p&gt;Recursive Language Models change this. They treat complex tasks as programs. The model doesn’t just answer - it writes code-like calls to itself. Those calls are evaluated in real time, and their results are folded back into the response.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9mrs3yiylirta0d7mp4b.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;This lets RLMs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reuse their own logic.&lt;/li&gt;
&lt;li&gt;Focus on one part of the task at a time.&lt;/li&gt;
&lt;li&gt;Scale to deeper or more recursive problems.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In other words, RLMs solve the structure problem. They bring composability and control into language generation - two things that most LLMs still lack.&lt;/p&gt;
&lt;h2&gt;How Do RLMs Actually Work?&lt;/h2&gt;
&lt;p&gt;At the core of Recursive Language Models is a simple but powerful loop: generate, detect, call, repeat.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/is9rohgc2g1lbt2nrvvb.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;Here’s how it plays out:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;The model receives a prompt.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;It starts generating a response.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;When it hits a subtask, it emits a structured function call&lt;/strong&gt; - something like &lt;code&gt;CALL(&amp;quot;Summarize&amp;quot;, &amp;quot;text goes here&amp;quot;)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The system pauses&lt;/strong&gt;, evaluates that call by feeding it back into the same model, and gets a result.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The result is inserted&lt;/strong&gt;, and the original response resumes.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This process can happen once - or dozens of times inside a single response.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r8r9qwefzhnhr8bxxyyb.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;Let’s take a concrete example. Suppose you ask an RLM to explain a complicated technical article. Instead of trying to summarize the whole thing at once, the model might first break the article into sections. Then it could issue recursive calls to summarize each section individually. After that, it could combine those pieces into a final answer.&lt;/p&gt;
&lt;p&gt;So what’s actually new here?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The model isn’t just generating text. It’s &lt;em&gt;controlling execution&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Each function call is explicit and machine-readable. It’s not hidden in plain text.&lt;/li&gt;
&lt;li&gt;The model learns not just &lt;em&gt;what&lt;/em&gt; to say, but &lt;em&gt;when&lt;/em&gt; to delegate subtasks to itself.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3oreu6lmnf9ui304nhs8.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;This design introduces modular reasoning. It’s closer to programming than prompting. And it’s what makes RLMs capable of solving longer, deeper, and more compositional tasks than traditional LLMs.&lt;/p&gt;
&lt;h2&gt;How Are RLMs Different From Reasoning Models?&lt;/h2&gt;
&lt;p&gt;It’s easy to confuse Recursive Language Models with models designed for reasoning. After all, both aim to solve harder, multi-step problems. But they take very different paths.&lt;/p&gt;
&lt;p&gt;Reasoning models try to think better within a fixed response. They rely on prompting tricks (“Let’s think step by step”), fine-tuning, or architectural tweaks to encourage more logical answers. But they still generate their full output in one go. There’s no built-in structure or recursion - just better text generation.&lt;/p&gt;
&lt;p&gt;Recursive Language Models go further. They change how language models &lt;em&gt;run&lt;/em&gt;, not just how they &lt;em&gt;think&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9ki8oqyq0gpy55b4kykg.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;Here’s the key distinction:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reasoning models&lt;/strong&gt; operate in a flat, linear space. They can simulate step-by-step thinking, but they don’t control execution.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RLMs&lt;/strong&gt; introduce a real control flow. They can pause, emit a sub-call, re-enter themselves, and build results incrementally.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qys887jwi9f157l5rg3q.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;Think of it this way: reasoning models try to write better essays. RLMs write and run programs.&lt;/p&gt;
&lt;p&gt;This also makes RLMs easier to inspect and debug. Each recursive call is explicit. You can see the full tree of operations the model performed - what it asked, what it answered, and how it combined the results. That transparency is rare in LLM workflows, and it opens the door to more robust systems.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0akxrwtfly8hqkn5opxt.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;So while reasoning models stretch the limits of static prompting, RLMs redefine what a model can do at runtime.&lt;/p&gt;
&lt;h2&gt;Why Recursion Changes What LLMs Can Do&lt;/h2&gt;
&lt;p&gt;Recursion isn’t just a technical upgrade - it’s a shift in what language models are capable of.&lt;/p&gt;
&lt;p&gt;With recursion, models don’t have to guess the whole answer in one pass. They can build it piece by piece, reusing their own capabilities as needed. This unlocks new behaviors that standard models struggle with.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/byceucjiz5m6nl13e57b.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;Here’s what that looks like in practice:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Logic puzzles&lt;/strong&gt;: Instead of brute-forcing a full solution, an RLM can write out each rule, evaluate sub-cases, and combine the results.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Math word problems&lt;/strong&gt;: The model can break a complex problem into steps, solve each one recursively, and verify intermediate answers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code generation&lt;/strong&gt;: RLMs can draft a function, then call themselves to write test cases, fix bugs, or generate helper functions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof generation&lt;/strong&gt;: For theorem proving, recursion lets the model build a proof tree, checking smaller lemmas along the way.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the paper’s experiments, RLMs outperformed non-recursive baselines on multi-step benchmarks. They were also &lt;em&gt;more efficient&lt;/em&gt;. Recursive calls reduced total token usage, because the model could reuse logic instead of repeating it .&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q6896v8g85mmclp8ipqd.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;This is a key point: recursion isn’t just about accuracy. It’s also about &lt;em&gt;efficiency&lt;/em&gt; and &lt;em&gt;composability&lt;/em&gt;. Instead of scaling linearly with problem size, RLMs can scale logarithmically by solving smaller pieces and reusing solutions.&lt;/p&gt;
&lt;p&gt;That makes them a better fit for tasks where reasoning depth grows quickly - exactly the kind of problems LLMs are starting to face in real-world applications.&lt;/p&gt;
&lt;h2&gt;Why This Matters Now&lt;/h2&gt;
&lt;p&gt;Language models are everywhere - but most still follow a simple pattern: input goes in, output comes out. That’s fine for quick answers or lightweight tasks. But for anything complex, it’s not enough.&lt;/p&gt;
&lt;p&gt;Today, developers are building agents, chains, and tool-using systems on top of LLMs. These wrappers simulate structure, but they’re often fragile. They rely on prompt hacking, regex parsing, and external orchestration to manage what the model can’t do natively.&lt;/p&gt;
&lt;p&gt;Recursive Language Models offer a cleaner path. Instead of bolting on structure from the outside, they build it in.&lt;/p&gt;
&lt;p&gt;This matters for a few reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Fewer moving parts&lt;/strong&gt;: RLMs remove the need for external chains or custom routing logic. The model decides when and how to branch.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Greater transparency&lt;/strong&gt;: Each recursive call is visible and traceable. You can audit what the model did, step by step.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Better generalization&lt;/strong&gt;: Once trained to use recursion, the model can apply it flexibly across domains - math, code, reasoning, even planning.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And we’re just getting started. RLMs are early, but they hint at a broader shift: treating models not just as generators, but as runtime environments. That opens the door to future systems where models can plan, act, and adapt on their own, with clear structure behind every step.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u2g8vytnrdnb5czywiiv.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;If the last few years were about making LLMs sound smart, the next few might be about making them &lt;em&gt;think&lt;/em&gt; with structure. That’s where recursion fits in.&lt;/p&gt;
&lt;h2&gt;Conclusion: A New Way to Think with Language Models&lt;/h2&gt;
&lt;p&gt;Recursive Language Models aren’t just a tweak to existing LLMs. They represent a shift in how models operate.&lt;/p&gt;
&lt;p&gt;Instead of treating every task as a one-shot prediction, RLMs break problems into parts, solve them recursively, and combine the results. That gives them something most language models still lack: structure.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/59pdnsjnqmulb1dlrq53.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;This structure matters. It makes models more reliable on complex tasks. It makes their reasoning easier to follow. And it opens the door to new capabilities: like planning, verifying, or adapting, without needing complex external systems.&lt;/p&gt;
&lt;p&gt;We’re still early in this space. But the idea is simple and powerful: give models the tools to use themselves. From there, a new class of language systems can emerge - not just fluent, but recursive, modular, and built to handle depth.&lt;/p&gt;
&lt;p&gt;RLMs don’t just make better answers. They make better models.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>2025 Year in Review Apache Iceberg, Polaris, Parquet, and Arrow</title><link>https://iceberglakehouse.com/posts/2025-12-2025-year-in-review-iceberg-arrow-polaris-parquet/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-12-2025-year-in-review-iceberg-arrow-polaris-parquet/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2025-12-2025-year-in-review-iceberg-arro...</description><pubDate>Mon, 29 Dec 2025 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2025-12-2025-year-in-review-iceberg-arrow-polaris-parquet/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Defintive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.puppygraph.com/ebooks/apache-iceberg-digest-vol-1&quot;&gt;The Apache Iceberg Digest: Vol. 1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Community:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Roll&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://osscommunity.com&quot;&gt;OSS Community Listings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://developer.dremio.com&quot;&gt;Dremio Lakehouse Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;The open lakehouse is no longer a concept. In 2025, key Apache projects matured, making data warehouse performance on object storage a practical reality. This post walks through the most critical developments in four of those projects: Iceberg, Polaris, Parquet, and Arrow. Each is building a critical layer for an open, engine-agnostic analytics stack.&lt;/p&gt;
&lt;p&gt;We start with Iceberg.&lt;/p&gt;
&lt;h2&gt;Apache Iceberg&lt;/h2&gt;
&lt;p&gt;Iceberg spent 2025 delivering the core elements of Format Version 3, while setting the stage for a more indexable and cache-friendly V4 format. Its release cadence remained steady and focused. The project shipped three main versions: 1.8.0 in February, 1.9.0 in April, and 1.10.0 in September.&lt;/p&gt;
&lt;h3&gt;What shipped in 2025&lt;/h3&gt;
&lt;p&gt;Iceberg 1.8.0 introduced deletion vectors, default column values, and row-level lineage metadata. These features help engines express updates more efficiently while tracking the origin of each record .&lt;/p&gt;
&lt;p&gt;Version 1.9.0 expanded type support. Iceberg now includes a &lt;code&gt;variant&lt;/code&gt; type for semi-structured data and geospatial types for geometry-based filtering. The release also added nanosecond timestamps and improved the semantics of equality deletes .&lt;/p&gt;
&lt;p&gt;By 1.10.0, the project had added encryption key metadata, cleanup logic for orphaned delete vectors, and full compatibility with Spark 4.0 . Partition statistics were made incremental to reduce overhead in large-scale table planning.&lt;/p&gt;
&lt;p&gt;These changes matter. Deletion vectors reduce the cost of updates. Default column values simplify table evolution. Variant support opens the door to querying nested JSON and evolving schemas. Together, these features make Iceberg more expressive and more efficient.&lt;/p&gt;
&lt;h3&gt;What&apos;s coming next&lt;/h3&gt;
&lt;p&gt;The community has started preparing for Format V4. Key goals include native index support and a formal caching model. The Iceberg dev list also agreed to raise the Java baseline to JDK 17, clearing the way for future performance and security improvements .&lt;/p&gt;
&lt;p&gt;Work is also underway to extend the REST catalog spec. This will improve consistency across catalogs like Polaris and make multi-engine deployments behave more predictably.&lt;/p&gt;
&lt;p&gt;All of this reflects a clear direction. Iceberg is not only stable, but optimized. It is now equipped to support warehouse workloads with ACID guarantees, even on cloud object storage.&lt;/p&gt;
&lt;h2&gt;Apache Polaris&lt;/h2&gt;
&lt;p&gt;Polaris is a new incubating project, but in 2025 it made a fast entrance. Its purpose is simple: act as a shared catalog and governance layer for Iceberg tables across multiple query engines. This includes Spark, Flink, Dremio, Trino, StarRocks, and any system that supports Iceberg&apos;s REST catalog protocol.&lt;/p&gt;
&lt;h3&gt;Why Polaris matters&lt;/h3&gt;
&lt;p&gt;Today, companies often manage Iceberg tables across multiple engines. Each engine needs a way to authenticate, authorize, and operate on metadata safely. Polaris fills that gap. It provides a consistent API, stores policies centrally, and handles short-term credential vending through built-in integrations with cloud providers.&lt;/p&gt;
&lt;p&gt;This makes Polaris one of the first Iceberg-native catalogs to support full multi-engine access, with RBAC and table-level security as first-class features.&lt;/p&gt;
&lt;h3&gt;What shipped in 2025&lt;/h3&gt;
&lt;p&gt;Polaris released three versions in its first year: 1.0.0-incubating in July, 1.1.0 in September, and 1.2.0 in October.&lt;/p&gt;
&lt;p&gt;The first release included core catalog APIs, a PostgreSQL-backed persistence layer, Quarkus runtime, and initial support for snapshot and compaction policies. It also supported external identity providers, ETag-based caching, and federated metadata views .&lt;/p&gt;
&lt;p&gt;Version 1.1.0 added Hive Metastore integration, support for S3-compatible stores like MinIO, and improvements to modularity and CLI tooling .&lt;/p&gt;
&lt;p&gt;Version 1.2.0 focused on governance. It expanded RBAC, introduced fine-grained update permissions, and added event logging. AWS Aurora IAM login support also shipped, helping teams standardize credentials across engines .&lt;/p&gt;
&lt;h3&gt;What&apos;s coming next&lt;/h3&gt;
&lt;p&gt;Polaris is not standing still. Active mailing list discussions show interest in idempotent commit operations, improved retries, and broader NoSQL compatibility. The project is also planning to support Delta Lake tables through its generic table APIs.&lt;/p&gt;
&lt;p&gt;Polaris is already production-ready for Iceberg. It supports time travel, commit retries, STS credential vending, and a policy-based governance model. These capabilities make it the metadata backbone of an open lakehouse.&lt;/p&gt;
&lt;h2&gt;Apache Parquet&lt;/h2&gt;
&lt;p&gt;Parquet is the disk format most Iceberg tables use. In 2025, the project focused on performance and long-term maintainability. While its interface has changed little, its internals received key upgrades.&lt;/p&gt;
&lt;h3&gt;What shipped in 2025&lt;/h3&gt;
&lt;p&gt;The biggest release was Parquet Java 1.16.0 in September. It removed legacy Hadoop 2 support, raised the Java baseline to 11, and enabled vectorized reads by default. These changes help projects like Iceberg, Trino, and Spark take advantage of faster scan paths with less configuration .&lt;/p&gt;
&lt;p&gt;The update also refreshed core dependencies like Protobuf and Jackson, fixed bugs in nested field casting, and added CLI support for printing size statistics. For teams managing data layout at scale, this makes table introspection simpler and safer.&lt;/p&gt;
&lt;p&gt;On the C++ side, version 12.0 of the Parquet format finalized support for Decimal32 and Decimal64 encodings. These types make aggregations and filters on fixed-point numbers more space-efficient .&lt;/p&gt;
&lt;h3&gt;What&apos;s coming next&lt;/h3&gt;
&lt;p&gt;The Parquet community has begun discussing what a V3 format might look like. Topics include FSST-based string encoding, cleaner metadata layouts, and faster bloom filter indexing. These ideas aim to reduce scan times and improve filter pushdown without breaking compatibility .&lt;/p&gt;
&lt;p&gt;The dev list also revisited lingering V2 features like optional checksums and page-level statistics. There is consensus that these will stabilize in 2026, completing the long tail of V2 work before any format transition.&lt;/p&gt;
&lt;p&gt;Parquet’s future is evolutionary, not disruptive. The team is focused on speed, compatibility, and precision. That’s exactly what Iceberg and other engines need from their storage format.&lt;/p&gt;
&lt;h2&gt;Apache Arrow&lt;/h2&gt;
&lt;p&gt;Arrow provides the in-memory columnar format that many engines use to exchange data without copying or re-encoding. In 2025, the project extended its feature set, added new bindings, and continued improving compute performance.&lt;/p&gt;
&lt;h3&gt;What shipped in 2025&lt;/h3&gt;
&lt;p&gt;Arrow released versions 20.0.0 in April, 21.0.0 in July, and 22.0.0 in October. Each brought changes across the stack, including C++, Python, Java, and R bindings.&lt;/p&gt;
&lt;p&gt;The October release expanded compute functions with new regex matchers, selection kernels, and logical operators. It also improved CSV read/write performance, added support for &lt;code&gt;attrs&lt;/code&gt; in Pandas DataFrames, and stabilized Decimal32 and Decimal64 support across languages .&lt;/p&gt;
&lt;p&gt;Arrow Flight, the RPC layer, shipped a working SQL client implementation. This lays the groundwork for distributed query pushdown using Arrow buffers. Timezone-aware types also advanced, with the community approving a new &lt;code&gt;TimestampWithOffset&lt;/code&gt; type to better handle UTC offsets in analytical workflows .&lt;/p&gt;
&lt;p&gt;Language support improved too. Arrow released official wheels for modern Linux platforms, added MATLAB bindings, and expanded test coverage for R and Julia. These improvements reduce friction when adopting Arrow across new platforms.&lt;/p&gt;
&lt;h3&gt;What&apos;s coming next&lt;/h3&gt;
&lt;p&gt;Arrow’s roadmap points toward broader Flight SQL adoption, faster filter and projection kernels, and more alignment between language libraries. Mailing list discussion shows active work on offset encoding, enum types, and compression improvements.&lt;/p&gt;
&lt;p&gt;More importantly, Arrow is no longer just a format. It’s becoming an interoperability layer for lakehouse engines. With zero-copy sharing across Spark, Dremio, DuckDB, and beyond, Arrow enables the low-latency experience users expect from a warehouse.&lt;/p&gt;
&lt;p&gt;Arrow’s 2025 work reinforced that direction: fast, portable, and deeply integrated with the tools that matter.&lt;/p&gt;
&lt;h2&gt;Wrapping up&lt;/h2&gt;
&lt;p&gt;Apache Iceberg, Polaris, Parquet, and Arrow all pushed forward in 2025. Each project focused on practical features that improve performance, governance, or compatibility. Together, they form a foundation for a warehouse experience on open data.&lt;/p&gt;
&lt;p&gt;This year’s progress wasn’t about experimentation. It was about consolidation. The features that shipped: from deletion vectors to vectorized reads to Flight SQL, are already in production. They make it easier to build, operate, and scale lakehouse systems.&lt;/p&gt;
&lt;p&gt;In 2026, expect the conversation to shift from format maturity to engine convergence. With multi-engine catalogs, index-aware tables, and in-memory interoperability in place, the future looks a lot more accessible. And a lot faster.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>dremioframe &amp; iceberg - Pythonic interfaces for Dremio and Apache Iceberg</title><link>https://iceberglakehouse.com/posts/2025-12-dremioframe-and-iceframe/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-12-dremioframe-and-iceframe/</guid><description>
Modern data teams want simple tools to work with Iceberg tables and Dremio. Two new Python libraries now make that work easier. The first is DremioFr...</description><pubDate>Fri, 05 Dec 2025 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Modern data teams want simple tools to work with Iceberg tables and Dremio. Two new Python libraries now make that work easier. The first is DremioFrame. It gives you a clear set of functions for managing your Dremio Cloud or Dremio Software project through code. The second is IceFrame. It gives you a direct way to create and maintain Iceberg tables using PyIceberg and Polars with native extensions. Both libraries are in alpha. This is the best time to try them, share your ideas, and report issues.&lt;/p&gt;
&lt;p&gt;You can test them with a free 30-day Dremio Cloud trial that includes $400 in credits. Sign up &lt;a href=&quot;https://drmevn.fyi/am-get-started&quot;&gt;here to get started&lt;/a&gt; . The trial includes a built-in Apache Polaris-based Iceberg catalog (on the ui you&apos;ll see a namespaces section, that&apos;s the catalog), so you can create tables and explore them from both libraries. This lets you know how the tools fit into real workflows with no setup.&lt;/p&gt;
&lt;p&gt;The goal of both libraries is simple. They remove friction. They give you short, readable code. They help you move from idea to result with less effort. They both have built-in AI Agents for assisting you generate code using the library and more. Early feedback from real users will shape their future. Your tests and your questions will guide the next steps. This article introduces the two projects and shows how they work together.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/Sj9EKFC.png&quot; alt=&quot;dremioframe and iceframe&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Why These Libraries Exist&lt;/h2&gt;
&lt;p&gt;Python is the language many teams use for data work. People write scripts, build pipelines, and test ideas in notebooks. Yet working with Iceberg tables or the Dremio REST API often means long code and many repeated steps. These two libraries remove that weight.&lt;/p&gt;
&lt;p&gt;DremioFrame gives you a direct way to manage your Dremio catalog, users, views, and jobs. You write clear code that creates folders, defines views, and handles security rules. You no longer need to build each API request by hand.&lt;/p&gt;
&lt;p&gt;IceFrame gives you a focused set of tools for Iceberg tables. You can compact files, evolve partitions, and run maintenance tasks with short commands.&lt;/p&gt;
&lt;p&gt;Both libraries aim to shorten the path from idea to action. They help you test new patterns, share scripts with your team, and work with Iceberg and Dremio in a direct way.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/NgAbTNY.png&quot; alt=&quot;How dremioframe and iceframe work&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Meet DremioFrame&lt;/h2&gt;
&lt;p&gt;DremioFrame is a Python client for Dremio Cloud and Dremio Software. It wraps the REST API in a clean set of methods. You can manage sources, folders, views, tags, and security rules with short commands. You can also run SQL and work with query results as DataFrames.&lt;/p&gt;
&lt;p&gt;The library gives you a clear structure. You access the catalog through &lt;code&gt;client.catalog&lt;/code&gt;. You manage users and roles through &lt;code&gt;client.admin&lt;/code&gt;. You can also manage reflections that speed up queries. Each action is a direct Python call that maps to a known Dremio feature.&lt;/p&gt;
&lt;p&gt;The design is simple. You write code that creates a source, builds a view, assigns a policy, or deletes an item. You do not handle request URLs or version tags yourself. This helps teams move faster and keep their scripts readable.&lt;/p&gt;
&lt;p&gt;DremioFrame fits well in automation. You can create large batches of folders or datasets through parallel calls. You can also use it in small scripts that update a single view. The goal is to make Dremio easier to use in everyday work.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/1G2hEi2.png&quot; alt=&quot;dremioframe leverages the Dremio Engine&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Meet IceFrame&lt;/h2&gt;
&lt;p&gt;IceFrame is a Python library that gives you direct control over Iceberg tables. It focuses on clear commands that help you maintain data and keep tables fast. You can compact small files, sort data, evolve partitions, and clear old snapshots. Each task uses a short call that reflects the action you want to take.&lt;/p&gt;
&lt;p&gt;The library also supports Iceberg views when the catalog allows it. You can define a view with a simple SQL string and replace it when your logic changes. You can also call stored procedures that handle cleanup and maintenance. This includes rewriting files, removing orphan files, and keeping only recent snapshots.&lt;/p&gt;
&lt;p&gt;IceFrame includes an AI assistant for table exploration. You can ask questions in plain language. The tool can show schemas, write example code, and suggest filters or joins. This helps new users learn how the data is shaped and how to work with it.&lt;/p&gt;
&lt;p&gt;The goal is steady control with minimal code. You keep your tables healthy and easy to query. You also gain tools to understand your data without long setup or manual checks.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/E9p232c.png&quot; alt=&quot;iceframe&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Why Dremio Cloud Is the Best Place to Try Them&lt;/h2&gt;
&lt;p&gt;Dremio Cloud gives you a smooth way to test both libraries. The trial includes a built-in Iceberg catalog with hosted storage (you can use your own storage with a non-trial account), so you can create tables right away. You do not need to run a separate service or set up extra storage. You write code, create a table, and see it in the Dremio catalog within seconds.&lt;/p&gt;
&lt;p&gt;The free 30-day trial includes $400 in credits. You can sign up at &lt;a href=&quot;https://drmevn.fyi/am-get-started&quot;&gt;the get started page&lt;/a&gt;. This gives you enough room to explore IceFrame operations, build views with DremioFrame, and test how the two tools work together.&lt;/p&gt;
&lt;p&gt;The setup is light. You create a personal access token, connect through Python, and begin writing code. You can also switch between the console and your scripts to see changes in real time. This makes the trial a strong place for experiments, quick tests, and early feedback.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/ipzlvnA.png&quot; alt=&quot;Dremioframe in action&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/1XHwCGQ.png&quot; alt=&quot;Iceframe in Action&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Using Both Libraries Together&lt;/h2&gt;
&lt;p&gt;You can use IceFrame and DremioFrame in the same workflow. IceFrame lets you create and shape Iceberg tables locally. DremioFrame lets you see those tables in the catalog, build views on top of them alongside other databases/lakes/warehouses, and apply rules for access or masking. This gives you one flow from data creation to data use.&lt;/p&gt;
&lt;p&gt;A simple pattern looks like this. You can write to and manage lightweight Iceberg tables using iceframe for local processing, and use dremioframe to work with Dremio for extensive data processing and query federation with databases/lakes/warehouses, and to curate a semantic and governance layer on top of your data.&lt;/p&gt;
&lt;p&gt;You do not move between many tools. You do not manage long request bodies. You write small blocks of code that express the action you need. This helps teams test new ideas and keep their work easy to read and share.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/BBJMzRh.png&quot; alt=&quot;Using dremioframe and iceframe together&quot;&gt;&lt;/p&gt;
&lt;h2&gt;How to Get Started&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-cloud-dremioframe&quot;&gt;Dremioframe Repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pypi.org/project/dremioframe/&quot;&gt;Dremioframe on Pypi&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/AlexMercedCoder/iceframe&quot;&gt;Iceframe Repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pypi.org/project/iceframe/&quot;&gt;Iceframe on Pypi&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can install both libraries with a single step. Run &lt;code&gt;pip install dremioframe&lt;/code&gt; and &lt;code&gt;pip install iceframe&lt;/code&gt;. You can then import them in any script or notebook. This gives you direct access to the Dremio catalog and your Iceberg tables.&lt;/p&gt;
&lt;p&gt;You do not need to clone the repos to use the tools. Cloning is only needed if you want to read the source code or contribute changes. Most users will install the packages from PyPI and begin writing code right away.&lt;/p&gt;
&lt;p&gt;After installation, you create a personal access token in Dremio Cloud. You pass that token to DremioFrame when you create the client. You also point IceFrame at your Iceberg catalog. Once this is done, you can create tables, define views, and run cleanup tasks with short commands.&lt;/p&gt;
&lt;h2&gt;Side-by-Side Examples&lt;/h2&gt;
&lt;p&gt;The two libraries serve different roles, but they work well together. The examples below show how to connect, run a simple query, and create a table in each library. The code stays short in both cases.&lt;/p&gt;
&lt;h3&gt;Connect&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;DremioFrame&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from dremioframe.client import DremioClient

client = DremioClient(
    token=&amp;quot;YOUR_DREMIO_CLOUD_PAT&amp;quot;,
    project_id=&amp;quot;YOUR_PROJECT_ID&amp;quot;
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;IceFrame&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from iceframe import IceFrame

ice = IceFrame(
    {
        &amp;quot;uri&amp;quot;: &amp;quot;https://catalog.dremio.cloud/api/iceberg/v1&amp;quot;,
        &amp;quot;token&amp;quot;: &amp;quot;YOUR_DREMIO_CLOUD_PAT&amp;quot;,
        &amp;quot;project_id&amp;quot;: &amp;quot;YOUR_PROJECT_ID&amp;quot;
    }
)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Run a Query&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;DremioFrame&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;df = client.sql.run(&amp;quot;SELECT 1 AS value&amp;quot;)
print(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;IceFrame&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;result = ice.query(&amp;quot;some_table&amp;quot;).limit(10).execute()
print(result)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Create a Table&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;DremioFrame&lt;/strong&gt;
You create a view or dataset through the catalog. Here is a simple view example.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;client.catalog.create_view(
    path=[&amp;quot;Samples&amp;quot;, &amp;quot;small_view&amp;quot;],
    sql=&amp;quot;SELECT * FROM Samples.samples.Employees&amp;quot;
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;IceFrame&lt;/strong&gt;
You create an Iceberg table by writing data.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from datetime import datetime

data = [
    {&amp;quot;id&amp;quot;: 1, &amp;quot;name&amp;quot;: &amp;quot;Ada&amp;quot;, &amp;quot;created_at&amp;quot;: datetime.utcnow()},
    {&amp;quot;id&amp;quot;: 2, &amp;quot;name&amp;quot;: &amp;quot;Max&amp;quot;, &amp;quot;created_at&amp;quot;: datetime.utcnow()}
]

ice.create_table(&amp;quot;my_table&amp;quot;, data=data)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These examples show the contrast. DremioFrame works with the Dremio catalog. IceFrame works with Iceberg storage. When used together, they give you a complete path from data creation to query.&lt;/p&gt;
&lt;h2&gt;Query Builder Examples&lt;/h2&gt;
&lt;p&gt;Both libraries include a query builder. Each builder keeps the code readable and avoids long SQL strings. The examples below show how each one works.&lt;/p&gt;
&lt;h3&gt;DremioFrame Query Builder&lt;/h3&gt;
&lt;p&gt;DremioFrame can build SQL through a fluent API. You call &lt;code&gt;client.table(...)&lt;/code&gt; to start. You then add filters, selects, joins, or limits. The builder compiles the final SQL when you run the query.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Start with a table in the Dremio catalog
df = (
    client.table(&amp;quot;Samples.samples.Employees&amp;quot;)
        .select(&amp;quot;employee_id&amp;quot;, &amp;quot;full_name&amp;quot;, &amp;quot;department&amp;quot;)
        .filter(&amp;quot;department = &apos;Engineering&apos;&amp;quot;)
        .limit(5)
        .run()
)

print(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This pattern helps when you want to build queries from variables or reuse parts of the logic. The SQL stays clean, and the structure is easy to read.&lt;/p&gt;
&lt;h3&gt;IceFrame Query Builder&lt;/h3&gt;
&lt;p&gt;IceFrame includes a builder for Iceberg tables. You call &lt;code&gt;ice.query(&amp;quot;table_name&amp;quot;)&lt;/code&gt; to start. You can then filter rows, pick columns, join tables, or sort results. The builder runs the final plan with &lt;code&gt;execute()&lt;/code&gt;. It will determine what parts of the query should be used for Iceberg predicate pushdown and what should be handled after scanning the data for better peformance.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from iceframe.expressions import Column

result = (
    ice.query(&amp;quot;my_table&amp;quot;)
        .filter(Column(&amp;quot;id&amp;quot;) &amp;gt; 10)
        .select(&amp;quot;id&amp;quot;, &amp;quot;name&amp;quot;)
        .sort(&amp;quot;id&amp;quot;)
        .limit(5)
        .execute()
)

print(result)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This pattern keeps the logic simple. You express intent with short steps. The code stays close to how you think about the data.&lt;/p&gt;
&lt;p&gt;Both builders help you avoid long SQL strings. They also make it easier to share examples with your team and adapt them to new cases.&lt;/p&gt;
&lt;h2&gt;Agents and Procedures&lt;/h2&gt;
&lt;p&gt;Both libraries include features that help you work faster with less manual code. Each tool offers an agent that can guide you through common tasks. IceFrame also includes direct access to Iceberg procedures that keep tables healthy.&lt;/p&gt;
&lt;h3&gt;Agents&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;DremioFrame Agent&lt;/strong&gt;&lt;br&gt;
DremioFrame includes an optional agent that can help you work with DremioFrame and Dremio. It can help you write queries, write DremioFrame scripts, and much more.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;IceFrame Agent&lt;/strong&gt;&lt;br&gt;
IceFrame includes a chat agent for Iceberg tables. You can ask about table schemas, filters, and joins. The agent can write Python code for common IceFrame tasks. It can also explain how to compact files or clean snapshots. This helps new users understand how each feature works. It also helps teams share patterns in a simple way.&lt;/p&gt;
&lt;h3&gt;IceFrame Procedures&lt;/h3&gt;
&lt;p&gt;IceFrame gives you access to Iceberg maintenance procedures. These keep data clean and reduce the cost of reading tables. You call each procedure with a short command.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Rewrite data files
ice.call_procedure(&amp;quot;my_table&amp;quot;, &amp;quot;rewrite_data_files&amp;quot;, target_file_size_mb=256)

# Remove old snapshots
ice.call_procedure(&amp;quot;my_table&amp;quot;, &amp;quot;expire_snapshots&amp;quot;, older_than_ms=7 * 24 * 3600 * 1000)

# Remove orphan files
ice.call_procedure(&amp;quot;my_table&amp;quot;, &amp;quot;remove_orphan_files&amp;quot;)

# Fast-forward a branch
ice.call_procedure(&amp;quot;my_table&amp;quot;, &amp;quot;fast_forward&amp;quot;, to_branch=&amp;quot;main&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These steps help you keep tables tidy. They reduce file counts, remove unused data, and keep history at a safe size. You can schedule them or run them by hand. Paired with the agent, you have a clear path from learning a task to running it.&lt;/p&gt;
&lt;p&gt;The two libraries share a goal. They help you act faster and with less effort. The agents guide you. The procedures handle the work that keeps your tables stable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/F1xxpWP.png&quot; alt=&quot;dremioframe and iceframe&quot;&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introducing dremioframe - A Pythonic DataFrame Interface for Dremio</title><link>https://iceberglakehouse.com/posts/2025-11-introducing-dremioframe-dataframe-python-library/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-11-introducing-dremioframe-dataframe-python-library/</guid><description>
If you&apos;re a data analyst or Python developer who prefers chaining expressive `.select()` and `.mutate()` calls over writing raw SQL, you&apos;re going to ...</description><pubDate>Sat, 29 Nov 2025 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;If you&apos;re a data analyst or Python developer who prefers chaining expressive &lt;code&gt;.select()&lt;/code&gt; and &lt;code&gt;.mutate()&lt;/code&gt; calls over writing raw SQL, you&apos;re going to love &lt;code&gt;dremioframe&lt;/code&gt; : the unofficial Python DataFrame library for Dremio (currently in Alpha).&lt;/p&gt;
&lt;p&gt;Dremio has always made it easy to query across cloud and on-prem datasets using SQL. Some users prefer the ergonomics of DataFrame-style APIs, where transformations are composable, readable, and testable : especially when working in notebooks or building data pipelines in Python.&lt;/p&gt;
&lt;p&gt;That’s where &lt;code&gt;dremioframe&lt;/code&gt; comes in. It bridges the gap between SQL and Python by letting you build Dremio queries using intuitive DataFrame methods like &lt;code&gt;.select()&lt;/code&gt;, &lt;code&gt;.filter()&lt;/code&gt;, &lt;code&gt;.mutate()&lt;/code&gt;, and more. Under the hood, it still generates SQL and pushes down queries to Dremio, but you write it the way you&apos;re used to in Python.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Want to try this yourself?&lt;br&gt;
You can &lt;a href=&quot;https://drmevn.fyi/am-get-started&quot;&gt;sign up for a free 30-day trial of Dremio Cloud&lt;/a&gt;, which includes full access to Agentic AI features, native Apache Iceberg integration, and support for all Iceberg catalogs (e.g. AWS Glue, Nessie, Snowflake, Hive, etc.).&lt;br&gt;
Or if you&apos;d rather run Dremio locally for free, check out the &lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;Community Edition setup guide&lt;/a&gt;. Community Edition doesn’t include Agentic AI or full catalog support, but still lets you run federated queries and work with some Iceberg catalogs like Glue and Nessie.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In this post, we’ll walk through how to get started with &lt;code&gt;dremioframe&lt;/code&gt; - from installing the library and configuring authentication, to writing powerful queries using SQL, DataFrame chaining, and expression builders. We’ll wrap up with a look at some of the more advanced features it unlocks for analytics, ingestion, and administration.&lt;/p&gt;
&lt;p&gt;Let’s dive in.&lt;/p&gt;
&lt;h2&gt;Installing &lt;code&gt;dremioframe&lt;/code&gt; and Setting Up Your Environment&lt;/h2&gt;
&lt;p&gt;To get started, you’ll need to install the &lt;code&gt;dremioframe&lt;/code&gt; Python package. It’s published on PyPI and can be installed with pip:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;pip install dremioframe
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once installed, you’ll need to set up authentication so the library can connect to your Dremio instance. The easiest way to do this is by setting environment variables in a .env file or directly in your shell.&lt;/p&gt;
&lt;h3&gt;For Dremio Cloud (recommended for full feature access):&lt;/h3&gt;
&lt;p&gt;In your .env file (or shell), set the following:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-env&quot;&gt;DREMIO_PAT=&amp;lt;your_personal_access_token&amp;gt;
DREMIO_PROJECT_ID=&amp;lt;your_project_id&amp;gt;
DREMIO_PROJECT_NAME=&amp;lt;your_project_name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These credentials can be generated in your Dremio Cloud account by going to project settings.&lt;/p&gt;
&lt;h4&gt;Don’t have an account?&lt;/h4&gt;
&lt;p&gt;&lt;a href=&quot;https://drmevn.fyi/am-get-started&quot;&gt;Start your free 30-day trial of Dremio Cloud&lt;/a&gt; to use dremioframe with Agentic AI, native Apache Iceberg support, and full access to all Iceberg catalogs.&lt;/p&gt;
&lt;h3&gt;For Dremio Community Edition (local setup):&lt;/h3&gt;
&lt;p&gt;If you&apos;re running Dremio locally, for example using the Community Edition, you’ll use a different set of environment variables or pass connection parameters directly in code:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-env&quot;&gt;DREMIO_HOSTNAME=localhost
DREMIO_PORT=32010
DREMIO_USERNAME=admin
DREMIO_PASSWORD=password123
DREMIO_TLS=false
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Not ready for the cloud yet?&lt;/h4&gt;
&lt;p&gt;You can &lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;try the Community Edition locally by following this guide&lt;/a&gt;.
It supports federated queries and works with some Iceberg catalogs (like AWS Glue and Nessie), though it doesn’t include the AI features or full catalog support available in Dremio Cloud and Enterprise.&lt;/p&gt;
&lt;p&gt;With your environment configured, you’re ready to connect to Dremio and start querying like a Pythonista.&lt;/p&gt;
&lt;h2&gt;Creating a Dremio Client (Sync or Async)&lt;/h2&gt;
&lt;p&gt;Once your environment is set up, the next step is to create a &lt;code&gt;DremioClient&lt;/code&gt; instance. This object is your entry point for running queries with &lt;code&gt;dremioframe&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Synchronous Client&lt;/h3&gt;
&lt;p&gt;For most use cases, the synchronous client is sufficient and straightforward to use. If you&apos;ve set your environment variables, you can initialize the client like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from dremioframe.client import DremioClient

client = DremioClient()  # reads config from environment
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you prefer to pass credentials explicitly (useful in scripts or when using the Community Edition), you can do:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;client = DremioClient(
    hostname=&amp;quot;localhost&amp;quot;,
    port=32010,
    username=&amp;quot;admin&amp;quot;,
    password=&amp;quot;password123&amp;quot;,
    tls=False  # Set to True if connecting over HTTPS
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This sets up a connection to your Dremio instance using standard authentication.&lt;/p&gt;
&lt;h3&gt;Asynchronous Client&lt;/h3&gt;
&lt;p&gt;If you&apos;re working in an async application (e.g., FastAPI, asyncio notebooks, etc.), dremioframe also supports an async client:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from dremioframe.client import AsyncDremioClient

async with AsyncDremioClient(
    pat=&amp;quot;YOUR_PAT&amp;quot;,
    project_id=&amp;quot;YOUR_PROJECT_ID&amp;quot;
) as client:
    df = await client.table(&amp;quot;Samples.samples.dremio.com.zips.json&amp;quot;) \
                    .select(&amp;quot;city&amp;quot;, &amp;quot;state&amp;quot;) \
                    .limit(5) \
                    .toPandas()
    print(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The async API mirrors the sync one, but allows you to await results in event-driven applications.&lt;/p&gt;
&lt;h2&gt;Running a Pure SQL Query&lt;/h2&gt;
&lt;p&gt;Even though &lt;code&gt;dremioframe&lt;/code&gt; shines with its DataFrame-style interface, you can still execute raw SQL when needed using the &lt;code&gt;.query()&lt;/code&gt; method. This is helpful when you already have a SQL statement or want to run ad hoc queries.&lt;/p&gt;
&lt;p&gt;Here’s a simple example that selects city and state from the sample zips dataset:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;df = client.query(&amp;quot;&amp;quot;&amp;quot;
    SELECT city, state
    FROM Samples.samples.dremio.com.zips.json
    WHERE state = &apos;CA&apos;
    ORDER BY city
    LIMIT 10
&amp;quot;&amp;quot;&amp;quot;)

print(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result is a lightweight wrapper around a Pandas DataFrame, so you can treat it just like any other DataFrame in Python.&lt;/p&gt;
&lt;p&gt;You can also convert it explicitly to a Pandas DataFrame if needed:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;pdf = df.toPandas()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; Dremio optimizes and accelerates this query under the hood, especially when you&apos;re on &lt;a href=&quot;https://drmevn.fyi/am-get-started&quot;&gt;Dremio Cloud&lt;/a&gt;, where features like autonomous reflection caching are automatic and don&apos;t need manual usage.&lt;/p&gt;
&lt;p&gt;If you prefer a hybrid approach, dremioframe allows mixing SQL and DataFrame APIs freely, which we&apos;ll explore next.&lt;/p&gt;
&lt;h2&gt;Querying with &lt;code&gt;.select()&lt;/code&gt; and SQL Functions&lt;/h2&gt;
&lt;p&gt;The real power of &lt;code&gt;dremioframe&lt;/code&gt; comes from its expressive, Pandas-like query builder. You can use &lt;code&gt;.select()&lt;/code&gt; to pick columns and include SQL expressions, just like in raw SQL : but with the clarity and structure of method chaining.&lt;/p&gt;
&lt;p&gt;Let’s say we want to select a few fields and apply a SQL function like &lt;code&gt;UPPER()&lt;/code&gt; to transform the state name:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;df = client.table(&amp;quot;Samples.samples.dremio.com.zips.json&amp;quot;) \
           .select(
               &amp;quot;city&amp;quot;,
               &amp;quot;state&amp;quot;,
               &amp;quot;pop&amp;quot;,
               &amp;quot;UPPER(state) AS state_upper&amp;quot;  # using SQL function
           ) \
           .filter(&amp;quot;pop &amp;gt; 100000&amp;quot;) \
           .limit(10) \
           .collect()

print(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This returns 10 rows where the population is over 100,000 and includes the state_upper column that’s uppercased using Dremio’s SQL engine.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; even though you&apos;re using .select(), these expressions are passed through directly to Dremio and fully optimized as part of the SQL query plan.&lt;/p&gt;
&lt;p&gt;You can freely combine standard column names with SQL functions, aliases, expressions, and computed columns. This lets you build powerful queries without writing SQL directly.&lt;/p&gt;
&lt;p&gt;Want to experiment yourself? Spin up a &lt;a href=&quot;https://drmevn.fyi/am-get-started&quot;&gt;free Dremio Cloud workspace&lt;/a&gt; or try the &lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;Community Edition on your laptop&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Transforming Data with &lt;code&gt;.mutate()&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;While &lt;code&gt;.select()&lt;/code&gt; is great for choosing and computing columns in one go, &lt;code&gt;.mutate()&lt;/code&gt; lets you &lt;strong&gt;add new derived columns&lt;/strong&gt; to an existing selection : much like &lt;code&gt;mutate()&lt;/code&gt; in R or &lt;code&gt;.assign()&lt;/code&gt; in Pandas.&lt;/p&gt;
&lt;p&gt;Let’s take the same query from before and add a new column that calculates population density by dividing population by a fictional land area (just for demo purposes):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;df = client.table(&amp;quot;Samples.samples.dremio.com.zips.json&amp;quot;) \
           .select(&amp;quot;city&amp;quot;, &amp;quot;state&amp;quot;, &amp;quot;pop&amp;quot;) \
           .mutate(
               pop_thousands=&amp;quot;pop / 1000&amp;quot;,               # create a scaled version
               pop_label=&amp;quot;CASE WHEN pop &amp;gt; 100000 THEN &apos;large&apos; ELSE &apos;small&apos; END&amp;quot;
           ) \
           .filter(&amp;quot;state = &apos;TX&apos;&amp;quot;) \
           .limit(10) \
           .collect()

print(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;pop_thousands&lt;/code&gt; is a new numeric column.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;pop_label&lt;/code&gt; is a new string column based on a conditional expression using CASE WHEN.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can pass any SQL-compatible string expressions into .mutate() using column_name=expression syntax. The expressions are compiled into the underlying SQL query, so performance is fully optimized.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; You can chain multiple .mutate() calls if you prefer smaller, incremental steps.&lt;/p&gt;
&lt;p&gt;Try experimenting with your own columns! If you’re using &lt;a href=&quot;https://drmevn.fyi/am-get-started&quot;&gt;Dremio Cloud&lt;/a&gt;, you can test these queries on larger datasets with full query acceleration and Iceberg table support. Or run &lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;Community Edition&lt;/a&gt; locally to follow along with your own data.&lt;/p&gt;
&lt;h2&gt;Building Queries Programmatically with the Function API&lt;/h2&gt;
&lt;p&gt;For more complex or dynamic queries, &lt;code&gt;dremioframe&lt;/code&gt; provides a powerful &lt;strong&gt;function builder API&lt;/strong&gt; through the &lt;code&gt;F&lt;/code&gt; module : similar to how PySpark or dplyr work. This lets you construct expressions programmatically rather than writing raw SQL strings.&lt;/p&gt;
&lt;p&gt;Let’s rewrite the previous example using &lt;code&gt;F&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from dremioframe import F

df = client.table(&amp;quot;Samples.samples.dremio.com.zips.json&amp;quot;) \
           .select(
               F.col(&amp;quot;city&amp;quot;),
               F.col(&amp;quot;state&amp;quot;),
               F.col(&amp;quot;pop&amp;quot;),
               (F.col(&amp;quot;pop&amp;quot;) / 1000).alias(&amp;quot;pop_thousands&amp;quot;),
               F.case()
                 .when(F.col(&amp;quot;pop&amp;quot;) &amp;gt; 100000, F.lit(&amp;quot;large&amp;quot;))
                 .else_(F.lit(&amp;quot;small&amp;quot;))
                 .end()
                 .alias(&amp;quot;pop_label&amp;quot;)
           ) \
           .filter(F.col(&amp;quot;state&amp;quot;) == F.lit(&amp;quot;TX&amp;quot;)) \
           .limit(10) \
           .collect()

print(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;What’s happening here?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;F.col(&amp;quot;column_name&amp;quot;)&lt;/code&gt; references a column.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;F.case().when(...).else_(...).end()&lt;/code&gt; builds a SQL &lt;code&gt;CASE WHEN&lt;/code&gt; expression.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;F.lit(&amp;quot;value&amp;quot;)&lt;/code&gt; injects a literal value into the expression.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Arithmetic operations like / can be done using Python operators.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This method is especially useful when building queries dynamically : for instance, choosing which fields to include or filter based on user input.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; You can mix function objects with standard strings if needed. Just make sure each expression passed to &lt;code&gt;.select()&lt;/code&gt; or &lt;code&gt;.mutate()&lt;/code&gt; is either a string or an &lt;code&gt;F&lt;/code&gt; object.&lt;/p&gt;
&lt;p&gt;Want to try building dynamic queries against Iceberg tables or REST-ingested datasets? Sign up for &lt;a href=&quot;https://drmevn.fyi/am-get-started&quot;&gt;Dremio Cloud&lt;/a&gt; or use &lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;Community Edition&lt;/a&gt; to test these locally.&lt;/p&gt;
&lt;h2&gt;What Else Can &lt;code&gt;dremioframe&lt;/code&gt; Do?&lt;/h2&gt;
&lt;p&gt;By now, you’ve seen how &lt;code&gt;dremioframe&lt;/code&gt; lets you run SQL, build DataFrame-style queries, and programmatically compose logic using expressions. But there’s much more under the hood.&lt;/p&gt;
&lt;p&gt;Here’s a quick overview of some additional capabilities you might find useful:&lt;/p&gt;
&lt;h3&gt;🔄 Joins, Unions, and Time Travel&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Join tables with &lt;code&gt;.join()&lt;/code&gt;, &lt;code&gt;.left_join()&lt;/code&gt;, &lt;code&gt;.right_join()&lt;/code&gt;, or &lt;code&gt;.full_join()&lt;/code&gt; using either SQL expressions or &lt;code&gt;F&lt;/code&gt; functions.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;.union()&lt;/code&gt; to combine rows from two datasets.&lt;/li&gt;
&lt;li&gt;Query historical snapshots of Iceberg tables using &lt;code&gt;.at_snapshot(&amp;quot;SNAPSHOT_ID&amp;quot;)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;df = client.table(&amp;quot;sales&amp;quot;).at_snapshot(&amp;quot;123456789&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;Iceberg time travel is fully supported in Dremio Cloud and Dremio Enterprise.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Ingest External Data&lt;/h3&gt;
&lt;p&gt;You can pull data from REST APIs and ingest it directly into Dremio:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;client.ingest_api(
    url=&amp;quot;https://jsonplaceholder.typicode.com/posts&amp;quot;,
    table_name=&amp;quot;sandbox.api_posts&amp;quot;,
    mode=&amp;quot;merge&amp;quot;
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also insert Pandas DataFrames into Dremio tables using:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;client.table(&amp;quot;sandbox.my_table&amp;quot;).insert(&amp;quot;sandbox.my_table&amp;quot;, data=pd_df)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Analyze, Visualize, and Export&lt;/h3&gt;
&lt;p&gt;Use &lt;code&gt;.group_by()&lt;/code&gt; with aggregates like &lt;code&gt;.sum()&lt;/code&gt;, &lt;code&gt;.count()&lt;/code&gt;, &lt;code&gt;.mean()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Sort with &lt;code&gt;.order_by()&lt;/code&gt;, paginate with &lt;code&gt;.offset()&lt;/code&gt;, and chart using &lt;code&gt;.chart()&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;df.chart(kind=&amp;quot;bar&amp;quot;, x=&amp;quot;state&amp;quot;, y=&amp;quot;pop&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Export results to local files:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;df.to_csv(&amp;quot;output.csv&amp;quot;)
df.to_parquet(&amp;quot;output.parquet&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Data Quality Checks&lt;/h3&gt;
&lt;p&gt;Built-in expectations let you validate your data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;df.quality.expect_not_null(&amp;quot;pop&amp;quot;)
df.quality.expect_column_values_to_be_between(&amp;quot;pop&amp;quot;, min=1, max=1000000)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Admin and Debug Tools&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Create and manage reflections (Dremio&apos;s Unique Acceleration Layer).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Retrieve and inspect job profiles with &lt;code&gt;.get_job_profile()&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use &lt;code&gt;.explain()&lt;/code&gt; to debug SQL plans:&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;df.explain()
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Asynchronous Queries &amp;amp; CLI Access&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Use AsyncDremioClient for non-blocking workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Run queries via the command-line tool dremio-cli.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Pro tip&lt;/strong&gt;: Want to test features like data ingestion, Iceberg catalog browsing, and AI-powered analytics? &lt;a href=&quot;https://drmevn.fyi/am-get-started&quot;&gt;Dremio Cloud’s 30-day trial&lt;/a&gt; gives you full access. For local development, &lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;Community Edition&lt;/a&gt; is a great way to experiment.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;dremioframe&lt;/code&gt; is still evolving, but it&apos;s already a powerful toolkit for Pythonic analytics on top of Dremio’s lakehouse engine. Whether you&apos;re running federated queries, ingesting external APIs, or interacting with Iceberg tables, it helps you stay in the Python world while leveraging all the power of Dremio under the hood.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Whether you&apos;re an analyst who loves the clarity of chained DataFrame operations, or a Python developer looking to integrate Dremio into your data pipelines, &lt;code&gt;dremioframe&lt;/code&gt; offers a compelling, flexible, and powerful interface to Dremio&apos;s lakehouse capabilities.&lt;/p&gt;
&lt;p&gt;With just a few lines of code, you can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Connect securely to Dremio Cloud or Community Edition&lt;/li&gt;
&lt;li&gt;Run raw SQL or chain DataFrame-style queries&lt;/li&gt;
&lt;li&gt;Add computed columns with &lt;code&gt;.mutate()&lt;/code&gt; or build expressions with the &lt;code&gt;F&lt;/code&gt; API&lt;/li&gt;
&lt;li&gt;Work with federated sources, Apache Iceberg tables, and even ingest external data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By using &lt;code&gt;dremioframe&lt;/code&gt;, you get the best of both worlds: the expressiveness of Python and the performance of Dremio’s SQL engine.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Don’t forget : you can &lt;a href=&quot;https://drmevn.fyi/am-get-started&quot;&gt;sign up for a free 30-day trial of Dremio Cloud&lt;/a&gt; to experience all the advanced features like Agentic AI and native support for all Iceberg catalogs.&lt;br&gt;
Or, if you&apos;re experimenting locally, &lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;try Community Edition&lt;/a&gt; to run federated queries and interact with Glue or Nessie-based Iceberg tables.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The &lt;code&gt;dremioframe&lt;/code&gt; project is still evolving, but it’s already a powerful toolkit for building readable, maintainable, and scalable data workflows in Python. Give it a try and let us know what you build.&lt;/p&gt;
&lt;h2&gt;NOTE&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;dremioframe&lt;/code&gt; is an unofficial library and currently in Alpha. Please submit any issues or pull requests to the &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-cloud-dremioframe?tab=readme-ov-file&quot;&gt;git repo&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Comprehensive Hands-on Walk Through of Dremio Cloud Next Gen (Hands-on with Free Trial)</title><link>https://iceberglakehouse.com/posts/2025-11-dremio-next-gen-cloud-tutorial/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-11-dremio-next-gen-cloud-tutorial/</guid><description>
[Video Playlist of this Walkthough](https://www.youtube.com/playlist?list=PL-gIUf9e9CCvY0bcRBGu2SzFFR-yJGIB6)

On November 13, at the [Subsurface Lak...</description><pubDate>Wed, 12 Nov 2025 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/playlist?list=PL-gIUf9e9CCvY0bcRBGu2SzFFR-yJGIB6&quot;&gt;Video Playlist of this Walkthough&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;On November 13, at the &lt;a href=&quot;https://www.dremio.com/subsurface?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=nextgencloudtut&amp;amp;utm_content=alexmerced&quot;&gt;Subsurface Lakehouse Conference&lt;/a&gt; in New York City, Dremio announced and released &lt;a href=&quot;https://www.dremio.com/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=nextgencloudtut&amp;amp;utm_content=alexmerced&quot;&gt;Dremio Next Gen Cloud&lt;/a&gt;, the most complete and accessible version of its Lakehouse Platform to date. This release advances Dremio’s mission to make data lakehouses easy, fast, and affordable for organizations of any size.&lt;/p&gt;
&lt;p&gt;This tutorial offers a hands-on introduction to Dremio and walks through the new free trial experience. With managed storage and no need to connect your own infrastructure or enter a credit card (until you want to), you can explore the full platform, including new AI features, Autonomous Performance Management, and the integrated lakehouse catalog, right away.&lt;/p&gt;
&lt;h2&gt;What is Dremio?&lt;/h2&gt;
&lt;p&gt;Dremio is a Data Lakehouse Platform for the AI Era, let&apos;s explore what this means.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What is a Data Lakehouse?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A data lakehouse is an architecture that uses your data lake (object storage or Hadoop) as the primary data store for flexibility and openness, then adds two layers to operationalize it like a data warehouse:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A table format such as Apache Iceberg, Delta Lake, Apache Hudi, or Apache Paimon. These formats allow structured datasets stored in Apache Parquet files to be treated as individual tables with ACID guarantees, snapshot isolation, time travel, and more - rather than just a collection of files without these capabilities.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A lakehouse catalog that tracks your lakehouse tables and other assets. It serves as the central access point for data discovery and access control.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dremio is designed to unify these modular lakehouse components into a seamless experience. Unlike platforms that treat Iceberg as an add-on to proprietary formats, Dremio is built to be natively Iceberg-first, delivering a warehouse-like experience without vendor lock-in.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Challenges of the Data Lakehouse&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;While lakehouses offer the benefit of serving as a central source of truth across tools, they come with practical challenges during implementation.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;How do you make your lakehouse work alongside other data that isn’t yet in the lakehouse?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;How do you optimize storage as data files accumulate and become inefficient after multiple updates and snapshots?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Which catalog should you use, and how do you deploy and maintain it for secure, governed access?&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;How Dremio’s Platform Supports the Lakehouse&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Dremio simplifies many of these challenges with a platform that makes your lakehouse feel like it “just works.” It does this through several powerful features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Query Federation&lt;/strong&gt;: Dremio is one of the fastest engines for Apache Iceberg queries, but it also connects to and queries other databases, data lakes, data warehouses, and catalogs efficiently. This means you can start using Dremio with your existing data infrastructure and transition to a full lakehouse setup over time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Integrated Catalog&lt;/strong&gt;: Dremio includes a built-in Iceberg catalog, ready to use from day one. This catalog:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Is based on Apache Polaris, the community-led standard for lakehouse catalogs&lt;/li&gt;
&lt;li&gt;Automatically optimizes Iceberg table storage, eliminating manual tuning&lt;/li&gt;
&lt;li&gt;Provides governance for both Iceberg tables and SQL views with role-based and fine-grained access controls&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;End-to-End Performance Management&lt;/strong&gt;: Managing query performance can be time-consuming. Dremio reduces this burden by automatically clustering Iceberg tables and applying multiple layers of caching. One key feature is Autonomous Reflections, which accelerate queries behind the scenes based on actual usage patterns, improving performance before users even notice a problem.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Semantic and Context Layer&lt;/strong&gt;: Dremio includes a built-in semantic layer where you can define business concepts using SQL views, track lineage, and write documentation. This structure not only supports consistent usage across teams but also provides valuable context to AI systems for more accurate analysis.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;AI-Native Features&lt;/strong&gt;: Dremio Next Gen Cloud includes a built-in AI agent that can run queries, generate documentation, and create visualizations. For external AI systems, the MCP server gives agents access to both data and semantics. New AI functions also let you work with unstructured data for expanded analytical possibilities.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dremio aims to provide a familiar and easy SQL interface to all your data.&lt;/p&gt;
&lt;h2&gt;Registering For Dremio Trial&lt;/h2&gt;
&lt;p&gt;To get started with your Dremio Trial, head over to the &lt;a href=&quot;https://www.dremio.com/get-started/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=nextgencloudtut&amp;amp;utm_content=alexmerced&quot;&gt;Getting Started Page&lt;/a&gt; and create a new account with your preferred method.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/ut2jBNf.png&quot; alt=&quot;Getting Started Page with Dremio&quot;&gt;&lt;/p&gt;
&lt;p&gt;If using Google/Microsoft/Github you&apos;ll be all right up after authenticating, if signing up with your email you&apos;ll get an email to confirm your registration.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/fnzehGU.png&quot; alt=&quot;Confirmation Email&quot;&gt;&lt;/p&gt;
&lt;p&gt;When you create a new Dremio account, it automatically creates a new &lt;code&gt;Organization&lt;/code&gt;, which can contain multiple &lt;code&gt;Projects&lt;/code&gt;. The organization will be assigned a default name, which you can change later.&lt;/p&gt;
&lt;p&gt;On the next screen, you’ll name your first project. This initial project will use Dremio’s managed storage as the default storage for the lakehouse catalog.&lt;/p&gt;
&lt;p&gt;If you prefer to use your own data lake as catalog storage, you can create a new project when you&apos;re ready. Currently, only Amazon S3 is supported for custom catalog storage, with additional options coming soon.&lt;/p&gt;
&lt;p&gt;Even though S3 is the only supported option for Dremio Catalog storage at the moment, Dremio still allows you to connect to other Iceberg catalogs backed by any cloud storage solution and data lakes using its wide range of source connectors.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/fnzehGU.png&quot; alt=&quot;Choosing your Dremio Region and Project Name&quot;&gt;&lt;/p&gt;
&lt;p&gt;Now you&apos;ll be on your Dremio Dashboard where you&apos;ll wait a few minutes for your organization to be provisioned.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/JlPAxrI.png&quot; alt=&quot;Provisioning of Dremio Project&quot;&gt;&lt;/p&gt;
&lt;p&gt;Once the environment is provisioned you&apos;ll see several options including a chat box to work with the new integrated Dremio AI Agent which we will revisit later in this tutorial.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/lO0aeGl.png&quot; alt=&quot;The Dremio environment is now active!&quot;&gt;&lt;/p&gt;
&lt;p&gt;One of the best ways to get started is by adding data to Dremio by clicking &lt;code&gt;add data&lt;/code&gt; which will open a window where you can either:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Connect an existing Database, Data Lake, Data Warehouse or Data Lakehouse catalog to begin querying data you have in other platforms&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Upload a CSV, JSON or PARQUET file which will convert the file into an Iceberg table in Dremio Catalog for you to be able to query.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If looking for some sample files to upload, &lt;a href=&quot;https://www.kaggle.com/&quot;&gt;Kaggle&lt;/a&gt; is always a good place to find some datasets to play with.&lt;/p&gt;
&lt;p&gt;Although for this tutorial let&apos;s use SQL to create tables in Dremio Catalog, insert records into those tables and then query them.&lt;/p&gt;
&lt;h2&gt;Curating Your Lakehouse&lt;/h2&gt;
&lt;p&gt;Let&apos;s go visit the data explorer to show you how you will navigate your integrated catalog and other datasources. Click on the second icon from the top in the menu that is to the left of the screen that looks like a table, this will take you to the dataset explorer.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/iBnA0TP.png&quot; alt=&quot;Dremio&apos;s Navigation Menu&quot;&gt;&lt;/p&gt;
&lt;p&gt;In the dataset explorer you&apos;ll see two sections:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Namespaces&lt;/strong&gt;: This is the native Apache Polaris based catalog that belongs to your project. You create namespaces as top-level folders to organize your Apache Iceberg Tables and SQL Views (view SQL can refer to Iceberg and Non-Iceberg datasets, like joining an Iceberg table and a Snowflake Table).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sources&lt;/strong&gt;: These are the other sources you&apos;ve connected to Dremio using Dremio&apos;s connectors. You can open up a source and see all the tables available inside of it from the dataset explorer.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Please click on the plus sign next to &amp;quot;namespaces&amp;quot; and add a new namespace called &amp;quot;dremio&amp;quot; this will be necessary to run some SQL scripts I give you without needing to modify them.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/OSoBEP0.png&quot; alt=&quot;Adding a new namespace&quot;&gt;&lt;/p&gt;
&lt;p&gt;Now you&apos;ll see the new &lt;code&gt;dremio&lt;/code&gt; namespace and in their we can create new tables and views. You may notice there is already a sample data namespace which includes a variety of sample data you can use to experiment with if you want.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/3PfTjp1.png&quot; alt=&quot;The new namespace has been added to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Running Some SQL&lt;/h2&gt;
&lt;p&gt;Now head over to the the &amp;quot;SQL Runner&amp;quot; a full SQL IDE built right into the Dremio experience which includes autocomplete, syntax highlighting, function lookup, the typical IDE shortcuts and much more. It is accessed by clicking the third menu icon which looks like a mini terminal window.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/lKIFs6c.png&quot; alt=&quot;The Dremio SQL Runner&quot;&gt;&lt;/p&gt;
&lt;p&gt;Let me call out a few things to your attention:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;On the left you&apos;ll notice a column where you can browse available datasets you can use this drag dataset names or column names into your queries so you don&apos;t have to type things out everytime.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You&apos;ll notice this column has a second tab called scripts, you can save the SQL in any tab as a script you can comeback to later, great for template scripts or scripts you haven&apos;t finished yet.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The SQL editor is on top and the results viewer is on the bottom, if you run multiple SQL statements the results viewer will give you a tab for the result of each query run making it easy to isolate the results of different parts of your script.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There is a lot more to learn about the SQL runner, but let&apos;s go ahead and run some SQL. I&apos;ve written several SQL scripts you can copy into the SQL Runner and run as is. Choose any of the below and copy them into SQL runner and run the SQL. Give the code a look over, comments in the code help explain what it is doing.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/developer-advocacy-dremio/apache-iceberg-lakehouse-workshop/blob/main/industry-examples/finance_example.sql&quot;&gt;Finance Example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/developer-advocacy-dremio/apache-iceberg-lakehouse-workshop/blob/main/industry-examples/gov_example.sql&quot;&gt;Government Example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/developer-advocacy-dremio/apache-iceberg-lakehouse-workshop/blob/main/industry-examples/healthcare_example.sql&quot;&gt;Healthcare Example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/developer-advocacy-dremio/apache-iceberg-lakehouse-workshop/blob/main/industry-examples/insurance_example.sql&quot;&gt;Government Example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/developer-advocacy-dremio/apache-iceberg-lakehouse-workshop/blob/main/industry-examples/manufacturing.sql&quot;&gt;Manufacturing Example with Data Health Checks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/developer-advocacy-dremio/apache-iceberg-lakehouse-workshop/blob/main/industry-examples/retail.sql&quot;&gt;Retail Example with Physical Transformations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/developer-advocacy-dremio/apache-iceberg-lakehouse-workshop/blob/main/industry-examples/supply_chain_example.sql&quot;&gt;Supply Chain Example&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The SQL for the majority of these examples follow a similar pattern:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create sub folder for the example&lt;/li&gt;
&lt;li&gt;Create a bronze/silver/gold subfolder in that subfolder&lt;/li&gt;
&lt;li&gt;Create and insert data in the base tables in the raw folder&lt;/li&gt;
&lt;li&gt;Join and model using a SQL view to create the silver layer&lt;/li&gt;
&lt;li&gt;Create a use case specific views from the silver view in the gold layer&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This represents a very typical way of using Dremio where you model your datasets not by replicating your data but logically with SQL views. Dremio&apos;s autonomous reflections feature will see how these views are queried and dynamically determine what views should be materialized into the Dremio&apos;s reflection cache without anyone having to lift a finger keeping everything performance on storage and compute usage. A data engineer could also manually trigger the creation of a reflection and Dremio will assign that reflection a score to help understand whether the reflection is providing value or not, we&apos;ll show this when we got over Dremio&apos;s settings UI.&lt;/p&gt;
&lt;p&gt;Two of the examples do something a little different:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The Manufacturing example uses SQL views to create a medallion architecture, but it then also creates health check views which allow you to get a quick glance at the health of a dataset by aggregating nulls, duplicates, etc.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The retail example shows you what it may look like if you don&apos;t want to model your data logically but physically using CTAS (Create Table As) statements to materialize the data as physical tables through the bronze/silver/gold layers.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;AI Visualizations&lt;/h2&gt;
&lt;p&gt;Now that you&apos;ve added some data to your lakehouse but running some of my example scripts, let&apos;s show you some of the benefits of Dremio&apos;s new integrated AI Agent.&lt;/p&gt;
&lt;p&gt;While if you are planning to build long lived BI dashboards you should still leverage integrations most BI tools like Tableau and Power BI have with Dremio, but if you need to just seem something visual quick or just need something to screenshot into a presentation the AI Agent can be quite helpful.&lt;/p&gt;
&lt;p&gt;So let&apos;s say I run the supply chain example, I could prompt the AI Agent with a prompt like:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Can you create a bar chart based on the data in the gold layer of my supply chain data&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;The AI Agent will pull the list of tables your particular user has access to and see if it finds the dataset I loosely referred to. It will then pull the schema and other data from the semantic layer to better understand the dataset before deciding how to best visualize it. But of course my prompt was vague and the AI agent decides to confirm some details with me:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;I can create that bar chart : quick clarifying questions before I run the query
I inspected your gold layer and found the view dremio.supply_chain.gold.daily_supplier_delivery_kpis with these columns:

supplier_id (string)
receipt_date (date)
lines_received (int)
otif_rate (double)
on_time_rate (double)
in_full_rate (double)
total_received_qty (int)
total_order_qty (int)
gross_fill_rate (decimal)
To produce the bar chart I need to know which metric and timeframe you want visualized. Common options:

Average OTIF per supplier (bar per supplier showing average “otif_rate” across the dataset)
Average gross fill rate per supplier (bar per supplier showing average “gross_fill_rate”)
Total lines received per supplier for a specific date or date range (sum of “lines_received”)
Side-by-side bars per supplier showing two metrics (e.g., OTIF and gross_fill_rate) : I can produce grouped bars if you confirm.
Suggested defaults (if you don’t specify):

Metric: avg(otif_rate)
Timeframe: all available dates
Chart: vertical bar, supplier on x-axis, metric on y-axis, ordered by metric desc
If that works, I will run this SQL against the gold view and build the bar chart:

SELECT “supplier_id”, AVG(“otif_rate”) AS “avg_otif_rate”, AVG(“gross_fill_rate”) AS “avg_gross_fill_rate” FROM “first-project”.“dremio”.“supply_chain”.“gold”.“daily_supplier_delivery_kpis” : optionally add: WHERE “receipt_date” BETWEEN DATE ‘2025-08-01’ AND DATE ‘2025-08-31’ GROUP BY “supplier_id” ORDER BY “avg_otif_rate” DESC

Please confirm:

Which metric? (otif_rate, gross_fill_rate, lines_received, total_received_qty, etc.)
Timeframe (all data, a date, or a date range). If you say a month/week/day, I will assume the current year unless you specify otherwise.
Chart type (single bar, grouped bars for two metrics).
After you confirm I will run the query and produce the bar chart for you.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I respond with&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;1. Average OTIF per supplier
2. All data
3. Vertical Bar
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It then explains to the me it&apos;s thought process to illustrate where it needs more detail, I notice at this point it is including the project name &lt;code&gt;first-project&lt;/code&gt; in the table name which is not necessary as this is assumed so it should start with the namespace &lt;code&gt;dremio&lt;/code&gt;. I tell it this and get back my visualization.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/KeHWL8h.png&quot; alt=&quot;Generating a Visualization with the Dremio AI Agent&quot;&gt;&lt;/p&gt;
&lt;p&gt;We found success starting with quite a vague question but working with the AI we were able to get a visualization of a useful metric within a few minutes.&lt;/p&gt;
&lt;h2&gt;AI Function&lt;/h2&gt;
&lt;p&gt;Using your data to create visualization isn&apos;t the only cool AI integration in the Dremio Arsenal. Dremio also has added a variety of new SQL AI Functions which allow you to do a variety of things like turn unstructured data into structured data. Let&apos;s see a very simple example you can run right in your SQL runner assuming you have a &lt;code&gt;dremio&lt;/code&gt; namespace.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Create the recipes table with an ARRAY column for ingredients (sample rows)
-- Note: this uses CREATE TABLE AS SELECT to create a physical table with sample data.
CREATE FOLDER IF NOT EXISTS dremio.recipes;
CREATE TABLE IF NOT EXISTS dremio.recipes.recipes AS
SELECT 1 AS &amp;quot;id&amp;quot;,
       &apos;Mild Salsa&apos; AS &amp;quot;name&amp;quot;,
       ARRAY[&apos;tomato&apos;,&apos;onion&apos;,&apos;cilantro&apos;,&apos;jalapeno&apos;,&apos;lime&apos;] AS &amp;quot;ingredients&amp;quot;,
       CURRENT_TIMESTAMP AS &amp;quot;created_at&amp;quot;
UNION ALL
SELECT 2, &apos;Medium Chili&apos;, ARRAY[&apos;beef&apos;,&apos;tomato&apos;,&apos;onion&apos;,&apos;chili powder&apos;,&apos;cumin&apos;,&apos;jalapeno&apos;], CURRENT_TIMESTAMP
UNION ALL
SELECT 3, &apos;Spicy Vindaloo&apos;, ARRAY[&apos;chicken&apos;,&apos;chili&apos;,&apos;ginger&apos;,&apos;garlic&apos;,&apos;vinegar&apos;,&apos;habanero&apos;], CURRENT_TIMESTAMP;

-- Create View where AI is used to classify each recipe as Mild, Medium or Spicy
CREATE OR REPLACE VIEW dremio.recipes.recipes_enhanced AS SELECT id,
       name,
       ingredients,
       AI_CLASSIFY(&apos;Identify the Spice Level:&apos; || ARRAY_TO_STRING(ingredients, &apos;,&apos;), ARRAY [ &apos;mild&apos;, &apos;medium&apos;, &apos;spicy&apos; ]) AS spice_level
from   dremio.recipes.recipes;
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The First SQL statement creates a table of recipes where the ingredients are an array of strings&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The Second SQL statement create a view where we use the AI_CLASSIFY function to prompt the AI given the ingredients whether the recipe is &lt;code&gt;mild&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt; or &lt;code&gt;spicy&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/PiJGMmF.png&quot; alt=&quot;The Dremio AI Functions&quot;&gt;&lt;/p&gt;
&lt;p&gt;With these AI functions you can also use it pull data from JSON files or folders of images to generate structured datasets. Imagine taking a folder of scans of paper applications and turning them into an iceberg table with all the right fields by having the AI scan these images, this is kind of use case made possible by these functions.&lt;/p&gt;
&lt;h2&gt;Dremio Jobs Pane&lt;/h2&gt;
&lt;p&gt;Want to see what queries are coming or investigate deeper why a query may have failed or taken longer than expecting, the Dremio job pane which is the next option on the left menu will allow you see all your jobs and then click on them to see exaustive detail on how they were processed.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/EhD18PE.png&quot; alt=&quot;Dremio Job Pane&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Dremio Settings&lt;/h2&gt;
&lt;p&gt;If you click on the last menu item, the gear, you&apos;ll get two options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Project Settings&lt;/li&gt;
&lt;li&gt;Org Settings&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Project Settings&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/YCfoaMz.png&quot; alt=&quot;Dremio Project Settings&quot;&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You can find project info like:
&lt;ul&gt;
&lt;li&gt;project name and id (project names are fixed, org names can change)&lt;/li&gt;
&lt;li&gt;MCP server url to connect your external AI agent to leverage your Dremio instance&lt;/li&gt;
&lt;li&gt;JDBC url to connect to Dremio using external JDBC clients and in custom scripts&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; SQL can be sent to Dremio for execution outside of Dremio&apos;s UI using JDBC, ODBC, Apache Arrow Flight and Dremio&apos;s REST API. Refer to docs.dremio.com for documentation on how to leverage these interfaces.&lt;/p&gt;
&lt;p&gt;Also in project settings you&apos;ll find sections like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Catalog: To update catalog settings like how often metadata refresh should happen. (&lt;strong&gt;Note:&lt;/strong&gt; Lineage view is based on the metadata as of the last refresh so if something isn&apos;t reflected in lineage the metadata may not have refreshed yet. You can either make the metadata refresh more frequently or wait till it refreshes on its current schedule.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Engines: For managing your different Dremio execution engines, what is their size, when they should spin up and when they should spin down&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;BI Tools: Enable or disable Tableu and Power BI buttons&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Monitor: Dashboard to monitor Dremio project health&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Reflections: See scores on reflections you have created, you can also delete reflections if you no longer need them from here&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Engine Routing: Create rules for which jobs should go to which engines, for example jobs from certain users may be routed to their own engine which is tracked for charge backs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Preferences: Turn on and off certain Dremio features&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Organization Settings&lt;/h3&gt;
&lt;p&gt;Under Organizations settings you&apos;ll find:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;name of org, which can be changed&lt;/li&gt;
&lt;li&gt;Manage authentication protocols&lt;/li&gt;
&lt;li&gt;Manage projects, users, and roles&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;User Settings&lt;/h3&gt;
&lt;p&gt;At the very bottom left corner there is a button to see settings for the individual user. The Main use for this is to change to dark/light more and to create PAT tokens for authenticating external clients.&lt;/p&gt;
&lt;h2&gt;Granting Access&lt;/h2&gt;
&lt;p&gt;Once you create new non-admin users in your Dremio org, they&apos;ll have zero access to anything so you&apos;ll need to give them precise access to particular projects, namespaces, folders, sources etc.&lt;/p&gt;
&lt;p&gt;While you can do this for an individual user, it will likely be easier to create &amp;quot;roles&amp;quot; you can grant access to groups of users with. Below is the example of the kind of SQL you may use to grant access to a single namespace for a new user.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- Give Permissions to project
GRANT SELECT, VIEW REFLECTION, VIEW JOB HISTORY, USAGE, MONITOR,
       CREATE TABLE, INSERT, UPDATE, DELETE, DROP, ALTER, EXTERNAL QUERY, ALTER REFLECTION, OPERATE
ON PROJECT
TO USER &amp;quot;alphatest2user@alexmerced.com&amp;quot;;

-- Give Permissions to Namespace in Catalog
GRANT ALTER, USAGE, SELECT, WRITE, DROP on FOLDER &amp;quot;dremio&amp;quot; to USER &amp;quot;alphatest2user@alexmerced.com&amp;quot;;

-- Give Permissions to a Folder in the namespace
GRANT ALTER, USAGE, SELECT, WRITE, DROP on FOLDER dremio.recipes to USER &amp;quot;alphatest2user@alexmerced.com&amp;quot;;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Connecting your Dremio Catalog to Other Engines Like Spark&lt;/h2&gt;
&lt;p&gt;Now you can connect to the Dremio Platform using JDBC/ODBC/ADBC-Flight/REST and send SQL to Dremio for Dremio to execute which I hope you take full advantage of. Although, sometimes you are sharing a dataset in your catalog with someone else who wants to use their preferred compute tool. Dremio Catalog bein Apache Polaris based supports the Apache Iceberg REST Catalog SPEC meaning it can connect to pretty much to any Apache Iceberg supporting tool. Below is an example of how you&apos;d connect in Spark.&lt;/p&gt;
&lt;p&gt;Run a local spark envrionment using the following command:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;docker run -p 8888:8888 -e DREMIO_PAT={YOUR PAT TOKEN} alexmerced/spark35nb:latest&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Then use the following code to run spark code against Dremio Catalog (keep in mind the CATALOG_NAME variable should match your project name).&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import os
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row

# Fetch Dremio base URL and PAT from environment variables
DREMIO_CATALOG_URI = &amp;quot;https://catalog.dremio.cloud/api/iceberg&amp;quot;
DREMIO_AUTH_URI = &amp;quot;https://login.dremio.cloud/oauth/token&amp;quot;
DREMIO_PAT = os.environ.get(&apos;DREMIO_PAT&apos;)
CATALOG_NAME = &amp;quot;first-project&amp;quot; # should be project name

if not DREMIO_CATALOG_URI or not CATALOG_NAME or not DREMIO_AUTH_URI or not DREMIO_PAT:
    raise ValueError(&amp;quot;Please set environment variables DREMIO_CATALOG_URI, DREMIO_AUTH_URI and DREMIO_PAT.&amp;quot;)

# Configure Spark session with Iceberg and Dremio catalog settings
conf = (
    pyspark.SparkConf()
        .setAppName(&apos;DremioIcebergSparkApp&apos;)
        # Required external packages For FILEIO (org.apache.iceberg:iceberg-azure-bundle:1.9.2, org.apache.iceberg:iceberg-aws-bundle:1.9.2, org.apache.iceberg:iceberg-azure-bundle:1.9.2, org.apache.iceberg:iceberg-gcp-bundle:1.9.2)
        .set(&apos;spark.jars.packages&apos;, &apos;org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.2,com.dremio.iceberg.authmgr:authmgr-oauth2-runtime:0.0.5,org.apache.iceberg:iceberg-aws-bundle:1.9.2&apos;)
        # Enable Iceberg Spark extensions
        .set(&apos;spark.sql.extensions&apos;, &apos;org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions&apos;)
        # Define Dremio catalog configuration using RESTCatalog
        .set(&apos;spark.sql.catalog.dremio&apos;, &apos;org.apache.iceberg.spark.SparkCatalog&apos;)
        .set(&apos;spark.sql.catalog.dremio.catalog-impl&apos;, &apos;org.apache.iceberg.rest.RESTCatalog&apos;)
        .set(&apos;spark.sql.catalog.dremio.uri&apos;, DREMIO_CATALOG_URI)
        .set(&apos;spark.sql.catalog.dremio.warehouse&apos;, CATALOG_NAME)  # Not used but required by Spark
        .set(&apos;spark.sql.catalog.dremio.cache-enabled&apos;, &apos;false&apos;)
        .set(&apos;spark.sql.catalog.dremio.header.X-Iceberg-Access-Delegation&apos;, &apos;vended-credentials&apos;)
        # Configure OAuth2 authentication using PAT
        .set(&apos;spark.sql.catalog.dremio.rest.auth.type&apos;, &apos;com.dremio.iceberg.authmgr.oauth2.OAuth2Manager&apos;)
        .set(&apos;spark.sql.catalog.dremio.rest.auth.oauth2.token-endpoint&apos;, DREMIO_AUTH_URI)
        .set(&apos;spark.sql.catalog.dremio.rest.auth.oauth2.grant-type&apos;, &apos;token_exchange&apos;)
        .set(&apos;spark.sql.catalog.dremio.rest.auth.oauth2.client-id&apos;, &apos;dremio&apos;)
        .set(&apos;spark.sql.catalog.dremio.rest.auth.oauth2.scope&apos;, &apos;dremio.all&apos;)
        .set(&apos;spark.sql.catalog.dremio.rest.auth.oauth2.token-exchange.subject-token&apos;, DREMIO_PAT)
        .set(&apos;spark.sql.catalog.dremio.rest.auth.oauth2.token-exchange.subject-token-type&apos;, &apos;urn:ietf:params:oauth:token-type:dremio:personal-access-token&apos;)
)

# Initialize Spark session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
print(&amp;quot;✅ Spark session connected to Dremio Catalog.&amp;quot;)

# Step 1: Create a namespace (schema) in the Dremio catalog
spark.sql(&amp;quot;CREATE NAMESPACE IF NOT EXISTS dremio.db&amp;quot;)
# spark.sql(&amp;quot;CREATE NAMESPACE IF NOT EXISTS dremio.db.test1&amp;quot;)
print(&amp;quot;✅ Namespaces Created&amp;quot;)

# Step 2: Create sample Iceberg tables in the Dremio catalog
spark.sql(&amp;quot;&amp;quot;&amp;quot;
CREATE TABLE IF NOT EXISTS dremio.db.customers (
    id INT,
    name STRING,
    email STRING
)
USING iceberg
&amp;quot;&amp;quot;&amp;quot;)

spark.sql(&amp;quot;&amp;quot;&amp;quot;
CREATE TABLE IF NOT EXISTS dremio.db.orders (
    order_id INT,
    customer_id INT,
    amount DOUBLE
)
USING iceberg
&amp;quot;&amp;quot;&amp;quot;)

print(&amp;quot;✅ Tables Created&amp;quot;)

# Step 3: Insert sample data into the tables
customers_data = [
    Row(id=1, name=&amp;quot;Alice&amp;quot;, email=&amp;quot;alice@example.com&amp;quot;),
    Row(id=2, name=&amp;quot;Bob&amp;quot;, email=&amp;quot;bob@example.com&amp;quot;)
]

orders_data = [
    Row(order_id=101, customer_id=1, amount=250.50),
    Row(order_id=102, customer_id=2, amount=99.99)
]

print(&amp;quot;✅ Dataframes Generated&amp;quot;)

customers_df = spark.createDataFrame(customers_data)
orders_df = spark.createDataFrame(orders_data)

customers_df.writeTo(&amp;quot;dremio.db.customers&amp;quot;).append()
orders_df.writeTo(&amp;quot;dremio.db.orders&amp;quot;).append()

print(&amp;quot;✅ Tables created and sample data inserted.&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Dremio Next Gen Cloud represents a major leap forward in making the data lakehouse experience seamless, powerful, and accessible. Whether you&apos;re just beginning your lakehouse journey or modernizing a complex data environment, Dremio gives you the tools to work faster and smarter - with native Apache Iceberg support, AI-powered features, and a fully integrated catalog.&lt;/p&gt;
&lt;p&gt;From federated queries across diverse sources to autonomous performance tuning, Dremio abstracts away the operational headaches so you can focus on delivering insights. And with built-in AI capabilities, you&apos;re not just managing data - you’re unlocking its full potential.&lt;/p&gt;
&lt;p&gt;If you haven’t already, &lt;a href=&quot;https://www.dremio.com/get-started/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=nextgencloudtut&amp;amp;utm_content=alexmerced&quot;&gt;sign up for your free trial&lt;/a&gt; and start building your lakehouse: no infrastructure or credit card required.&lt;/p&gt;
&lt;p&gt;The next generation of analytics is here. Time to explore what’s possible.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>2025-2026 Guide to Learning about Apache Iceberg, Data Lakehouse &amp; Agentic AI</title><link>https://iceberglakehouse.com/posts/2025-10-2026-guide-to-learning-lakehouse-iceberg-agentic-ai/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-10-2026-guide-to-learning-lakehouse-iceberg-agentic-ai/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2025-10-2026-guide-to-learning-lakehouse...</description><pubDate>Thu, 23 Oct 2025 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2025-10-2026-guide-to-learning-lakehouse-iceberg-agentic-ai/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The data world is evolving fast. Just a few years ago, building a modern analytics stack meant stitching together tools, ETL pipelines, and compromises. Today, open standards like Apache Iceberg, modular architectures like the data lakehouse, and emerging patterns like Agentic AI are reshaping how teams store, manage, and use data.&lt;/p&gt;
&lt;p&gt;But with all this innovation comes one challenge: where do you start?&lt;/p&gt;
&lt;p&gt;This guide was created to answer that question. Whether you&apos;re a data engineer exploring the Iceberg table format, an architect building a lakehouse, or a developer curious about AI agents that interact with real-time data, this resource will walk you through it. No hype. No fluff. Just a curated directory of the best learning paths, tools, and concepts to help you build a practical foundation.&lt;/p&gt;
&lt;p&gt;We will break down the links into many categories to help you find what you are looking for. There is more content beyond what I have listed here and here are two good directories where you can find more content to explore.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Blog Directories&lt;/strong&gt;
We will break down the links into many categories to help you find what you are looking for. There is more content beyond what I have listed here and here are two good directories where you can find more content to explore.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://developer.dremio.com&quot;&gt;Dremio Developer Hub which has an OSS Blogroll&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;The Lakehouse Blog Directory&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;
I&apos;ve had the honor of getting to participate in some long form written content around the lakehouse, make of these you can get for free, links below:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Defintive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.puppygraph.com/ebooks/apache-iceberg-digest-vol-1&quot;&gt;The Apache Iceberg Digest: Vol. 1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Community:&lt;/strong&gt;
Below are some links where you can network with other lakehouse enthusiasts and discover lakehouse conferences and meetups near you!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://osscommunity.com&quot;&gt;OSS Community Listings&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;The Data Lakehouse&lt;/h2&gt;
&lt;p&gt;The idea behind a data lakehouse is simple: keep the flexibility of a data lake, add the performance and structure of a warehouse, and make it all accessible from one place. But turning that idea into a working architecture takes more than just buzzwords. In this section, you&apos;ll find tutorials, architectural guides, and practical walkthroughs that explain how lakehouses work, when they make sense, and how to get started, whether you’re running everything on object storage or looking to unify data access across teams and tools.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://amdatalakehouse.substack.com/p/the-2025-and-2026-ultimate-guide?r=h4f8p&quot;&gt;2026 Guide to the Data Lakehouse Ecosystem&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/looking-back-the-last-year-in-lakehouse-oss-advances-in-apache-arrow-iceberg-polaris-incubating/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Looking back the last year in Lakehouse OSS: Advances in Apache Arrow, Iceberg &amp;amp; Polaris (incubating)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/scaling-data-lakes-moving-from-raw-parquet-to-iceberg-lakehouses/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Scaling Data Lakes: Moving from Raw Parquet to Iceberg Lakehouses&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/5-ways-dremio-makes-apache-iceberg-lakehouses-easy/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;5 Ways Dremio Makes Apache Iceberg Lakehouses Easy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Guide to Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Apache Iceberg&lt;/h2&gt;
&lt;p&gt;Apache Iceberg is the table format that makes data lakehouses actually work. It brings support for ACID transactions, schema evolution, time travel, and scalable performance to your cloud storage, without locking you into a vendor or engine. If you’ve ever wrestled with Hive tables or brittle partitioning logic, this section is for you. Here, you&apos;ll find beginner-friendly resources, deep dives into metadata and catalogs, and hands-on guides for working with Iceberg using engines like Spark, Flink, and Dremio.&lt;/p&gt;
&lt;h3&gt;What are Lakehouse Open Table Formats&lt;/h3&gt;
&lt;p&gt;Table formats are the backbone of the modern lakehouse. They define how data files are organized, versioned, and transacted, bringing warehouse‑level reliability to open storage. This section explores what makes formats like Apache Iceberg, Delta Lake, and Apache Hudi so important. You’ll learn how they handle schema evolution, partitioning, and ACID transactions while staying engine‑agnostic, ensuring your data remains open, performant, and ready for any workload.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;What is a Data Lakehouse Table Format&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://amdatalakehouse.substack.com/p/the-ultimate-guide-to-open-table?r=h4f8p&quot;&gt;Ultimate Guide to Open Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/exploring-the-architecture-of-apache-iceberg-delta-lake-and-apache-hudi/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Exploring the Architecture of Apache Iceberg, Delta Lake, and Apache Hudi&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Apache Iceberg Tutorials&lt;/h3&gt;
&lt;p&gt;Getting hands‑on is the fastest way to learn Apache Iceberg. In these tutorials, you’ll spin up local environments, run your first SQL commands, and connect Iceberg tables with catalogs like Apache Polaris or engines like Spark and Dremio. Each guide walks you through setup, basic operations, and troubleshooting so you can move from theory to practice without friction.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://amdatalakehouse.substack.com/p/tutorial-intro-to-apache-iceberg?r=h4f8p&quot;&gt;Intro to Iceberg with Apache Spark, Apache Polaris &amp;amp; Minio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/try-apache-polaris-incubating-on-your-laptop-with-minio/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Try Apache Polaris (incubating) on Your Laptop with Minio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Intro to Dremio, Nessie, and Apache Iceberg on Your Laptop&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Iceberg Migration Tooling and Ingestion&lt;/h3&gt;
&lt;p&gt;Moving existing datasets into Apache Iceberg doesn’t have to be painful. This section highlights migration patterns, ingestion tools, and automation workflows that make it easier to adopt Iceberg at scale. You’ll find step‑by‑step resources covering snapshot‑based migrations, bulk ingests, and hybrid models that help teams modernize data lakes while minimizing downtime and duplication.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/migration-guide-for-apache-iceberg-lakehouses/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Migration Guide for Apache Iceberg Lakehouses&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/8-tools-for-ingesting-data-into-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;8 Tools For Ingesting Data Into Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Iceberg Catalogs&lt;/h3&gt;
&lt;p&gt;A table format is only as useful as the catalog that organizes it. Iceberg catalogs manage metadata, access control, and engine interoperability, essential pieces of a production lakehouse. In this section, you’ll explore the expanding catalog ecosystem, from open implementations like Apache Polaris to commercial and hybrid options. These resources explain how catalogs enable discoverability, governance, and smooth multi‑engine coordination across your data environment.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://amdatalakehouse.substack.com/p/an-exploration-of-the-commercial?r=h4f8p&quot;&gt;An Exploration of Commercial Ecosystem of Iceberg Catalogs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://amdatalakehouse.substack.com/p/building-a-universal-lakehouse-catalog?r=h4f8p&quot;&gt;Building a Universal Lakehouse Catalog: Catalogs beyond Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/the-growing-apache-polaris-ecosystem-the-growing-apache-iceberg-catalog-standard/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;The Growing Apache Polaris Ecosystem (The Growing Apache Iceberg Catalog Standard)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Apache Iceberg Table Optimization&lt;/h3&gt;
&lt;p&gt;Keeping Iceberg tables fast requires more than good schema design. Over time, data fragmentation, small files, and metadata sprawl can slow queries and inflate costs. The articles in this section show how to maintain healthy tables through compaction, clustering, and automatic optimization. You’ll also learn how modern platforms like Dremio manage this maintenance autonomously so performance tuning doesn’t become a full‑time job.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/optimizing-iceberg-tables/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Optimizing Apache Iceberg Tables – Manual and Automatic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-table-performance-management-with-dremios-optimize/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Apache Iceberg Table Performance Management with Dremio’s OPTIMIZE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/minimizing-iceberg-table-management-with-smart-writing/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Minimizing Iceberg Table Management with Smart Writing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-table-storage-management-with-dremios-vacuum-table/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Apache Iceberg Table Storage Management with Dremio’s VACUUM TABLE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/@alexmercedtech/materialization-and-acceleration-in-the-iceberg-lakehouse-era-comparing-dremio-trino-doris-de3c96413b1a&quot;&gt;Materialization and Query Optimization in the Iceberg Era&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Iceberg Technical Deep Dives&lt;/h3&gt;
&lt;p&gt;Once you understand the basics, the real fun begins. These deep dives unpack how Iceberg works under the hood, covering metadata structures, query caching, authentication, and advanced performance topics. Whether you’re benchmarking, extending the format, or building your own catalog integration, this section will help you understand Iceberg’s architecture and internal mechanics in detail.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/query-results-caching-on-iceberg-tables/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Query Results Caching on Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/benchmarking-framework-for-the-apache-iceberg-catalog-polaris/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Benchmarking Framework for the Apache Iceberg Catalog, Polaris&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/too-many-roundtrips-metadata-overhead-in-the-modern-lakehouse/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Too Many Roundtrips: Metadata Overhead in the Modern Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/introducing-dremio-auth-manager-for-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Introducing Dremio Auth Manager for Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/dremios-apache-iceberg-clustering-technical-blog/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Dremio’s Apache Iceberg Clustering: Technical Blog&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;The Future of Apache Iceberg&lt;/h3&gt;
&lt;p&gt;Apache Iceberg continues to evolve alongside emerging workloads like Agentic AI and next‑generation file formats. This section looks ahead at what’s coming; new format versions, engine integrations, and evolving standards such as Polaris and REST catalogs. If you want to stay informed on where Iceberg is heading and how it fits into the broader open‑data movement, start here.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/exploring-the-evolving-file-format-landscape-in-ai-era-parquet-lance-nimble-and-vortex-and-what-it-means-for-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Exploring the Evolving File Format Landscape in AI Era: Parquet, Lance, Nimble and Vortex And What It Means for Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-v3/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;What’s New in Apache Iceberg Format Version 3?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/the-state-of-apache-iceberg-v4-october-2025-edition-c186dc29b6f5&quot;&gt;The State of Apache Iceberg v4&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Agentic AI&lt;/h2&gt;
&lt;p&gt;Agentic AI is a new class of systems that don’t just answer questions, they take action. These agents make decisions, follow workflows, and learn from outcomes, but they’re only as smart as the data they can access. That’s where open lakehouse architectures come in. This section explores the intersection of data architecture and autonomous systems, with content focused on how to power agents using structured, governed, and real-time data from your Iceberg-based lakehouse. From semantic layers to zero-ETL federation, you&apos;ll see what it takes to build AI that isn&apos;t just reactive, but genuinely useful.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/the-model-context-protocol-mcp-a-beginners-guide-to-plug-and-play-agents/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;The Model Context Protocol (MCP): A Beginner’s Guide to Plug-and-Play Agents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://amdatalakehouse.substack.com/p/understanding-rpc-and-mcp-in-agentic?r=h4f8p&quot;&gt;Understanding the Role of RPC in Agentic AI &amp;amp; MCP&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/who-benefits-from-mcp-on-analytics-platforms/&quot;&gt;Who Benefits From MCP on an Analytics Platform?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://amdatalakehouse.substack.com/p/tutorial-multi-agent-collaboration?r=h4f8p&quot;&gt;Tutorial: Multi-Agent Collaboration with LangChain, MCP, and Google A2A Protocol&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://amdatalakehouse.substack.com/p/composable-analytics-with-agents?r=h4f8p&quot;&gt;Composable Analytics with Agents: Leveraging Virtual Datasets and the Semantic Layer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://amdatalakehouse.substack.com/p/unlocking-the-power-of-agentic-ai?r=h4f8p&quot;&gt;Unlocking the Power of Agentic AI with Dremio and Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/why-agentic-ai-needs-a-data-lakehouse/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Why Agentic AI Needs a Data Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/testing-mcp-integration-in-existing-data-pipelines/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Test Driving MCP: Is Your Data Pipeline Ready to Talk?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/using-dremios-mcp-server-with-agentic-ai-frameworks/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Using Dremio’s MCP Server with Agentic AI Frameworks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/using-the-dremio-mcp-server-with-any-llm-model/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Using Dremio MCP with any LLM Model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/how-dremio-reflections-give-agentic-ai-a-unique-edge/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;How Dremio Reflections Give Agentic AI a Unique Edge&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/optimizing-apache-iceberg-for-agentic-ai/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Optimizing Apache Iceberg for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>An Exploration of the Commercial Iceberg Catalog Ecosystem</title><link>https://iceberglakehouse.com/posts/2025-10-exploring-commerical-apache-iceberg-catalogs/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-10-exploring-commerical-apache-iceberg-catalogs/</guid><description>
**Get Data Lakehouse Books:**

- [Apache Iceberg: The Definitive Guide](https://drmevn.fyi/tableformatblog)
- [Apache Polaris: The Defintive Guide](h...</description><pubDate>Tue, 21 Oct 2025 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Defintive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.puppygraph.com/ebooks/apache-iceberg-digest-vol-1&quot;&gt;The Apache Iceberg Digest: Vol. 1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Community:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Roll&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://osscommunity.com&quot;&gt;OSS Community Listings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://developer.dremio.com&quot;&gt;Dremio Lakehouse Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apache Iceberg has quickly become the table format of choice for building open, flexible, and high-performance data lakehouses. It solves long-standing issues around schema evolution, ACID transactions, and engine interoperability. Enabling a shared, governed data layer across diverse compute environments.&lt;/p&gt;
&lt;p&gt;But while the table format itself is open and standardized, the catalog layer, the system responsible for tracking and exposing table metadata, is where key decisions begin to shape your architecture. How your organization selects and manages an Iceberg catalog can influence everything from query performance to write flexibility to vendor lock-in risk.&lt;/p&gt;
&lt;p&gt;This blog explores the current landscape of commercial Iceberg catalogs, focusing on the emerging Iceberg REST Catalog (IRC) standard and how different vendors interpret and implement it. We’ll examine where catalogs prioritize cross-engine interoperability, where they embed proprietary optimization features, and how organizations can approach these trade-offs strategically.&lt;/p&gt;
&lt;p&gt;You’ll also learn what options exist when native optimizations aren’t available, including how to design your own or consider a catalog-neutral optimization tool like Ryft.io (when using cloud object storage).&lt;/p&gt;
&lt;p&gt;By the end, you&apos;ll have a clear view of the commercial ecosystem, and a framework to help you choose a path that fits your technical goals while minimizing operational friction.&lt;/p&gt;
&lt;h2&gt;The Role of Iceberg REST Catalogs in the Modern Lakehouse&lt;/h2&gt;
&lt;p&gt;At the heart of every Apache Iceberg deployment is a catalog. It’s more than just a registry of tables, it’s the control plane for transactions, schema changes, and metadata access. And thanks to the Apache Iceberg REST Catalog (IRC) specification, catalogs no longer need to be tightly coupled to any single engine.&lt;/p&gt;
&lt;p&gt;The IRC defines a standardized, HTTP-based API that lets query engines like Spark, Trino, Flink, Dremio, and others communicate with a catalog in a consistent way. That means developers can write data from one engine and read it from another, without worrying about format mismatches or metadata drift.&lt;/p&gt;
&lt;p&gt;This decoupling brings three major benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Multi-language support&lt;/strong&gt;: Since the interface is language-agnostic, you can interact with the catalog from tools written in Java, Python, Rust, or Go.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compute independence&lt;/strong&gt;: Query and write operations don’t require the catalog to be embedded in the engine, everything runs through REST.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Adoption of the IRC spec is growing rapidly. Vendors like Dremio, Snowflake, Google, and Databricks now offer catalogs that expose some or all of the REST API. This trend signals a broader shift toward open metadata services, where engine choice is driven by workload needs, not infrastructure constraints.&lt;/p&gt;
&lt;p&gt;But as we’ll see next, implementing the REST API is only part of the story. The real architectural decisions start when you consider &lt;strong&gt;how these catalogs handle optimization, write access, and cross-engine consistency&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Key Considerations When Choosing a Catalog&lt;/h2&gt;
&lt;p&gt;Picking a catalog shapes how your Iceberg lakehouse runs. The decision affects who can read and write data, how tables stay performant, and how easy it is to run multiple engines. Focus on facts. Match catalog capabilities to your operational needs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Read-write interoperability.&lt;/strong&gt;&lt;br&gt;
Some catalogs expose the full Iceberg REST Catalog APIs so any compatible engine can read and write tables. Other offerings restrict external writes or recommend using specific engines for writes. These differences change how you design ingestion and cross-engine workflows.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Server-side performance features.&lt;/strong&gt;&lt;br&gt;
Catalogs vary in how much they manage table health for you. A few provide automated compaction, delete-file handling, and lifecycle management. Others leave those tasks to your teams and to open-source engines. If you want fewer operational jobs, prioritize a catalog with built-in performance management.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Vendor neutrality versus added convenience.&lt;/strong&gt;&lt;br&gt;
A catalog that automates maintenance reduces day-to-day work. It also increases dependency on that vendor’s maintenance model. If your priority is full independence across engines then you may prefer a catalog that implements the Iceberg REST spec faithfully so you can plan for external maintenance processes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Costs and Compatibility&lt;/strong&gt;
Some catalogs may be limited on which storage providers they can work with or may charge just for usage of the catalog even if you use external compute and this should be considered.&lt;/p&gt;
&lt;p&gt;A short checklist to evaluate a candidate catalog&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Does it implement the Iceberg REST Catalog APIs for both reads and writes?&lt;/li&gt;
&lt;li&gt;Does it provide automatic table maintenance or only catalog services?&lt;/li&gt;
&lt;li&gt;What write restrictions or safety guards exist for external engines?&lt;/li&gt;
&lt;li&gt;Which clouds and storage systems does it support?&lt;/li&gt;
&lt;li&gt;Are there extra costs to using the catalog?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Use this checklist when you compare offerings. It helps reveal trade-offs between operational simplicity and multi-engine freedom.&lt;/p&gt;
&lt;h2&gt;Catalog Optimization: Native vs. Neutral Approaches&lt;/h2&gt;
&lt;p&gt;Once your Iceberg tables are in place, keeping them fast and cost-effective becomes a daily concern. File sizes grow unevenly, delete files stack up, and query times creep higher. This is where table optimization comes in - and where catalog differences start to matter.&lt;/p&gt;
&lt;p&gt;Most commercial catalogs fall into two categories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Native Optimization Available&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manual Optimization Required&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Native Optimization Available&lt;/h3&gt;
&lt;p&gt;Vendors like &lt;strong&gt;Dremio&lt;/strong&gt;, &lt;strong&gt;AWS Glue&lt;/strong&gt;, and &lt;strong&gt;Databricks Unity Catalog&lt;/strong&gt; offer built-in optimization features that automatically manage compaction, delete file cleanup, and snapshot pruning. These features are often tightly integrated into their orchestration layers or compute engines.&lt;/p&gt;
&lt;p&gt;Benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No need to schedule Spark or Flink jobs manually&lt;/li&gt;
&lt;li&gt;Optimizations are triggered based on metadata activity&lt;/li&gt;
&lt;li&gt;Helps reduce cloud storage costs and improve query performance&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Tradeoff:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;These features are often proprietary and non-transferable. If you move catalogs or engines, you may lose automation and need to build optimization pipelines.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Catalog-Neutral or Manual Optimization&lt;/h3&gt;
&lt;p&gt;Some catalogs, including open-source options like Apache Polaris, don&apos;t come with built-in optimization. Instead, you have two options:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Run your own compaction pipelines&lt;/strong&gt; using engines like Spark or Flink. Can also manually orchestrate Dremio&apos;s OPTIMIZE and VACUUM commands with any catalog.&lt;/li&gt;
&lt;li&gt;Use a &lt;strong&gt;catalog-neutral optimization service&lt;/strong&gt; like &lt;strong&gt;Ryft.io&lt;/strong&gt;, which works across any REST-compatible catalog, but currently only supports storage on AWS, Azure, or GCP. There is also the open source Apache Amoro which automates the use of Spark based optimizations.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This route offers maximum flexibility but requires:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Engineering effort to configure and monitor compaction&lt;/li&gt;
&lt;li&gt;Knowledge of best practices for tuning optimization jobs&lt;/li&gt;
&lt;li&gt;A way to coordinate across engines to avoid conflicting writes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In short: if optimization is a feature you want off your plate, look for a catalog that handles it natively. If you prefer full control or need a more cloud-agnostic setup, neutral optimization tools or open workflows may serve you better.&lt;/p&gt;
&lt;h2&gt;What If Native Optimization Doesn’t Exist?&lt;/h2&gt;
&lt;p&gt;Not every catalog includes built-in optimization. If you&apos;re using a minimal catalog, or one that prioritizes openness over orchestration, you’ll need to handle performance tuning another way. That’s not a dealbreaker, but it does require a decision.&lt;/p&gt;
&lt;p&gt;Here are the two main paths forward when native optimization isn’t part of the package:&lt;/p&gt;
&lt;h3&gt;Option 1: Build Your Own Optimization Pipelines&lt;/h3&gt;
&lt;p&gt;Apache Iceberg is fully compatible with open engines like &lt;strong&gt;Apache Spark&lt;/strong&gt;, &lt;strong&gt;Flink&lt;/strong&gt;, and &lt;strong&gt;Dremio&lt;/strong&gt;. Each of these supports table maintenance features such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;File compaction&lt;/li&gt;
&lt;li&gt;Manifest rewriting&lt;/li&gt;
&lt;li&gt;Snapshot expiration&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can schedule these jobs using tools like Airflow or dbt, or embed them directly into your data ingestion flows. This approach works in any environment, including on-prem, hybrid, and cloud.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Complete flexibility in how and when you optimize&lt;/li&gt;
&lt;li&gt;Can tailor jobs to match data patterns and storage costs&lt;/li&gt;
&lt;li&gt;Fully open and vendor-independent&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Requires engineering effort to build, monitor, and tune jobs&lt;/li&gt;
&lt;li&gt;No centralized UI or automation unless you build one&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Option 2: Use a Catalog-Neutral Optimization Vendor&lt;/h3&gt;
&lt;p&gt;Vendors like Ryft.io offer managed optimization services designed specifically for Iceberg. These tools run outside your query engines and handle compaction, cleanup, and layout improvements without relying on any one catalog or engine.&lt;/p&gt;
&lt;p&gt;NOTE: Apache Amoro offers an open source optimization tool if looking for an open source option.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key detail&lt;/strong&gt;: Ryft currently only supports deployments that store data in &lt;strong&gt;AWS S3&lt;/strong&gt;, &lt;strong&gt;Azure Data Lake&lt;/strong&gt;, or &lt;strong&gt;Google Cloud Storage&lt;/strong&gt;. If you&apos;re using on-prem HDFS or other object stores, this may not be viable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No need to manage optimization logic&lt;/li&gt;
&lt;li&gt;Works across multiple compute engines and catalogs&lt;/li&gt;
&lt;li&gt;Keeps optimization decoupled from platform lock-in&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Limited to major cloud object storage unless using Apache Amoro&lt;/li&gt;
&lt;li&gt;Adds another vendor and billing model to your stack&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When native optimization isn’t available, the best path depends on your team’s appetite for operational work. DIY gives you control. Neutral services give you speed. Either way, optimization remains a critical layer - whether you manage it yourself or let someone else handle it.&lt;/p&gt;
&lt;h2&gt;5. The Interoperability Spectrum&lt;/h2&gt;
&lt;p&gt;One of the key promises of Apache Iceberg is engine interoperability. The Iceberg REST Catalog API was designed so any compliant engine: whether it&apos;s Spark, Flink, Trino, or Dremio, can access tables the same way. But in practice, not all catalogs offer equal levels of interoperability.&lt;/p&gt;
&lt;p&gt;Some catalogs expose full &lt;strong&gt;read/write access&lt;/strong&gt; to external engines using the REST API. Others allow only reads - or place restrictions on how writes must be performed. This creates a spectrum, where catalogs differ in how open or engine-specific they are.&lt;/p&gt;
&lt;p&gt;Here’s how several major catalogs compare:&lt;/p&gt;
&lt;p&gt;| Catalog                          | External Read Access | External Write Access | REST Spec Coverage | Notes                                                                                    |
| -------------------------------- | -------------------- | --------------------- | ------------------ | ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| &lt;strong&gt;Dremio Catalog&lt;/strong&gt;               | ✅ Full              | ✅ Full               | ✅ Full            | Based on Apache Polaris; full multi-engine support; no cost for external reads/writes    |
| &lt;strong&gt;Apache Polaris (Open Source)&lt;/strong&gt; | ✅ Full              | ✅ Full               | ✅ Full            | Vendor-neutral, open REST catalog, deploy yourself or get managed by Dremio or Snowflake |
| &lt;strong&gt;Databricks Unity Catalog&lt;/strong&gt;     | ✅ Full              | ✅ Full               | ✅ Full            | Optimization services are primarily Delta Lake Centered                                  |
| &lt;strong&gt;AWS Glue &amp;amp; AWS S3 Tables&lt;/strong&gt;     | ✅ Full              | ✅ Full               | ✅ Full            |                                                                                          |
| &lt;strong&gt;Google BigLake Metastore&lt;/strong&gt;     | ✅ Full              | ✅ Full               | ✅ Full (preview)  |                                                                                          |
| &lt;strong&gt;Snowflake Open Catalog&lt;/strong&gt;       | ✅ Full              | ✅ Full               | ✅ Full            | Based on Apache Polaris; Charged for requests to catalog from external reads/writes      |
| &lt;strong&gt;Snowflake Managed Tables&lt;/strong&gt;     | ✅ Full              | ❌ None               | ❌ None            | Tables can be externally read using Snowflake&apos;s SDK                                      |
| &lt;strong&gt;Microsoft OneLake&lt;/strong&gt;            | ✅ Full (Preview)    | ✅ Virtualized Writes | ✅ Full (preview)  | ✅ Virtualized via XTable                                                                | Implements Iceberg REST Catalog API in preview. Uses XTable for bi‑directional Delta ↔ Iceberg interop; supports real‑time delete‑vector translation. Iceberg layer is projected from Delta metadata. |
| &lt;strong&gt;MinIO AIStor&lt;/strong&gt;                 | ✅ Full              | ✅ Full               | ✅ Full            | ⚠️ Storage‑level Optimization Only                                                       | Integrates the Iceberg REST Catalog API directly into object storage. Eliminates need for external catalog DB. Optimized for high‑concurrency AI workloads. Best for self‑hosted or private‑cloud use. |
| &lt;strong&gt;Confluent TableFlow&lt;/strong&gt;          | ✅ Full              | ⚠️ Limited            | ✅ Full            | ⚠️ Fixed Snapshot Retention                                                              | Bridges Kafka topics to Iceberg tables. Automatic snapshot retention (10–100), no schema evolution. Uses Confluent‑managed Iceberg REST Catalog with credential vending.                               |
| &lt;strong&gt;DataHub Iceberg Catalog&lt;/strong&gt;      | ✅ Full              | ✅ Full               | ✅ Full            | S3 Only                                                                                  |                                                                                                                                                                                                        |&lt;/p&gt;
&lt;h3&gt;What This Means for You&lt;/h3&gt;
&lt;p&gt;If your architecture depends on multiple engines, the safest route is to choose a catalog that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Implements the full Iceberg REST spec&lt;/li&gt;
&lt;li&gt;Allows both reads and writes from all compliant engines&lt;/li&gt;
&lt;li&gt;Avoids redirecting writes through proprietary services or SDKs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This isn’t just about standards, it’s about reducing long-term friction. The more interoperable your catalog, the easier it is to plug in new tools, migrate workloads, or share datasets across teams without rewriting pipelines or triggering lock-in.&lt;/p&gt;
&lt;h3&gt;Architectural Patterns: Choosing the Right Iceberg Catalog for Your Stack&lt;/h3&gt;
&lt;p&gt;With a clear understanding of feature capabilities across commercial Iceberg catalogs, the next consideration is architectural alignment. How should teams select a catalog based on their engine stack, deployment model, and optimization philosophy?&lt;/p&gt;
&lt;p&gt;Here, we explore common deployment patterns and their implications:&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Single-Engine Simplicity&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Use case:&lt;/strong&gt; Organizations standardized on one compute engine seeking high performance and low operational overhead.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;✅ &lt;em&gt;Benefits:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;Seamless integration between compute and catalog.&lt;/li&gt;
&lt;li&gt;Native optimization features (e.g., OPTIMIZE TABLE, Z-Ordering).&lt;/li&gt;
&lt;li&gt;Simplified access control and performance tuning.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;⚠️ &lt;em&gt;Trade-offs:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;May impose file format restrictions (e.g., Parquet-only).&lt;/li&gt;
&lt;li&gt;Optimization tightly coupled to engine but if REST-Spec is ahered to you can still develop your own optimization pipelines.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Recommended Platforms:&lt;/strong&gt; Dremio, Databricks Unity Catalog, AWS Glue (with managed compute).&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Multi-Engine Interop (Spark + Trino + Flink)&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Use case:&lt;/strong&gt; Organizations with complex, multi-engine environments that require consistent metadata across tools and Clouds.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Benefits:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;Use the right engine for the right job (ETL, BI, ML).&lt;/li&gt;
&lt;li&gt;Maximize transactional openness via full IRC support.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Trade-offs:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;Optimization is either manual or vendor-dependent.&lt;/li&gt;
&lt;li&gt;Catalog-neutral solutions may lack server-side performance tuning.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Recommended Platforms:&lt;/strong&gt; Dremio Enterprise Catalog, Snowflake Open Catalog. (Both based on Apache Polaris)&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Streaming-First Architectures&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Use case:&lt;/strong&gt; Teams integrating real-time data from Kafka into the lakehouse for analytics or ML.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Benefits:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;Stream-native catalog (e.g., Confluent TableFlow) materializes Kafka topics into Iceberg tables.&lt;/li&gt;
&lt;li&gt;Seamless schema registration and time-travel.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Trade-offs:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;No schema evolution.&lt;/li&gt;
&lt;li&gt;Limited optimization control (rigid snapshot retention).&lt;/li&gt;
&lt;li&gt;Often designed for read-heavy use cases.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Recommended Platforms:&lt;/strong&gt; Confluent TableFlow, integrated with external catalogs for downstream processing.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Cloud-Embedded Storage Catalogs&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Use case:&lt;/strong&gt; Teams deploying AI or analytics workloads in private/hybrid cloud environments.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Benefits:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;Built-in REST Catalog support directly within storage (MinIO AIStor).&lt;/li&gt;
&lt;li&gt;Simplifies deployment: no separate metadata layer deployment.&lt;/li&gt;
&lt;li&gt;High concurrency and transactional consistency at scale.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Trade-offs:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;Tightly bound to object storage vendor.&lt;/li&gt;
&lt;li&gt;No native table optimization&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Recommended Platforms:&lt;/strong&gt; MinIO AIStor (on-premise/private cloud), AWS S3 Tables (cloud-native equivalent).&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Governance-Led Architectures&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Use case:&lt;/strong&gt; Enterprises prioritizing metadata lineage, compliance, and discovery.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Benefits:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;Centralized metadata layer for observability and access management.&lt;/li&gt;
&lt;li&gt;Easy discovery and tracking across teams and tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Trade-offs:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;No native write capabilities (metadata-only catalog).&lt;/li&gt;
&lt;li&gt;Optimization must be handled by external systems.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Recommended Platforms:&lt;/strong&gt; DataHub Iceberg Catalog (OSS or Cloud) or using an external catalog (Dremio Catalog, Apache Polaris) connected into Datahub.&lt;/p&gt;
&lt;p&gt;Each pattern has architectural trade-offs. Rather than seeking a perfect catalog, successful teams prioritize &lt;strong&gt;alignment with workflow needs&lt;/strong&gt;: engine independence, optimization automation, governance, or real-time ingestion. In some cases, hybrid strategies, like dual catalogs or catalog, neutral optimization overlays - provide the best of both worlds.&lt;/p&gt;
&lt;h2&gt;Optimization Strategy Trade-offs: Native, Manual, or Vendor-Neutral&lt;/h2&gt;
&lt;p&gt;Once an organization selects a catalog, the next major architectural decision is how to &lt;strong&gt;maintain and optimize Iceberg tables&lt;/strong&gt;. While the IRC standard guarantees transactional consistency, it says nothing about how tables should be optimized over time to preserve performance and control storage costs.&lt;/p&gt;
&lt;p&gt;Three primary approaches emerge:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Native Optimization (Catalog-Integrated Automation)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Many commercial catalogs offer built-in optimization features tightly coupled with their own compute engines. These include operations such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Compaction (file size tuning)&lt;/li&gt;
&lt;li&gt;Delete file rewriting&lt;/li&gt;
&lt;li&gt;Snapshot expiration&lt;/li&gt;
&lt;li&gt;Partition clustering&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Platforms like &lt;strong&gt;Dremio&lt;/strong&gt;, &lt;strong&gt;AWS Glue&lt;/strong&gt;, and &lt;strong&gt;Databricks&lt;/strong&gt; provide SQL-native or automated processes (e.g., &lt;code&gt;OPTIMIZE TABLE&lt;/code&gt;, auto-compaction) that manage these operations behind the scenes.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;✅ &lt;em&gt;Pros:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Zero setup - optimization is automatic or declarative.&lt;/li&gt;
&lt;li&gt;Built-in cost and performance tuning.&lt;/li&gt;
&lt;li&gt;Reduces engineering overhead.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;⚠️ &lt;em&gt;Cons:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Usually catalog-bound.&lt;/li&gt;
&lt;li&gt;Often restricted to Parquet format.&lt;/li&gt;
&lt;li&gt;Switching catalogs later requires reengineering optimization logic.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. &lt;strong&gt;Manual Optimization (Bring Your Own Engine)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Open-source Iceberg supports all required lifecycle management operations: compaction, snapshot cleanup, rewrite manifests, but leaves it up to users to implement these jobs using engines like &lt;strong&gt;Apache Spark&lt;/strong&gt;, &lt;strong&gt;Flink&lt;/strong&gt;, &lt;strong&gt;Apache Amoro&lt;/strong&gt; or &lt;strong&gt;Trino&lt;/strong&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;✅ &lt;em&gt;Pros:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Total freedom: no vendor lock-in.&lt;/li&gt;
&lt;li&gt;Can be integrated into any data pipeline or orchestration framework (Airflow, dbt, Dagster).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;⚠️ &lt;em&gt;Cons:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Requires custom development and scheduling.&lt;/li&gt;
&lt;li&gt;Monitoring and tuning are the user&apos;s responsibility.&lt;/li&gt;
&lt;li&gt;Risk of misconfiguration or inconsistent maintenance across tables.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This model works well with catalogs like &lt;strong&gt;Apache Polaris&lt;/strong&gt;, &lt;strong&gt;OneLake&lt;/strong&gt;, or &lt;strong&gt;Snowflake Open Catalog&lt;/strong&gt;, which support R/W operations but do not enforce optimization strategies.&lt;/p&gt;
&lt;h3&gt;3. &lt;strong&gt;Catalog-Neutral Optimization Vendors (e.g., Ryft.io)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;A newer middle ground is emerging with vendors like &lt;strong&gt;Ryft.io&lt;/strong&gt;, which offer catalog-agnostic optimization as a service. These platforms connect to your existing Iceberg tables: via any REST-compliant catalog, and run automated optimization jobs externally.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;✅ &lt;em&gt;Pros:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Centralized, automated optimization regardless of catalog.&lt;/li&gt;
&lt;li&gt;Maintains interoperability and neutrality.&lt;/li&gt;
&lt;li&gt;Works across major cloud storage (e.g., S3, ADLS, GCS).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;⚠️ &lt;em&gt;Cons:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Still a maturing category.&lt;/li&gt;
&lt;li&gt;Requires compatible storage (cloud object stores).&lt;/li&gt;
&lt;li&gt;Additional cost and integration complexity.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is particularly valuable in multi-engine or multi-catalog environments where optimization cannot be centrally enforced but must still be automated and reliable.&lt;/p&gt;
&lt;h3&gt;Summary: The Optimization Dilemma&lt;/h3&gt;
&lt;p&gt;There is no one-size-fits-all solution:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Primary Trade-off&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Native Optimization&lt;/td&gt;
&lt;td&gt;Simplicity, integrated platforms&lt;/td&gt;
&lt;td&gt;Vendor lock-in, format constraints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual (BYO Engine)&lt;/td&gt;
&lt;td&gt;Open source, full control&lt;/td&gt;
&lt;td&gt;Operational complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vendor-Neutral (Ryft)&lt;/td&gt;
&lt;td&gt;Multi-cloud &amp;amp; multi-engine ops&lt;/td&gt;
&lt;td&gt;Added service dependency, still emerging&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Choosing an optimization strategy is not just about performance - it’s a decision about &lt;strong&gt;how much control you need&lt;/strong&gt;, &lt;strong&gt;how much complexity you can absorb&lt;/strong&gt;, and &lt;strong&gt;how much optionality you want to preserve&lt;/strong&gt; in your architecture.&lt;/p&gt;
&lt;h2&gt;Architectural Patterns for Balancing Optimization and Interoperability&lt;/h2&gt;
&lt;p&gt;As organizations adopt Apache Iceberg REST Catalogs (IRC) to decouple compute from metadata, a recurring challenge emerges: how to balance &lt;strong&gt;open interoperability&lt;/strong&gt; with the benefits of &lt;strong&gt;proprietary optimization&lt;/strong&gt;. No single approach satisfies every use case. Instead, data architects are increasingly designing &lt;strong&gt;hybrid strategies&lt;/strong&gt; that reflect the unique demands of their data workflows, regulatory environments, and performance SLAs.&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Read-Only Catalogs Paired with External Optimization&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Some catalogs, provide high-performance read access to Iceberg tables but restrict external writes via IRC. In these scenarios, organizations may:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Maintain &lt;strong&gt;a separate write-optimized catalog&lt;/strong&gt; (e.g., Apache Polaris, Nessie, or Glue) for ingestion, transformation and optimization.&lt;/li&gt;
&lt;li&gt;Expose tables to the read-optimized catalog &lt;strong&gt;after ingestion and optimization is complete&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Schedule synchronization jobs to ensure both catalogs reference consistent metadata snapshots.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This dual-catalog approach preserves the performance of engines with these restrictions while maintaining &lt;strong&gt;external transactional control&lt;/strong&gt; via a neutral or R/W-capable catalog.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Pros:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;Best of both worlds: performance + flexibility.&lt;/li&gt;
&lt;li&gt;Avoids modifying data in restrictive environments.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Cons:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;Adds metadata orchestration complexity.&lt;/li&gt;
&lt;li&gt;Difficult to manage at high scale without automation.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. &lt;strong&gt;Embedded Catalogs for Self-Managed Environments&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Solutions like &lt;strong&gt;MinIO AIStor&lt;/strong&gt; and &lt;strong&gt;Dremio Enterprise Catalog&lt;/strong&gt; take a radically different approach, embedding the IRC layer directly into the object store or lakehouse platform in Dremio&apos;s case. This creates a streamlined deployment architecture for &lt;strong&gt;private cloud, hybrid, or air-gapped&lt;/strong&gt; environments where full control is required.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Enables transactional Iceberg workloads without deploying a separate metadata database.&lt;/li&gt;
&lt;li&gt;Suited for exascale, high-concurrency AI/ML pipelines.&lt;/li&gt;
&lt;li&gt;Can be used alongside external catalogs for metadata synchronization if needed.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This model is also increasingly relevant for regulated industries or enterprises seeking on-premise lakehouse designs with built-in metadata authority.&lt;/p&gt;
&lt;h4&gt;3. &lt;strong&gt;Virtualized Format Interop via Metadata Translation&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Microsoft OneLake&lt;/strong&gt;, using &lt;strong&gt;Apache XTable&lt;/strong&gt;, pioneers a virtualized metadata model. Instead of writing new Iceberg tables, XTable &lt;strong&gt;projects Iceberg-compatible metadata from Delta Lake&lt;/strong&gt; tables in OneLake.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;✅ Enables external Iceberg engines to query Delta-based data with no duplication.&lt;/li&gt;
&lt;li&gt;🔄 Metadata is derived dynamically, enabling near real-time interop.&lt;/li&gt;
&lt;li&gt;⚠️ Complex Iceberg-native features may be unsupported due to reliance on Delta primitives.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This architecture is ideal for organizations deeply committed to Delta Lake but wanting to provide &lt;strong&gt;Iceberg-compatible access&lt;/strong&gt; for federated analytics or open-source tools.&lt;/p&gt;
&lt;h3&gt;Architectural Takeaway: Mix and Match for Your Use Case&lt;/h3&gt;
&lt;p&gt;The modern Iceberg ecosystem isn’t about picking a single vendor. Instead, it’s about selecting interoperable components that align with your architecture&apos;s &lt;strong&gt;performance, governance, and flexibility goals&lt;/strong&gt;.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Catalog Strategy&lt;/th&gt;
&lt;th&gt;Optimization Path&lt;/th&gt;
&lt;th&gt;Interop Balance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cloud-native with automation&lt;/td&gt;
&lt;td&gt;AWS Glue, Dremio&lt;/td&gt;
&lt;td&gt;Native Automation&lt;/td&gt;
&lt;td&gt;High (if Parquet)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-engine, multi-cloud&lt;/td&gt;
&lt;td&gt;Dremio Catalog, Snowflake Open Catalog&lt;/td&gt;
&lt;td&gt;Built on OSS with Full Interop&lt;/td&gt;
&lt;td&gt;Very High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Private/Hybrid cloud&lt;/td&gt;
&lt;td&gt;MinIO AIStor or Dremio Catalog&lt;/td&gt;
&lt;td&gt;Embedded in software for lakehouse storage or lakehouse engine&lt;/td&gt;
&lt;td&gt;Medium–High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stream → Lakehouse&lt;/td&gt;
&lt;td&gt;Confluent TableFlow&lt;/td&gt;
&lt;td&gt;Fixed strategy (snapshots)&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Delta → Iceberg bridge&lt;/td&gt;
&lt;td&gt;OneLake + XTable&lt;/td&gt;
&lt;td&gt;Virtualized sync&lt;/td&gt;
&lt;td&gt;High for reads&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Designing an effective catalog strategy means embracing modularity, using REST interoperability as the glue while tailoring optimization and governance layers to the needs of your teams.&lt;/p&gt;
&lt;h3&gt;Conclusion: Choosing the Right Iceberg Catalog for Your Strategy&lt;/h3&gt;
&lt;p&gt;The Apache Iceberg REST Catalog ecosystem has matured into a diverse landscape of offerings: each with its own balance of &lt;strong&gt;interoperability&lt;/strong&gt;, &lt;strong&gt;optimization capability&lt;/strong&gt;, and &lt;strong&gt;vendor integration strategy&lt;/strong&gt;. From hyperscalers to open-source initiatives, every catalog presents unique strengths and trade-offs.&lt;/p&gt;
&lt;p&gt;At the heart of this evolution is a simple but profound architectural truth:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Compute and metadata must decouple - but performance, governance, and interoperability must still align.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4&gt;🧠 Key Takeaways&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;If performance and simplicity are your top priorities&lt;/strong&gt;, a native-optimization platform like Dremio, Databricks, or AWS Glue offers seamless, powerful lifecycle management.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;If complete control and flexibility across tools and clouds matter more&lt;/strong&gt;, choose a self-managed catalog like Apache Polaris and prepare to invest in your own optimization pipeline or use a neutral optimizer like Ryft.io (when on major cloud object storage) or use the OSS Apache Amoro.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;If you&apos;re locked into an analytics platform&lt;/strong&gt; like Snowflake or BigQuery, understand the implications of the differing level of Iceberg support on these platforms.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Among the platforms reviewed, Dremio strikes a rare balance: offering full Iceberg REST compatibility, native R/W support from any engine, and automated optimization, without locking users into its compute layer.&lt;/p&gt;
&lt;p&gt;Unlike platforms that charge per API call or limit external writes, Dremio only charges for compute &lt;strong&gt;run through Dremio itself&lt;/strong&gt;, meaning you can leverage external engines freely while still benefiting from the platform’s integrated catalog.&lt;/p&gt;
&lt;p&gt;This model promotes &lt;strong&gt;interoperability and performance without compromise&lt;/strong&gt;, aligning with the core principles of the Iceberg Lakehouse architecture: open metadata, multi-engine flexibility, and governed performance.&lt;/p&gt;
&lt;h3&gt;Final Thought&lt;/h3&gt;
&lt;p&gt;The Iceberg REST Catalog isn’t just an API spec, it’s the foundation for a new kind of lakehouse: open, transactional, and cloud-agnostic. Your choice of catalog defines how far you can scale without friction.&lt;/p&gt;
&lt;p&gt;Choose wisely.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Building a Universal Lakehouse Catalog - Beyond Iceberg Tables</title><link>https://iceberglakehouse.com/posts/2025-10-Building-Universal-Lakehouse-Catalog/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-10-Building-Universal-Lakehouse-Catalog/</guid><description>
**Get Data Lakehouse Books:**

- [Apache Iceberg: The Definitive Guide](https://drmevn.fyi/tableformatblog)
- [Apache Polaris: The Defintive Guide](h...</description><pubDate>Fri, 17 Oct 2025 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Defintive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.puppygraph.com/ebooks/apache-iceberg-digest-vol-1&quot;&gt;The Apache Iceberg Digest: Vol. 1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Community:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Roll&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://osscommunity.com&quot;&gt;OSS Community Listings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://developer.dremio.com&quot;&gt;Dremio Lakehouse Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a href=&quot;https://open.spotify.com/show/2PRDrWVpgDvKxN6n1oUsJF?si=e1a55e628ce74a10&quot;&gt;Will be recording an episode on this topic on my podcast, so please subscribe to the podcast to not miss it (Also on iTunes and other directories)&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Apache Iceberg has done something few projects manage to pull off, it created a standard. Its table format and REST-based catalog interface made it possible for different engines to read, write, and govern the same data without breaking consistency. That’s a big deal. For the first time, organizations could mix and match engines while keeping one clean, transactional view of their data.&lt;/p&gt;
&lt;p&gt;But this success brings new expectations.&lt;/p&gt;
&lt;p&gt;As lakehouse adoption grows, teams want more than just Iceberg tables under one roof. They want to treat &lt;em&gt;all&lt;/em&gt; their datasets, raw Parquet files, streaming logs, external APIs, or even other formats like Delta and Hudi, with the same consistency and governance. The problem? Today’s Iceberg catalogs don’t support that. They’re built for Iceberg tables only.&lt;/p&gt;
&lt;p&gt;So how do we move beyond that? How do we build a &lt;strong&gt;universal&lt;/strong&gt; lakehouse catalog that works across engines &lt;em&gt;and&lt;/em&gt; across formats?&lt;/p&gt;
&lt;p&gt;Let’s explore two possible paths and what’s still missing.&lt;/p&gt;
&lt;h2&gt;Iceberg’s Success: A Case Study in Standardization&lt;/h2&gt;
&lt;p&gt;To understand where catalogs could go next, it helps to look at what made Iceberg successful in the first place.&lt;/p&gt;
&lt;p&gt;Before Iceberg, working with data lakes was messy. You could store files in open formats like Parquet or ORC, but there was no clean way to manage schema changes, version history, or transactional consistency. Each engine had to implement its own logic, or worse, teams had to build brittle pipelines to fill in the gaps.&lt;/p&gt;
&lt;p&gt;Iceberg changed that. It introduced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A table format that handles schema evolution, ACID transactions, and partitioning without sacrificing openness.&lt;/li&gt;
&lt;li&gt;A catalog interface that lets any engine discover tables and retrieve metadata in a consistent way.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These two specs, the table format and the REST catalog interface, created a plug-and-play model. Spark, Flink, Trino, Dremio, and others could all speak the same language. As a result, Iceberg became the neutral zone. No vendor lock-in, no hidden contracts.&lt;/p&gt;
&lt;p&gt;But that neutrality came with a scope: Iceberg REST Catalog only tracks and governs &lt;strong&gt;Iceberg tables&lt;/strong&gt;. If your dataset isn’t an Iceberg table, there is no modern open interoperable standard for governing and accessing. And that’s where the limitation begins.&lt;/p&gt;
&lt;h2&gt;The Problem: No Standards Beyond Iceberg&lt;/h2&gt;
&lt;p&gt;While Iceberg catalogs are tightly defined for Iceberg tables, some catalogs &lt;em&gt;do&lt;/em&gt; allow you to register other types of datasets, raw Parquet, Delta tables, external views, or even API-based data sources.&lt;/p&gt;
&lt;p&gt;But there’s a catch.&lt;/p&gt;
&lt;p&gt;Each catalog handles this differently. One might use a custom registration API, another might expose a metadata file format, and yet another might treat external sources as virtual tables with limited capabilities. The result is a patchwork of behavior:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Some tools can read those datasets.&lt;/li&gt;
&lt;li&gt;Some can&apos;t see them at all.&lt;/li&gt;
&lt;li&gt;Others behave inconsistently depending on the engine and the catalog.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This makes interoperability fragile. What works in one engine may not work in another, even if they both support the same table format. Teams are left stitching together workarounds or writing custom integrations just to get basic access across systems.&lt;/p&gt;
&lt;p&gt;So what’s really missing here? A &lt;strong&gt;standard API&lt;/strong&gt; for non-Iceberg datasets. Something that defines:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How to register a dataset that isn&apos;t an Iceberg table.&lt;/li&gt;
&lt;li&gt;How to describe its metadata (schema, location, stats).&lt;/li&gt;
&lt;li&gt;How to govern access across different engines.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The big question is: where should this standard come from, and what should it look like?&lt;/p&gt;
&lt;h2&gt;Where Should the Standard Come From?&lt;/h2&gt;
&lt;p&gt;This brings us to the real crossroads: if we need a standard API for universal lakehouse catalogs, where should it come from?&lt;/p&gt;
&lt;p&gt;There are a few possibilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Should it come from the Iceberg REST spec?&lt;/strong&gt;&lt;br&gt;
That would keep things in the same family and build on an existing community standard. But Iceberg’s current REST spec is tightly scoped around Iceberg tables, and expanding it to cover other data types could be a big shift and expand the project beyond what the community may be comfortable with.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Should it be defined inside a single catalog project like Polaris or Unity?&lt;/strong&gt;&lt;br&gt;
A vendor-backed project can move quickly, implement end-to-end features, and ship a working solution but then be a source of lock-in. If an open standard catalog dominates, then it becomes the home of the API standard by default.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Is it acceptable if the spec starts with a vendor?&lt;/strong&gt;&lt;br&gt;
Maybe. If that vendor drives real adoption and the API is later opened up, it can evolve into a neutral standard. But it would need wide buy-in and careful governance, to avoid becoming another moving target.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;No matter how you look at it, there are really only two main paths forward:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;An implementation becomes the de facto standard.&lt;/strong&gt;&lt;br&gt;
One catalog (open source or commercial) builds enough momentum that its API becomes the standard, similar to how S3 became the API for object storage.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A neutral API spec is created independently.&lt;/strong&gt;&lt;br&gt;
This would follow the Iceberg model, where the spec came first, then vendors and engines built around it.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If history teaches us anything, it’s that vendor-driven standards can create long-term friction. S3 is a good example: it&apos;s ubiquitous, but it’s also tightly bound to a single provider’s roadmap leading to a whack-a-mole like catch up game for those who support the API they have no control over. That experience shaped how the industry approached table formats, this time, the community came together around Iceberg to avoid that kind of lock-in and vendor catch-up.&lt;/p&gt;
&lt;p&gt;So whatever path we take toward universal cataloging, the smart money is on a &lt;strong&gt;community standard&lt;/strong&gt;. The only question is whether that standard comes from an existing implementation, or from a new, vendor-neutral spec that everyone agrees to follow.&lt;/p&gt;
&lt;h2&gt;Exploring the Implementation-First Path: Apache Polaris and Table Sources&lt;/h2&gt;
&lt;p&gt;If the path to a universal catalog starts with an implementation, Apache Polaris (incubating) is worth watching closely. Among the open catalog projects, Polaris stands out for two reasons:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;It&apos;s built as an open implementation of the Apache Iceberg REST Catalog spec.&lt;/li&gt;
&lt;li&gt;It&apos;s actively proposing new features to extend catalog support beyond Iceberg tables.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;While Polaris already supports Iceberg tables through the standard REST interface, it&apos;s exploring how to bring non-Iceberg datasets into the same catalog. This includes both structured file-based datasets like Parquet or JSON, and unstructured data like images, PDFs, or videos.&lt;/p&gt;
&lt;p&gt;Right now, Polaris includes a feature called &lt;strong&gt;Generic Tables&lt;/strong&gt;, but a more robust proposal called &lt;strong&gt;Table Sources&lt;/strong&gt; is under active discussion.&lt;/p&gt;
&lt;h3&gt;What Are Table Sources?&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://lists.apache.org/thread/652z1f1n2pgf3g2ow5y382wlrtnoqth0&quot;&gt;Discussion of this proposal on the Dev List&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table Sources&lt;/strong&gt; are a proposed abstraction that lets Polaris register and govern external data that isn’t already an Iceberg table. Instead of forcing everything into the Iceberg format, Polaris acts as a bridge: mapping object storage locations to queryable tables using metadata services that live outside the catalog itself.&lt;/p&gt;
&lt;p&gt;Each &lt;strong&gt;Table Source&lt;/strong&gt; includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A name (used as the table identifier)&lt;/li&gt;
&lt;li&gt;A source type (structured data, unstructured objects, or Iceberg metadata)&lt;/li&gt;
&lt;li&gt;A configuration (like file format, storage location, credentials, filters, and refresh intervals)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data Table Source&lt;/strong&gt;: Represents structured files like Parquet or JSON. These are registered read-only tables with metadata generated by an external service.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Object Table Source&lt;/strong&gt;: Describes unstructured data like videos or documents, exposing file metadata (size, path, modification time) in table format.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Iceberg Table Source&lt;/strong&gt;: Adapts metadata from existing Iceberg tables stored outside Polaris.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Polaris doesn’t scan or interpret these datasets directly. Instead, &lt;strong&gt;Source Services&lt;/strong&gt;, external processes, use the registered configurations to scan file systems, generate table metadata, and push it back to Polaris. This decouples the engine from the source and the catalog from the scanning logic.&lt;/p&gt;
&lt;p&gt;At query time, engines can interact with these registered tables using the same APIs as they would for Iceberg, even though the backing data may not follow Iceberg’s spec.&lt;/p&gt;
&lt;h3&gt;Why This Matters&lt;/h3&gt;
&lt;p&gt;If adopted, the &lt;strong&gt;Table Source&lt;/strong&gt; feature could give Polaris a head start as the reference implementation for a broader catalog API. It defines a reusable contract for registering external data, managing its lifecycle, and governing access, all in a way that’s decoupled from specific engines or formats.&lt;/p&gt;
&lt;p&gt;But this also raises the bigger question: will other catalogs follow this model? Will engines adopt the same contract for recognizing external data? Or will each system continue to define its own rules?&lt;/p&gt;
&lt;p&gt;That tension, between an evolving implementation like Polaris and the desire for an extension to the REST Catalog API standard, sets the stage for what comes next in the catalog story.&lt;/p&gt;
&lt;h2&gt;The API-First Path: Extending the Iceberg REST Catalog Spec&lt;/h2&gt;
&lt;p&gt;Now let’s explore the other side of the equation: what if instead of extending a specific implementation, we expanded the &lt;strong&gt;Iceberg REST Catalog specification itself&lt;/strong&gt;?&lt;/p&gt;
&lt;p&gt;This approach would focus on defining a &lt;strong&gt;neutral contract&lt;/strong&gt; that any catalog, Polaris, Unity, Glue, or others, could implement to support more than just Iceberg tables. Rather than focusing on what a specific system can do today, it asks: &lt;em&gt;what could a future REST catalog look like if it supported universal datasets by design?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;One of the most interesting signs of this potential is already in the spec: the &lt;strong&gt;Scan Planning Endpoint&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;What Is Scan Planning?&lt;/h3&gt;
&lt;p&gt;In the typical read path, an engine:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Requests a table from the catalog.&lt;/li&gt;
&lt;li&gt;The catalog responds with the metadata location.&lt;/li&gt;
&lt;li&gt;The engine reads the metadata files (manifests, snapshots, etc.) and plans which Parquet files to scan.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;But with the &lt;strong&gt;Scan Planning Endpoint&lt;/strong&gt;, the flow changes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The engine calls the endpoint directly.&lt;/li&gt;
&lt;li&gt;The catalog does the heavy lifting: it traverses the metadata, evaluates filters, and returns a &lt;strong&gt;list of data files&lt;/strong&gt; to scan.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This makes the engine’s job simpler if the catalog and engine support the endpoint. It no longer needs to understand Iceberg’s metadata structure. It just gets files to read.&lt;/p&gt;
&lt;h3&gt;Why This Matters for Universal Catalogs&lt;/h3&gt;
&lt;p&gt;By pushing scan planning into the catalog, the spec opens the door to something bigger:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The catalog could expose &lt;strong&gt;non-Iceberg&lt;/strong&gt; datasets, like Delta Lake, Hudi, or raw Parquet, and return scan plans for them.&lt;/li&gt;
&lt;li&gt;It could also &lt;strong&gt;cache metadata&lt;/strong&gt; in a relational database, avoiding repeated reads from object storage.&lt;/li&gt;
&lt;li&gt;Engines remain agnostic to metadata formats, they just scan files.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is a fundamental shift: the &lt;strong&gt;catalog becomes the query planner for metadata&lt;/strong&gt;, not just a metadata store.&lt;/p&gt;
&lt;p&gt;But here’s the big catch: this currently only exists on the &lt;strong&gt;read&lt;/strong&gt; side.&lt;/p&gt;
&lt;p&gt;There’s no equivalent in the spec today for the &lt;strong&gt;write path&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;A Hypothetical Write-Side Extension&lt;/h3&gt;
&lt;p&gt;Imagine this: instead of asking the engine to write metadata files (as is required today), the engine submits a write payload to the catalog:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The namespace of the table&lt;/li&gt;
&lt;li&gt;The table type&lt;/li&gt;
&lt;li&gt;A list of new data files and associated summary statistics&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The catalog could then:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Internally update its metadata, whether that’s JSON files, a manifest database, or some other format&lt;/li&gt;
&lt;li&gt;Enforce governance rules&lt;/li&gt;
&lt;li&gt;Trigger compaction or indexing tasks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this model, the catalog fully owns metadata management for both reads and writes. Engines don’t need to understand Iceberg’s internals, or any other format’s internals. They just write and read data and delegate everything else.&lt;/p&gt;
&lt;h3&gt;The Trade-Offs&lt;/h3&gt;
&lt;p&gt;This model is clean and powerful. It simplifies engine logic and opens the door for catalogs to support any file-based dataset. But it comes at a cost:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;catalog must be deeply optimized&lt;/strong&gt; to handle scan planning at scale.&lt;/li&gt;
&lt;li&gt;It must support high concurrency, incremental updates, and aggressive caching.&lt;/li&gt;
&lt;li&gt;Metadata operations become tightly coupled to catalog performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In other words, this model places a lot more responsibility on the catalog itself. That’s not necessarily bad, but it changes the design expectations.&lt;/p&gt;
&lt;p&gt;Still, if the goal is to build a &lt;strong&gt;universal contract&lt;/strong&gt; for working with datasets across formats, pushing more of that logic into the catalog, via a standardized API that even the major cloud vendors follow, might be the path forward.&lt;/p&gt;
&lt;h2&gt;Comparing the Two Paths: Implementation vs. API Standard&lt;/h2&gt;
&lt;p&gt;Both the &lt;em&gt;Table Sources&lt;/em&gt; approach and the &lt;em&gt;Scan Planning API model&lt;/em&gt; offer ways to move beyond Iceberg-only catalogs. But they take fundamentally different routes. One starts by expanding what a specific catalog can do and becomes the standard if that catalog becomes the standard. The other extends an API Spec that is already an industry standard with a narrower scope (standardizing transactions with Iceberg tables).&lt;/p&gt;
&lt;p&gt;Let’s weigh the trade-offs.&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Flexibility and Expressiveness&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Table Sources (Implementation-first)&lt;/strong&gt;&lt;br&gt;
✅ Easier to move quickly, Polaris can prototype and evolve features as it is a younger project with a younger community that can reach consensus quicker.&lt;br&gt;
✅ Can support structured and unstructured datasets with source-specific logic.&lt;br&gt;
✅ Avoid the lock-in of a vendor implementation becoming the standard, since Apache Polaris is a incubating Apache Project anyone can deploy.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Scan Planning Extension (API-first)&lt;/strong&gt;&lt;br&gt;
✅ Treats all datasets as files with a metadata interface, engines don’t need to know anything about the metadata format.&lt;br&gt;
✅ Opens the door for catalogs to expose Delta, Hudi, Paimon, or other sources using the same scan API.&lt;br&gt;
⚠️ Metadata management becomes much more complex for the catalog, especially for large tables or real-time use cases.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In both scenarios, there is still always the question of a specific engines support for reading different file formats or metadata formats. Although, in both scenarios the catalog can still be the central listing governing access to all lakehouse datasets.&lt;/p&gt;
&lt;h3&gt;2. &lt;strong&gt;Governance and Control&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Table Sources&lt;/strong&gt;&lt;br&gt;
✅ Catalog remains the system of record and point of governance.&lt;br&gt;
✅ Supports configuration-based registration, access control, and credential vending.&lt;br&gt;
⚠️ Each source type needs its own metadata strategy, increasing maintenance complexity.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Scan Planning + Write Delegation&lt;/strong&gt;&lt;br&gt;
✅ Centralizes all metadata handling, which could unify governance and simplify access rules.&lt;br&gt;
⚠️ Puts more strain on catalog durability, uptime, and scalability, it&apos;s now a bigger bottleneck for reads and writes.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. &lt;strong&gt;Ecosystem Alignment&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Table Sources&lt;/strong&gt;&lt;br&gt;
✅ Works well for ecosystems already aligned around Polaris or compatible systems.&lt;br&gt;
⚠️ Other catalogs would need to implement Polaris-compatible logic to ensure portability. (We saw catalogs adopt the Iceberg REST Spec as it become the standard, so there is precedent)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;REST Spec Extension&lt;/strong&gt;&lt;br&gt;
✅ Builds on a known spec (Iceberg REST), which already has buy-in across many vendors.&lt;br&gt;
✅ Keeps catalogs interchangeable if they adhere to the same read/write API contract.&lt;br&gt;
⚠️ Requires coordination and consensus across the community, which can slow down adoption.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. &lt;strong&gt;Developer Experience&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Table Sources&lt;/strong&gt;&lt;br&gt;
✅ Clear division of responsibility: catalog governs metadata, engines execute logic.&lt;br&gt;
✅ External services (source services) handle complexity and can evolve independently.&lt;br&gt;
⚠️ Requires more infrastructure components to be deployed and maintained.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;API Extensions&lt;/strong&gt;&lt;br&gt;
✅ Simplifies engine logic, engines just hand off files and scan what they’re told.&lt;br&gt;
⚠️ Catalog APIs become more complex and require tighter validation of inputs and outputs.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;In practice, both paths have strengths, and challenges. A hybrid model could even emerge: catalogs like Polaris could lead the way with working implementations, while the community formalizes an API spec based on what works.&lt;/p&gt;
&lt;p&gt;The real question isn’t which is “better”, it’s which path brings the most durable, portable, and scalable standard to life.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Intro to Apache Iceberg with Apache Polaris and Apache Spark</title><link>https://iceberglakehouse.com/posts/2025-10-intro-to-apache-iceberg-with-apache-polaris-and-apache-spark/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-10-intro-to-apache-iceberg-with-apache-polaris-and-apache-spark/</guid><description>
**Get Data Lakehouse Books:**

- [Apache Iceberg: The Definitive Guide](https://drmevn.fyi/tableformatblog)
- [Apache Polaris: The Defintive Guide](h...</description><pubDate>Thu, 16 Oct 2025 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Defintive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.puppygraph.com/ebooks/apache-iceberg-digest-vol-1&quot;&gt;The Apache Iceberg Digest: Vol. 1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Community:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Roll&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://osscommunity.com&quot;&gt;OSS Community Listings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://developer.dremio.com&quot;&gt;Dremio Lakehouse Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Modern analytics depend on flexibility. Teams want to query raw data with the same speed and reliability they expect from a warehouse. That goal led to the rise of the &lt;em&gt;data lakehouse&lt;/em&gt;, an architecture that unifies structured and unstructured data while supporting multiple compute engines.&lt;/p&gt;
&lt;p&gt;The lakehouse model removes silos by allowing data to live in open formats, accessible to tools like Spark, Trino, Dremio, and Flink. Interoperability becomes the foundation of this design: storage is separated from compute, and metadata lives in a shared catalog. Apache Iceberg sits at the center of this open ecosystem.&lt;/p&gt;
&lt;h2&gt;The Lakehouse and the Value of Interoperability&lt;/h2&gt;
&lt;p&gt;Traditional data systems often forced teams to choose between performance and openness. Data warehouses provided fast queries but required proprietary formats and vendor lock-in. Data lakes offered openness and low cost but lacked reliability and consistent schema management.&lt;/p&gt;
&lt;p&gt;The lakehouse combines both. It keeps data in object storage while using open table formats like Apache Iceberg to bring reliability, version control, and transactional guarantees. This allows multiple engines to read and write the same datasets without duplication.&lt;/p&gt;
&lt;p&gt;Interoperability is the key advantage. When organizations use open standards, they can build systems that evolve without re-platforming. Governance, lineage, and performance optimizations can be shared across tools, creating one consistent view of enterprise data.&lt;/p&gt;
&lt;h2&gt;Apache Iceberg’s Role in the Lakehouse&lt;/h2&gt;
&lt;p&gt;Apache Iceberg is the open table format that makes the lakehouse possible. It defines how large analytic tables are stored, versioned, and accessed in cloud or on-premises object storage. Iceberg tracks snapshots of data files, enabling ACID transactions, schema evolution, and time travel.&lt;/p&gt;
&lt;p&gt;Each Iceberg table is independent of any single compute engine. Spark, Dremio, Trino, and Flink can all operate on the same tables because the format defines a consistent API for reading and writing data. This makes Iceberg a shared foundation for analytics across the open data ecosystem.&lt;/p&gt;
&lt;p&gt;In practice, Iceberg replaces the old Hive Metastore model with a more scalable and flexible metadata structure. Tables are self-describing, and every change creates a new immutable snapshot. This design not only enables concurrency and rollback but also ensures that the same data can be reliably queried from different engines without conflict.&lt;/p&gt;
&lt;h2&gt;The Structure of an Apache Iceberg Table&lt;/h2&gt;
&lt;p&gt;An Apache Iceberg table is more than a collection of data files. It is a structured system that records every version of a dataset, allowing engines to read, write, and track changes with full transactional integrity. Understanding this structure helps explain how Iceberg enables features like time travel, schema evolution, and partition management.&lt;/p&gt;
&lt;p&gt;At the top level, each table has a &lt;strong&gt;metadata directory&lt;/strong&gt; that contains JSON files describing the current state of the table. These files point to &lt;strong&gt;snapshot metadata&lt;/strong&gt;, which lists all the data files that make up the current version. Each snapshot references one or more &lt;strong&gt;manifest lists&lt;/strong&gt;, and each manifest list points to multiple &lt;strong&gt;manifest files&lt;/strong&gt;. Manifest files contain the actual list of data files, typically Parquet, ORC, or Avro, along with partition information and statistics.&lt;/p&gt;
&lt;p&gt;Every time you insert, delete, or update data, Iceberg creates a new snapshot without rewriting the existing files. This immutable design ensures that multiple users and engines can safely interact with the same table at the same time. It also makes rollback and version tracking possible, since previous snapshots are always preserved until explicitly expired.&lt;/p&gt;
&lt;p&gt;Iceberg also introduces a flexible approach to partitioning. Instead of static directories like in Hive, Iceberg uses &lt;strong&gt;partition transforms&lt;/strong&gt; that record logical rules, such as &lt;code&gt;bucket(8, id)&lt;/code&gt; or &lt;code&gt;months(order_date)&lt;/code&gt;, directly in metadata. This allows the table to manage partitions dynamically, improving query performance while keeping partitioning transparent to users.&lt;/p&gt;
&lt;p&gt;Together, these components form a self-contained and versioned system that makes object storage behave like a transactional database. In the next section, you’ll set up an environment using Apache Polaris and Apache Spark to see how this structure works in practice.&lt;/p&gt;
&lt;h2&gt;Setting Up the Environment&lt;/h2&gt;
&lt;p&gt;To explore how Apache Iceberg works in practice, you’ll use a local setup that includes three components: &lt;strong&gt;Apache Polaris&lt;/strong&gt;, &lt;strong&gt;MinIO&lt;/strong&gt;, and &lt;strong&gt;Apache Spark&lt;/strong&gt;. Polaris will serve as the catalog that manages Iceberg metadata, MinIO will act as your S3-compatible storage system, and Spark will be your compute engine for creating and querying tables.&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;catalog&lt;/strong&gt; in Iceberg defines where tables are stored and how their metadata is managed. It is responsible for keeping track of namespaces, table locations, and access control. Apache Polaris provides an open-source implementation of an Iceberg catalog that exposes a REST API for managing these operations. Polaris also adds governance features, authentication, roles, and permissions, making it more than just a metadata store.&lt;/p&gt;
&lt;p&gt;Within Polaris, users and services are represented as &lt;strong&gt;principals&lt;/strong&gt;, each with unique credentials that determine what they can access. You can assign roles and privileges to principals, giving them permission to create, update, or query catalogs and tables. This design allows multiple tools to share a single governed catalog while maintaining secure, fine-grained access.&lt;/p&gt;
&lt;h3&gt;Starting the Environment&lt;/h3&gt;
&lt;p&gt;Clone the quickstart repository and start the environment using Docker Compose:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/AlexMercedCoder/Apache-Polaris-Apache-Iceberg-Minio-Spark-Quickstart.git
cd Apache-Polaris-Apache-Iceberg-Minio-Spark-Quickstart
docker compose up -d
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will launch:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Polaris on port &lt;code&gt;8181&lt;/code&gt; (catalog API)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;MinIO on ports &lt;code&gt;9000&lt;/code&gt; and &lt;code&gt;9001&lt;/code&gt; (S3 and web console)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Spark with Jupyter Notebook on port &lt;code&gt;8888&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can verify that all containers are running with:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker ps
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once they’re up, open the Jupyter Notebook interface by visiting &lt;code&gt;http://localhost:8888&lt;/code&gt;. Create a new Python notebook and copy the contents of &lt;code&gt;bootstrap.py&lt;/code&gt; from the repository into a cell. Running this script will bootstrap Polaris by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Creating two catalogs: &lt;code&gt;lakehouse&lt;/code&gt; and &lt;code&gt;warehouse&lt;/code&gt;, that point to MinIO buckets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Defining a principal with access credentials.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Assigning roles and granting full permissions to that principal.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When the script completes, it prints a ready-to-use Spark configuration block with all the connection details. You’ll use that configuration in the next section to create and manage Iceberg tables through Polaris.&lt;/p&gt;
&lt;h2&gt;Creating Iceberg Tables&lt;/h2&gt;
&lt;p&gt;With Polaris bootstrapped and Spark connected, you’re ready to start working with Iceberg tables. The tables you create will live in the &lt;code&gt;polaris.db&lt;/code&gt; namespace, with their data stored in your MinIO buckets. All catalog and permission management will happen automatically through Polaris.&lt;/p&gt;
&lt;p&gt;Before you begin creating tables, make sure Spark is configured to connect to Polaris. When you ran &lt;code&gt;bootstrap.py&lt;/code&gt;, the script printed out a Spark configuration block similar to the example below. This block contains the packages, catalog URI, warehouse name, and your principal’s credentials. Copy this block into a cell in your Jupyter Notebook and run it to initialize your Spark session.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Spark configuration for catalog: lakehouse
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config(&amp;quot;spark.jars.packages&amp;quot;, &amp;quot;org.apache.polaris:polaris-spark-3.5_2.13:1.1.0-incubating,org.apache.iceberg:iceberg-aws-bundle:1.10.0,io.delta:delta-spark_2.12:3.3.1,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.0&amp;quot;)
    .config(&amp;quot;spark.sql.catalog.spark_catalog&amp;quot;, &amp;quot;org.apache.spark.sql.delta.catalog.DeltaCatalog&amp;quot;)
    .config(&amp;quot;spark.sql.extensions&amp;quot;, &amp;quot;org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension&amp;quot;)
    .config(&amp;quot;spark.sql.catalog.polaris&amp;quot;, &amp;quot;org.apache.polaris.spark.SparkCatalog&amp;quot;)
    .config(&amp;quot;spark.sql.catalog.polaris.uri&amp;quot;, &amp;quot;http://polaris:8181/api/catalog&amp;quot;)
    .config(&amp;quot;spark.sql.catalog.polaris.warehouse&amp;quot;, &amp;quot;lakehouse&amp;quot;)
    .config(&amp;quot;spark.sql.catalog.polaris.credential&amp;quot;, &amp;quot;{client_id}:{client_secret}&amp;quot;)
    .config(&amp;quot;spark.sql.catalog.polaris.scope&amp;quot;, &amp;quot;PRINCIPAL_ROLE:ALL&amp;quot;)
    .config(&amp;quot;spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation&amp;quot;, &amp;quot;vended-credentials&amp;quot;)
    .config(&amp;quot;spark.sql.catalog.polaris.token-refresh-enabled&amp;quot;, &amp;quot;true&amp;quot;)
    .getOrCreate())

spark.sql(&amp;quot;CREATE NAMESPACE IF NOT EXISTS polaris.db&amp;quot;).show()
spark.sql(&amp;quot;CREATE TABLE IF NOT EXISTS polaris.db.example (name STRING)&amp;quot;).show()
spark.sql(&amp;quot;INSERT INTO polaris.db.example VALUES (&apos;example value&apos;)&amp;quot;).show()
spark.sql(&amp;quot;SELECT * FROM polaris.db.example&amp;quot;).show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;client_id&lt;/code&gt; and &lt;code&gt;client_secret&lt;/code&gt; values should be filled in with the code printed at the end of running your bootstrap script. Once the Spark session starts, you’ll be able to issue SQL commands directly against Polaris.&lt;/p&gt;
&lt;h3&gt;Creating a Basic Table&lt;/h3&gt;
&lt;p&gt;Start by setting your working namespace and creating a simple unpartitioned table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# eliminates the need to prefix table names with the namespace polaris.db
spark.sql(&amp;quot;USE polaris.db&amp;quot;)

spark.sql(&amp;quot;&amp;quot;&amp;quot;
CREATE TABLE customers (
    id INT,
    name STRING,
    city STRING
)
USING iceberg
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates a new Iceberg table tracked by Polaris. You can confirm its existence by listing all tables in the namespace:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;SHOW TABLES IN polaris.db&amp;quot;).show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now open the MinIO console at &lt;code&gt;http://localhost:9001&lt;/code&gt; (admin/password are the credentials) and explore the lakehouse bucket - you’ll see a new folder structure created for your table. This directory contains the Parquet data files and the metadata that Polaris manages.&lt;/p&gt;
&lt;h3&gt;Partitioned Tables&lt;/h3&gt;
&lt;p&gt;Partitioning helps improve performance by organizing data into logical groups. Iceberg’s partition transforms let you define flexible strategies without depending on directory names.&lt;/p&gt;
&lt;p&gt;Partition by a single column:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
CREATE TABLE polaris.db.sales (
    sale_id INT,
    product STRING,
    quantity INT,
    city STRING
)
USING iceberg
PARTITIONED BY (city)
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Partition by time:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
CREATE TABLE polaris.db.orders (
    order_id INT,
    customer_id INT,
    order_date DATE,
    total DECIMAL(10,2)
)
USING iceberg
PARTITIONED BY (months(order_date))
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Partition by hash buckets for even data distribution:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
CREATE TABLE polaris.db.transactions (
    txn_id BIGINT,
    user_id BIGINT,
    amount DOUBLE
)
USING iceberg
PARTITIONED BY (bucket(8, user_id))
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each strategy changes how Iceberg organizes data, but all are tracked as metadata: not directories, making future changes safe and reversible.&lt;/p&gt;
&lt;p&gt;After creating your tables, return to the MinIO console to explore the results. You’ll notice new directories and metadata files representing the structure of each table. These files are created and tracked automatically by Polaris, ensuring that every write, update, and schema change remains consistent across all engines that connect to the catalog.&lt;/p&gt;
&lt;h2&gt;Inserting Data&lt;/h2&gt;
&lt;p&gt;Once your tables are created, you can begin inserting and modifying data through Spark. Every write operation, whether it’s an insert, update, or delete, creates a new &lt;strong&gt;snapshot&lt;/strong&gt; in Iceberg. Each snapshot represents a consistent view of your table at a specific point in time and is recorded in Polaris’s metadata catalog.&lt;/p&gt;
&lt;p&gt;Start with a simple insert:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
INSERT INTO polaris.db.customers VALUES
(1, &apos;Alice&apos;, &apos;New York&apos;),
(2, &apos;Bob&apos;, &apos;Chicago&apos;),
(3, &apos;Carla&apos;, &apos;Boston&apos;)
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After this insert, open the MinIO console and look inside the lakehouse bucket under polaris/db/customers. You’ll see a new folder structure containing Parquet files and Iceberg metadata files (metadata.json, snapshots, and manifests). Each write creates new files rather than overwriting existing ones, which is how Iceberg maintains atomic transactions and rollback capabilities.&lt;/p&gt;
&lt;h3&gt;Inserting into Partitioned Tables&lt;/h3&gt;
&lt;p&gt;If you created partitioned tables earlier, Iceberg will automatically place data into the correct partitions based on your table definition:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
INSERT INTO polaris.db.sales VALUES
(101, &apos;Laptop&apos;, 5, &apos;New York&apos;),
(102, &apos;Tablet&apos;, 3, &apos;Boston&apos;),
(103, &apos;Phone&apos;, 7, &apos;Chicago&apos;)
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To confirm partitioning, you can check MinIO. Each partition value (in this case, city) will have its own subdirectory. Iceberg manages these directories automatically through metadata, keeping partitioning invisible to end users.&lt;/p&gt;
&lt;h3&gt;Working with Larger Datasets&lt;/h3&gt;
&lt;p&gt;For larger datasets, you can also write directly from a DataFrame:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;data = [(201, &apos;Monitor&apos;, 2, &apos;Denver&apos;),
        (202, &apos;Keyboard&apos;, 10, &apos;Austin&apos;)]

df = spark.createDataFrame(data, [&apos;sale_id&apos;, &apos;product&apos;, &apos;quantity&apos;, &apos;city&apos;])
df.writeTo(&amp;quot;polaris.db.sales&amp;quot;).append()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This method is efficient for batch operations and ensures your Spark DataFrames integrate cleanly with Iceberg’s transaction system.&lt;/p&gt;
&lt;p&gt;Each time you perform a write, Polaris updates the catalog with a new snapshot ID. These snapshots allow you to query your table as it existed at any point in time, a capability you’ll explore later in the section on time travel.&lt;/p&gt;
&lt;p&gt;For now, review the lakehouse bucket in MinIO after each insert to see how Iceberg adds new Parquet and metadata files. Each transaction tells a story of how the table evolves over time, tracked and governed by Polaris.&lt;/p&gt;
&lt;h2&gt;Update, Delete, and Merge Into&lt;/h2&gt;
&lt;p&gt;Apache Iceberg provides full ACID transaction support, allowing you to update, delete, and merge data safely. Each of these operations creates a new snapshot while preserving older versions of the table, giving you consistent rollback and auditing capabilities. Polaris tracks these changes in its catalog so that every engine accessing the table sees a consistent state.&lt;/p&gt;
&lt;h3&gt;Updating Data&lt;/h3&gt;
&lt;p&gt;Use &lt;code&gt;UPDATE&lt;/code&gt; to modify existing records. For example, if one of your customers relocates:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
UPDATE polaris.db.customers
SET city = &apos;San Francisco&apos;
WHERE name = &apos;Alice&apos;
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This statement creates a new snapshot that replaces the affected rows with updated data. Iceberg performs this by rewriting only the data files that contain the changed rows, which keeps transactions efficient even at scale.&lt;/p&gt;
&lt;h3&gt;Deleting Data&lt;/h3&gt;
&lt;p&gt;You can delete records using a standard &lt;code&gt;DELETE&lt;/code&gt; statement:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
DELETE FROM polaris.db.customers
WHERE name = &apos;Bob&apos;
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After running this command, open the MinIO console and look at the customers directory in the lakehouse bucket. You’ll notice new Parquet and metadata files have appeared, Iceberg never mutates existing files. Instead, it writes new ones and updates the catalog’s snapshot metadata through Polaris.&lt;/p&gt;
&lt;h3&gt;Merging Data (Upserts)&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;MERGE INTO&lt;/code&gt; command allows you to perform upserts, merging new records with existing data based on a matching key. This is especially useful when syncing incremental updates from another source.&lt;/p&gt;
&lt;p&gt;First, create a temporary table or view that holds your new data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
CREATE OR REPLACE TEMP VIEW updates AS
SELECT 1 AS id, &apos;Alice&apos; AS name, &apos;Seattle&apos; AS city
UNION ALL
SELECT 4 AS id, &apos;Dana&apos; AS name, &apos;Austin&apos; AS city
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then merge it into your main table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
MERGE INTO polaris.db.customers AS target
USING updates AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET city = source.city
WHEN NOT MATCHED THEN INSERT *
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After the merge completes, Polaris will record a new snapshot in the catalog. You can query the customers.history or customers.snapshots metadata tables to see when and how the change occurred.&lt;/p&gt;
&lt;p&gt;Each of these operations, &lt;code&gt;UPDATE&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;, and &lt;code&gt;MERGE INTO&lt;/code&gt;, produces new files in MinIO and new snapshots in Polaris. This versioned structure ensures your tables remain fully auditable. Take a moment to check the lakehouse bucket again after running each command. You’ll see Iceberg’s design in action: immutable data files, evolving metadata, and transparent version control, all orchestrated through Polaris.&lt;/p&gt;
&lt;h2&gt;Altering Partition Scheme&lt;/h2&gt;
&lt;p&gt;Over time, your table’s partitioning strategy may need to change as data grows or query patterns evolve. Apache Iceberg allows you to alter partition schemes safely, without rewriting existing files. This flexibility is one of Iceberg’s biggest advantages over traditional data lake formats. All changes are tracked by Polaris, ensuring that the catalog always reflects the current partition structure.&lt;/p&gt;
&lt;p&gt;Suppose your &lt;code&gt;sales&lt;/code&gt; table is currently partitioned by city. If queries start filtering by &lt;code&gt;product&lt;/code&gt; instead, you can modify the table’s partitioning to better suit that use case. Start by dropping the old partition field:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
ALTER TABLE polaris.db.sales
DROP PARTITION FIELD city
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then add a new partition field:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
ALTER TABLE polaris.db.sales
ADD PARTITION FIELD bucket(8, product)
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This change affects only future writes. Existing data remains organized by the previous partition scheme, while new records follow the new one. Iceberg’s metadata model keeps track of both versions, so queries continue to return complete results without manual migration.&lt;/p&gt;
&lt;p&gt;To verify your table’s current partitioning, run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;SHOW PARTITIONS polaris.db.sales&amp;quot;).show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also view the table’s partition history through the metadata tables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;SELECT * FROM polaris.db.sales.partitions&amp;quot;).show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After altering partition fields, try inserting new records and observe how Iceberg places them into new directories in MinIO. Open the lakehouse bucket in the MinIO console, navigate to your sales folder, and you’ll see both the old and new partition structures coexisting under the same table. Polaris ensures the catalog references all of them correctly.&lt;/p&gt;
&lt;p&gt;This feature makes partition evolution seamless. You can adapt to new data patterns or performance needs without downtime, data duplication, or complex ETL steps. In the next section, you’ll learn how to explore Iceberg’s built-in metadata tables and use time travel to query historical versions of your data.&lt;/p&gt;
&lt;h2&gt;Metadata Tables and Time Travel&lt;/h2&gt;
&lt;p&gt;Apache Iceberg doesn’t just store data - it stores the entire history of your data. Every write operation creates a new snapshot, and every snapshot is tracked in the table’s metadata. These metadata tables give you full visibility into how your data changes over time. Because Polaris manages the catalog, you can query these tables from any engine that connects to it, ensuring a unified and governed view of your data lifecycle.&lt;/p&gt;
&lt;h3&gt;Exploring Metadata Tables&lt;/h3&gt;
&lt;p&gt;Each Iceberg table automatically includes several metadata tables that you can query just like normal tables:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;history&lt;/strong&gt; – shows when snapshots were created.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;snapshots&lt;/strong&gt; – lists snapshot IDs and timestamps.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;files&lt;/strong&gt; – lists all data and manifest files in each snapshot.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;manifests&lt;/strong&gt; – details how files are grouped and filtered.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can explore them with Spark SQL:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;SELECT * FROM polaris.db.sales.history&amp;quot;).show()
spark.sql(&amp;quot;SELECT * FROM polaris.db.sales.snapshots&amp;quot;).show()
spark.sql(&amp;quot;SELECT * FROM polaris.db.sales.files&amp;quot;).show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These tables reveal every version of your dataset, what files were written, when they were created, and by which operation. You can use this information for auditing, debugging, or optimizing table performance.&lt;/p&gt;
&lt;h3&gt;Querying Past Versions with Time Travel&lt;/h3&gt;
&lt;p&gt;Because Iceberg stores all historical snapshots, you can query data as it existed at a specific point in time. You can travel through time using either a snapshot ID or a timestamp.&lt;/p&gt;
&lt;p&gt;First, identify a snapshot ID from the snapshots table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;SELECT snapshot_id, committed_at FROM polaris.db.sales.snapshots&amp;quot;).show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then query that version:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.read.option(&amp;quot;snapshot-id&amp;quot;, &amp;quot;&amp;lt;snapshot_id&amp;gt;&amp;quot;).table(&amp;quot;polaris.db.sales&amp;quot;).show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Alternatively, you can query the table as it existed at a given timestamp:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.read.option(&amp;quot;as-of-timestamp&amp;quot;, &amp;quot;2025-10-10T12:00:00&amp;quot;).table(&amp;quot;polaris.db.sales&amp;quot;).show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This ability to reproduce historical states makes Iceberg ideal for debugging ETL processes, reproducing analytics, or auditing compliance-related datasets.&lt;/p&gt;
&lt;h3&gt;Seeing It in MinIO&lt;/h3&gt;
&lt;p&gt;Each time you insert, update, or delete data, Iceberg records a new snapshot. Open the lakehouse bucket in MinIO and navigate through your table directories - you’ll notice subdirectories under metadata/ representing each snapshot and manifest. Every change to your data produces new metadata and data files, which together describe the complete history of your table.&lt;/p&gt;
&lt;p&gt;Iceberg’s metadata and time travel capabilities, combined with Polaris’s catalog management, give you full traceability and reproducibility. In the next section, you’ll learn how to keep your tables healthy by compacting small files and expiring old snapshots.&lt;/p&gt;
&lt;h2&gt;Compaction and Snapshot Expiration&lt;/h2&gt;
&lt;p&gt;As you run inserts, updates, and merges, Iceberg continuously creates new data and metadata files. Over time, this can lead to many small files and obsolete snapshots. To maintain performance and control storage costs, Iceberg provides built-in maintenance operations for compaction and snapshot expiration. With Polaris managing the catalog, these optimizations remain consistent and trackable across all compute engines that access your tables.&lt;/p&gt;
&lt;h3&gt;Compacting Small Files&lt;/h3&gt;
&lt;p&gt;Small files are common in streaming or frequent batch ingestion workflows. Iceberg can merge them into fewer, larger files using the &lt;code&gt;rewrite_data_files&lt;/code&gt; procedure. This reduces overhead during query planning and execution.&lt;/p&gt;
&lt;p&gt;Run the following command from Spark to compact your table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
CALL polaris.system.rewrite_data_files(&apos;polaris.db.sales&apos;)
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also target specific partitions or filter files by size:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
CALL polaris.system.rewrite_data_files(
  table =&amp;gt; &apos;polaris.db.sales&apos;,
  options =&amp;gt; map(&apos;min-input-files&apos;, &apos;4&apos;, &apos;max-concurrent-rewrites&apos;, &apos;2&apos;)
)
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After compaction, check your lakehouse bucket in MinIO. You’ll notice fewer Parquet files, each larger in size. Iceberg automatically updates manifests and metadata files so that queries continue to return accurate results with better performance.&lt;/p&gt;
&lt;h3&gt;Expiring Old Snapshots&lt;/h3&gt;
&lt;p&gt;Every Iceberg operation creates a snapshot. Over time, unused snapshots can accumulate, consuming metadata space and storage. Iceberg allows you to remove these safely using the expire_snapshots procedure.&lt;/p&gt;
&lt;p&gt;For example, to remove snapshots older than seven days:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
CALL polaris.system.expire_snapshots(
  table =&amp;gt; &apos;polaris.db.sales&apos;,
  older_than =&amp;gt; TIMESTAMPADD(DAY, -7, CURRENT_TIMESTAMP)
)
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also specify how many snapshots to retain regardless of age:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
CALL polaris.system.expire_snapshots(
  table =&amp;gt; &apos;polaris.db.sales&apos;,
  retain_last =&amp;gt; 5
)
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Polaris automatically tracks the catalog state after expiration, ensuring that all compute engines accessing the table remain synchronized with the current set of snapshots.&lt;/p&gt;
&lt;h3&gt;Monitoring with Metadata Tables&lt;/h3&gt;
&lt;p&gt;After compaction or expiration, you can verify changes using the metadata tables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;SELECT * FROM polaris.db.sales.snapshots&amp;quot;).show()
spark.sql(&amp;quot;SELECT * FROM polaris.db.sales.manifests&amp;quot;).show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You’ll see fewer manifests and snapshots, confirming that Iceberg has reclaimed space and simplified query planning.&lt;/p&gt;
&lt;p&gt;Maintenance operations like compaction and snapshot expiration help keep your Iceberg tables fast and cost-efficient. Combined with Polaris’s centralized catalog, these operations stay consistent across all connected engines. Whether you’re using Spark, Dremio, Trino, or Flink, Polaris ensures a single source of truth for your Iceberg metadata, making performance optimization and governance effortless.&lt;/p&gt;
&lt;h2&gt;Writing Efficiently to Apache Iceberg with Spark&lt;/h2&gt;
&lt;p&gt;When working with Apache Iceberg tables in Spark, how you write data has a major impact on performance, metadata growth, and maintenance frequency. Iceberg is designed for incremental writes and schema evolution, but inefficient write patterns: like frequent small updates or poor partitioning, can lead to excessive snapshots and small files. By tuning Spark and table-level settings, you can reduce the need for costly compaction and keep your tables query-ready.&lt;/p&gt;
&lt;h3&gt;Optimize File Size and Shuffle Configuration&lt;/h3&gt;
&lt;p&gt;Each write produces data files that Spark generates in parallel tasks. If your partitions are too small or the number of shuffle tasks is too high, Spark creates many tiny files, increasing metadata overhead and slowing queries. To control this, adjust Spark’s shuffle and output configurations before writing:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.conf.set(&amp;quot;spark.sql.shuffle.partitions&amp;quot;, 8)
spark.conf.set(&amp;quot;spark.sql.files.maxRecordsPerFile&amp;quot;, 5_000_000)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These settings reduce the number of output files per job and encourage larger Parquet files (typically &lt;code&gt;128–512 MB&lt;/code&gt; each). You can also call &lt;code&gt;.coalesce()&lt;/code&gt; or &lt;code&gt;.repartition()&lt;/code&gt; before writes to further control file output:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;df.coalesce(8).writeTo(&amp;quot;polaris.db.sales&amp;quot;).append()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Balanced partitioning and file sizing keep your table fast and avoid unnecessary metadata bloat.&lt;/p&gt;
&lt;h3&gt;Use Table Properties to Guide Iceberg Behavior&lt;/h3&gt;
&lt;p&gt;Iceberg provides table-level configuration options that influence how data is written, compacted, and validated. You can define them during table creation or later using &lt;code&gt;ALTER TABLE&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE polaris.db.sales (
  id BIGINT,
  region STRING,
  sale_date DATE,
  amount DOUBLE
)
USING iceberg
PARTITIONED BY (days(sale_date))
TBLPROPERTIES (
  &apos;write.target-file-size-bytes&apos;=&apos;268435456&apos;,  -- 256 MB target file size
  &apos;commit.manifest-merge.enabled&apos;=&apos;true&apos;,       -- reduces manifest churn
  &apos;write.distribution-mode&apos;=&apos;hash&apos;,             -- distributes data evenly
  &apos;write.merge.mode&apos;=&apos;copy-on-write&apos;            -- ensures clean updates
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also modify these settings later:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE polaris.db.sales SET TBLPROPERTIES (
  &apos;write.target-file-size-bytes&apos;=&apos;536870912&apos;  -- 512 MB
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Setting appropriate table properties ensures consistent behavior across all engines: Spark, Dremio, or Flink, that share your Polaris catalog.&lt;/p&gt;
&lt;h3&gt;Batch and Append Data Strategically&lt;/h3&gt;
&lt;p&gt;Each write in Iceberg creates a new snapshot. If your application writes too frequently (e.g., per record or small microbatch), metadata grows quickly and queries slow down. Instead, buffer data into larger batches before committing:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;batch_df.writeTo(&amp;quot;polaris.db.sales&amp;quot;).append()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you need streaming ingestion, tune the microbatch trigger interval and commit size. A five-minute trigger often balances latency and table stability better than writing every few seconds.&lt;/p&gt;
&lt;p&gt;For update-heavy workloads, consider using Merge-Into operations periodically rather than constant row-level updates:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;MERGE INTO polaris.db.sales t
USING updates u
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET amount = u.amount
WHEN NOT MATCHED THEN INSERT *
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This avoids snapshot sprawl and makes compaction less frequent.&lt;/p&gt;
&lt;h3&gt;Align Partitioning with Query Patterns&lt;/h3&gt;
&lt;p&gt;Good partitioning reduces the number of files scanned per query. Avoid partitioning by high-cardinality columns like &lt;code&gt;user_id&lt;/code&gt;. Instead, use transforms that group data efficiently:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE polaris.db.sales REPLACE PARTITION FIELD sale_date WITH days(sale_date)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or combine multiple transforms for balance:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE polaris.db.sales (
  id BIGINT,
  region STRING,
  sale_date DATE,
  amount DOUBLE
)
USING iceberg
PARTITIONED BY (bucket(8, region), days(sale_date))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These partitioning rules make pruning effective and improve both reads and writes.&lt;/p&gt;
&lt;h3&gt;Tune Commit and Validation Settings&lt;/h3&gt;
&lt;p&gt;For large write jobs, commit coordination and validation can also affect performance. Iceberg supports asynchronous manifest merging and snapshot cleanup to reduce contention:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE polaris.db.sales SET TBLPROPERTIES (
  &apos;commit.manifest-merge.enabled&apos;=&apos;true&apos;,
  &apos;commit.retry.num-retries&apos;=&apos;5&apos;,
  &apos;write.distribution-mode&apos;=&apos;hash&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These settings help large concurrent writers (for example, in Spark and Flink) commit safely to the same table without conflicts.&lt;/p&gt;
&lt;p&gt;Efficient Iceberg write patterns come from tuning Spark and table properties together. Use larger file targets, consistent partitioning, and controlled batch sizes to minimize small files and snapshot churn. By applying these strategies, your Iceberg tables will stay lean and performant, reducing the need for manual compaction or cleanup. Combined with Apache Polaris, your catalog enforces consistent governance, authentication, and metadata management across every compute engine in your lakehouse.&lt;/p&gt;
&lt;h2&gt;12. Understanding How Polaris Manages Your Iceberg Tables&lt;/h2&gt;
&lt;p&gt;Once you have optimized your write strategy, it’s worth understanding what happens behind the scenes when you write data into Iceberg tables through Apache Polaris. Polaris acts as a centralized catalog: responsible for managing all metadata about your tables, snapshots, and permissions, ensuring that every write or read operation is consistent across tools like Spark, Dremio, Trino, and Flink.&lt;/p&gt;
&lt;p&gt;When Spark writes to an Iceberg table using Polaris, the process goes beyond simply saving files to MinIO or S3. Each commit updates a &lt;strong&gt;snapshot&lt;/strong&gt;: a precise record of table state including data files, manifests, and partition metadata. Polaris stores the metadata pointers, enforces ACID guarantees, and validates that every write operation maintains table consistency.&lt;/p&gt;
&lt;h3&gt;Coordinating Metadata and Commits&lt;/h3&gt;
&lt;p&gt;Each write to an Iceberg table involves several steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Spark writes data files (usually in Parquet format) to the storage layer, such as MinIO.&lt;/li&gt;
&lt;li&gt;Spark generates a manifest list describing these new data files.&lt;/li&gt;
&lt;li&gt;The Iceberg REST client, through Polaris, updates the catalog’s metadata location and commits the new snapshot.&lt;/li&gt;
&lt;li&gt;Polaris enforces isolation and conflict detection to ensure concurrent writers don’t overwrite each other’s work.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Because Polaris manages these metadata transactions centrally, it becomes the single source of truth for all engines. This makes cross-engine interoperability reliable - Spark can write data, and Dremio or Trino can query it immediately without any manual refresh.&lt;/p&gt;
&lt;h3&gt;Governance and Security&lt;/h3&gt;
&lt;p&gt;Polaris also introduces a security layer around Iceberg. Instead of embedding access keys or S3 credentials in your Spark jobs, Polaris can &lt;strong&gt;vend temporary credentials&lt;/strong&gt; that enforce fine-grained access control. Each principal and catalog role determines what operations are allowed, ensuring that users and jobs interact only with the tables they are permitted to modify or query.&lt;/p&gt;
&lt;p&gt;This approach decouples data governance from compute infrastructure. You can manage permissions, audit access, and rotate credentials: all directly through Polaris, while still using open data lakehouse standards like Apache Iceberg.&lt;/p&gt;
&lt;h3&gt;Automatic Table Optimization in Dremio&lt;/h3&gt;
&lt;p&gt;If you use Dremio’s integrated catalog (built on Polaris), you also gain automated table optimization. Dremio monitors data size, file counts, and snapshot churn, then automatically runs compaction and metadata cleanup as needed. It maintains your Iceberg tables in an optimized state without requiring manual Spark procedures.&lt;/p&gt;
&lt;p&gt;That means you can focus on analytics, while Dremio and Polaris handle governance, credential management, and metadata consistency across all your compute platforms.&lt;/p&gt;
&lt;p&gt;With this understanding, you now have a complete end-to-end view of how Apache Spark and Apache Polaris work together to maintain a modern, open lakehouse. From efficient write strategies to managed metadata and automated optimization, you can confidently scale your Iceberg data platform knowing it’s governed, interoperable, and future-proof.&lt;/p&gt;
&lt;h2&gt;Next Steps and Expanding Your Lakehouse&lt;/h2&gt;
&lt;p&gt;Now that you’ve successfully set up Apache Polaris with Spark and Iceberg on your local machine, you’ve built a foundation for exploring the broader lakehouse ecosystem. This environment not only lets you understand Iceberg’s core table mechanics but also shows how a catalog like Polaris centralizes governance, metadata, and access control - key components of an interoperable lakehouse architecture.&lt;/p&gt;
&lt;h3&gt;Connect More Compute Engines&lt;/h3&gt;
&lt;p&gt;Polaris is designed to work seamlessly across multiple compute engines. Once your Iceberg tables are registered in Polaris, you can connect tools such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio&lt;/strong&gt; – Query and optimize Iceberg tables visually through its integrated Polaris-based catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trino&lt;/strong&gt; – Use Polaris as a REST-based catalog for federated queries across your data lake.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flink&lt;/strong&gt; – Stream data into Iceberg tables managed by Polaris for real-time analytics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DuckDB&lt;/strong&gt; or &lt;strong&gt;Python (PyIceberg)&lt;/strong&gt; – Interact directly with Iceberg tables for lightweight local exploration.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each of these engines communicates through the same Polaris REST interface, ensuring that all metadata and access control remain consistent, no matter where you query from.&lt;/p&gt;
&lt;h3&gt;Experiment with Advanced Iceberg Features&lt;/h3&gt;
&lt;p&gt;Once you’re comfortable with the basics, try exploring Iceberg’s advanced capabilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Schema Evolution&lt;/strong&gt; – Add, rename, or delete columns without rewriting data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-Level Deletes&lt;/strong&gt; – Use deletion vectors for efficient, fine-grained record removal.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Table Branching and Tagging&lt;/strong&gt; – Experiment safely with data changes using versioned metadata.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snapshot Isolation&lt;/strong&gt; – Test concurrent writes to understand Iceberg’s transaction model.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These features are fully tracked by Polaris, giving you a reliable, auditable history of every change.&lt;/p&gt;
&lt;h3&gt;Extend with Automation and Orchestration&lt;/h3&gt;
&lt;p&gt;You can also automate your setup and maintenance workflows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;Airflow&lt;/strong&gt; or &lt;strong&gt;Cron&lt;/strong&gt; to run the &lt;code&gt;bootstrap.py&lt;/code&gt; script on a schedule, ensuring consistent initialization of catalogs and principals.&lt;/li&gt;
&lt;li&gt;Create periodic &lt;strong&gt;compaction&lt;/strong&gt; or &lt;strong&gt;snapshot expiration&lt;/strong&gt; jobs using Spark SQL.&lt;/li&gt;
&lt;li&gt;Deploy your Polaris setup in &lt;strong&gt;Kubernetes&lt;/strong&gt; using Helm or Docker Compose for multi-user testing environments.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Prepare for Cloud or Hybrid Deployment&lt;/h3&gt;
&lt;p&gt;The setup you’ve built locally with MinIO can easily extend to real cloud storage systems. Replace your MinIO endpoint with S3, GCS, or Azure Blob credentials, and Polaris will manage your Iceberg tables just as before, using the same metadata model and APIs.&lt;/p&gt;
&lt;p&gt;This local-to-cloud continuity is one of the greatest advantages of Iceberg and Polaris: your data architecture can scale from a personal laptop demo to a full production lakehouse without refactoring or vendor lock-in.&lt;/p&gt;
&lt;h3&gt;Wrapping Up&lt;/h3&gt;
&lt;p&gt;You’ve now seen how Apache Iceberg, Apache Polaris, and Apache Spark work together to form a robust, open lakehouse. Through this hands-on setup, you’ve learned how to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Write and optimize Iceberg tables in Spark.&lt;/li&gt;
&lt;li&gt;Manage metadata, catalogs, and access through Polaris.&lt;/li&gt;
&lt;li&gt;Explore advanced Iceberg features safely and efficiently.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For larger-scale deployments: or if you want automated optimization, integrated governance, and performance acceleration, explore &lt;strong&gt;Dremio’s Intelligent Lakehouse Platform&lt;/strong&gt;, which builds directly on Apache Polaris and Iceberg to deliver a unified, self-service analytics experience.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The State of Apache Iceberg v4 - October 2025 Edition</title><link>https://iceberglakehouse.com/posts/2025-10-apache-iceberg-v4-october-2025/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-10-apache-iceberg-v4-october-2025/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2025-10-apache-iceberg-v4/).

**Get Data...</description><pubDate>Tue, 14 Oct 2025 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2025-10-apache-iceberg-v4/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Defintive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.puppygraph.com/ebooks/apache-iceberg-digest-vol-1&quot;&gt;The Apache Iceberg Digest: Vol. 1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Community:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Roll&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://osscommunity.com&quot;&gt;OSS Community Listings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://developer.dremio.com&quot;&gt;Dremio Lakehouse Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Apache Iceberg has come a long way since its early days of bringing reliable ACID transactions and schema evolution to the data lake. It helped teams move beyond brittle Hive tables and built the foundation for modern lakehouse architectures. But with wider adoption came new challenges - especially as workloads shifted from batch-heavy pipelines to streaming ingestion, faster commits, and more interactive use cases.&lt;/p&gt;
&lt;p&gt;That pressure has exposed some cracks in the foundation. Write-heavy applications hit metadata bottlenecks. Query planners struggle with inefficient stats. Teams managing large tables face complex migrations due to rigid path references.&lt;/p&gt;
&lt;p&gt;The Apache Iceberg community has responded with a set of focused, forward-looking proposals that make up the v4 specification. These aren’t just incremental tweaks. They represent a clear architectural shift toward scalability, operational simplicity, and real-time readiness.&lt;/p&gt;
&lt;p&gt;In this post, we’ll walk through the key features proposed for Iceberg v4, why they matter, and what they mean for data engineers, architects, and teams building at scale.&lt;/p&gt;
&lt;h2&gt;The New Iceberg Vision: Performance Meets Portability&lt;/h2&gt;
&lt;p&gt;Apache Iceberg was initially built for reliable batch analytics on cloud object storage. It solved core problems like schema evolution, snapshot isolation, and data consistency across distributed files. That foundation made it a favorite for building open data lakehouses.&lt;/p&gt;
&lt;p&gt;But today’s data platforms are evolving fast. Teams are mixing streaming and batch. Ingest rates are higher. Table sizes are bigger. Query expectations are more demanding. Managing metadata at scale has become one of the biggest friction points.&lt;/p&gt;
&lt;p&gt;The proposals in Iceberg v4 address these shifts head-on. Together, they aim to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reduce write overhead&lt;/strong&gt; so commits scale with ingestion speed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improve query planning&lt;/strong&gt; by making metadata easier to scan and use&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Simplify operations&lt;/strong&gt; like moving, cloning, or backing up tables&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In short, Iceberg is being re-tuned for modern workloads - ones that demand both speed and flexibility. The v4 changes aren’t just about performance. They’re about making Iceberg easier to run, easier to optimize, and better suited for the next generation of data systems.&lt;/p&gt;
&lt;h2&gt;Proposal 1: Single-File Commits – Cutting Down Metadata Overhead&lt;/h2&gt;
&lt;p&gt;Every commit to an Iceberg table today creates at least two new metadata files: one for the updated manifest list, and another for any changed manifests. In fast-moving environments: like streaming ingestion or micro-batch pipelines, this adds up quickly.&lt;/p&gt;
&lt;p&gt;The result? Write amplification. For every small data change, there’s a burst of I/O to update metadata. Over time, this leads to thousands of small metadata files, bloated storage, and a slowdown in commit throughput. Teams often have to schedule compaction jobs to clean up the data.&lt;/p&gt;
&lt;p&gt;The v4 proposal introduces &lt;strong&gt;Single-File Commits&lt;/strong&gt;, a new way to consolidate all metadata changes into a single file per commit. This reduces:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The number of file system operations per commit&lt;/li&gt;
&lt;li&gt;The coordination overhead for concurrent writers&lt;/li&gt;
&lt;li&gt;The need for frequent compaction&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By minimizing I/O and simplifying commit logic, this change unlocks faster ingestion and makes Iceberg friendlier to real-time workflows. It also means fewer moving parts to manage and fewer edge cases to debug in production.&lt;/p&gt;
&lt;h2&gt;Proposal 2: Parquet for Metadata – Smarter Query Planning&lt;/h2&gt;
&lt;p&gt;Today, Iceberg stores metadata files: like manifests and manifest lists, in &lt;strong&gt;Apache Avro&lt;/strong&gt;, a row-based format. While this made sense early on, it’s become a bottleneck for query performance.&lt;/p&gt;
&lt;p&gt;Why? Because most query engines don’t need every field in the metadata. For example, if a planner wants to filter files based on a column’s min and max values, it only needs that one field. But with Avro, it has to read and deserialize entire rows just to access a few columns.&lt;/p&gt;
&lt;p&gt;The proposed change in Iceberg v4 is to &lt;strong&gt;use Parquet instead of Avro&lt;/strong&gt; for metadata files. Since Parquet is a columnar format, engines can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Read only the fields they need&lt;/li&gt;
&lt;li&gt;Skip over irrelevant parts of the file&lt;/li&gt;
&lt;li&gt;Load metadata faster and use less memory&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This shift isn’t just about speed - it enables smarter planning. Engines can project just the stats they care about, sort and filter more effectively, and better optimize execution plans. It’s a small architectural change with a big ripple effect across the query lifecycle.&lt;/p&gt;
&lt;h2&gt;Proposal 3: Column Statistics Overhaul – Better Skipping, Smarter Queries&lt;/h2&gt;
&lt;p&gt;Metadata isn&apos;t just about file paths - it&apos;s also about understanding what’s inside each file. Iceberg uses column-level statistics to help query engines skip files that don’t match filter conditions. But the current stats format has limitations that hold back performance.&lt;/p&gt;
&lt;p&gt;Right now, statistics are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Flat and untyped, with no indication of data type&lt;/li&gt;
&lt;li&gt;Stored as generic key-value pairs&lt;/li&gt;
&lt;li&gt;Lacking detail on things like null counts or nested fields&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These gaps make it hard for query planners to fully optimize their logic. For example, it&apos;s difficult to distinguish between a missing value and a null, or to reason about nested data structures like structs and arrays.&lt;/p&gt;
&lt;p&gt;The v4 spec proposes a &lt;strong&gt;redesigned statistics format&lt;/strong&gt; with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Type information for every stat&lt;/li&gt;
&lt;li&gt;Projectable structures for selective reads&lt;/li&gt;
&lt;li&gt;Support for more detailed metrics, including null counts and nested fields&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This richer structure enables more precise file pruning and better cost-based optimization. Engines can make smarter decisions about which files to read and which filters to push down - leading to faster queries, less I/O, and improved overall performance.&lt;/p&gt;
&lt;h2&gt;Proposal 4: Relative Paths – Making Tables Portable Again&lt;/h2&gt;
&lt;p&gt;In current versions of Iceberg, metadata files store &lt;strong&gt;absolute file paths&lt;/strong&gt;. That might seem fine at first - until you try to move a table.&lt;/p&gt;
&lt;p&gt;If you change storage accounts, rename a bucket, or migrate between environments, every path in every metadata file becomes invalid. Fixing that means scanning and rewriting all metadata: an expensive, error-prone operation that often requires a distributed job.&lt;/p&gt;
&lt;p&gt;The v4 proposal introduces support for &lt;strong&gt;relative paths&lt;/strong&gt; in metadata. Instead of locking a table to a fixed storage location, file references are stored relative to a base URI defined in the table metadata.&lt;/p&gt;
&lt;p&gt;This change unlocks several real-world benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Simpler migrations&lt;/strong&gt; across cloud regions or storage platforms&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Easier disaster recovery&lt;/strong&gt; with portable backups&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Less brittle operations&lt;/strong&gt; when storage configurations evolve&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Relative paths decouple the logical structure of a table from its physical location. That means fewer rewrites, less maintenance overhead, and more flexibility when managing Iceberg tables at scale.&lt;/p&gt;
&lt;h2&gt;Iceberg’s Direction: Toward Operational Simplicity&lt;/h2&gt;
&lt;p&gt;Taken together, these proposals reflect a clear shift in how the Iceberg community is thinking about the format - not just as a technical layer, but as an operational foundation for modern data platforms.&lt;/p&gt;
&lt;p&gt;Here’s what’s changing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;From batch-first to real-time ready&lt;/strong&gt;: Single-file commits and smarter stats make Iceberg more suitable for streaming ingestion and low-latency use cases.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;From fixed to flexible&lt;/strong&gt;: Relative paths reduce the coupling between metadata and storage, making operations like migration and backup less painful.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;From rigid to optimized&lt;/strong&gt;: Moving to columnar metadata and richer statistics gives query engines more room to optimize without heavy lifting.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is Iceberg growing up.&lt;/p&gt;
&lt;p&gt;The format has always prioritized correctness and openness. Now it’s doubling down on speed, scalability, and ease of use - especially for teams managing hundreds or thousands of tables across dynamic environments.&lt;/p&gt;
&lt;p&gt;Whether you&apos;re building AI pipelines, federated queries, or traditional dashboards, these changes aim to reduce the friction and complexity of working with large-scale tables. It’s about making Iceberg not just powerful, but practical.&lt;/p&gt;
&lt;h2&gt;A Glimpse Ahead: v4 in Context&lt;/h2&gt;
&lt;p&gt;Just a few months ago, Apache Iceberg v3 was approved, bringing meaningful improvements to the table format. That release introduced new data types, deletion vectors, and other enhancements that expanded what Iceberg can represent and how it supports evolving workloads.&lt;/p&gt;
&lt;p&gt;Right now, the ecosystem is heads-down implementing v3 features across engines, catalogs, and query layers. You’ll see more engines support features like row-level deletes and richer data modeling as v3 adoption matures.&lt;/p&gt;
&lt;p&gt;The proposals for v4 aren’t intended to replace that momentum - they build on it.&lt;/p&gt;
&lt;p&gt;Think of v3 as expanding what Iceberg can do. V4 focuses on how efficiently and cleanly it can perform the task. These early discussions around v4 offer a forward-looking roadmap for how Iceberg will continue to evolve - toward higher throughput, better portability, and more brilliant query performance.&lt;/p&gt;
&lt;p&gt;While these changes are still in the design and discussion phase, they signal where Iceberg is heading. For data teams investing in the lakehouse stack today, it’s reassuring that the foundation will only get stronger over time.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The Ultimate Guide to Open Table Formats - Iceberg, Delta Lake, Hudi, Paimon, and DuckLake</title><link>https://iceberglakehouse.com/posts/2025-09-ultimate-guide-to-open-table-formats/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-09-ultimate-guide-to-open-table-formats/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2025-09-ultimate-table-format-guide/).

...</description><pubDate>Wed, 24 Sep 2025 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2025-09-ultimate-table-format-guide/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Defintive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/strong&gt;
&lt;strong&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Roll&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Modern lakehouse stacks live or die by &lt;strong&gt;how&lt;/strong&gt; they manage tables on cheap, scalable object storage. That “how” is the job of &lt;strong&gt;open table formats&lt;/strong&gt;, the layer that turns piles of Parquet/ORC files into reliable, ACID-compliant &lt;strong&gt;tables&lt;/strong&gt; with schema evolution, time travel, and efficient query planning. If you’ve ever wrestled with brittle Hive tables, small-file explosions, or “append-only” lakes that can’t handle updates and deletes, you already know why this layer matters.&lt;/p&gt;
&lt;p&gt;In this guide, we’ll demystify the five formats you’re most likely to encounter:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Apache Iceberg&lt;/strong&gt; - snapshot- and manifest–driven, engine-agnostic, fast for large-scale analytics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta Lake&lt;/strong&gt; - transaction-log–based, deeply integrated with Spark/Databricks, strong batch/stream unification.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Hudi&lt;/strong&gt; - built for upserts, deletes, and incremental processing; flexible COW/MOR modes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Paimon&lt;/strong&gt; - streaming-first with an LSM-like design for high-velocity updates and near-real-time reads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DuckLake&lt;/strong&gt; - a fresh, catalog-centric approach that uses a relational database for metadata (SQL all the way down).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We’ll start beginner-friendly, clarifying &lt;strong&gt;what&lt;/strong&gt; a table format is and &lt;strong&gt;why&lt;/strong&gt; it’s essential, then progressively dive into expert-level topics: &lt;strong&gt;metadata internals&lt;/strong&gt; (snapshots, logs, manifests, LSM levels), &lt;strong&gt;row-level change strategies&lt;/strong&gt; (COW, MOR, delete vectors), &lt;strong&gt;performance trade-offs&lt;/strong&gt;, &lt;strong&gt;ecosystem support&lt;/strong&gt; (Spark, Flink, Trino/Presto, DuckDB, warehouses), and &lt;strong&gt;adoption trends&lt;/strong&gt; you should factor into your roadmap.&lt;/p&gt;
&lt;p&gt;By the end, you’ll have a practical mental model to choose the right format for your workloads, whether you’re optimizing petabyte-scale analytics, enabling near-real-time CDC, or simplifying your metadata layer for developer velocity.&lt;/p&gt;
&lt;h2&gt;Why Open Table Formats Exist&lt;/h2&gt;
&lt;p&gt;Before diving into each format, it’s worth understanding &lt;em&gt;why&lt;/em&gt; open table formats became necessary in the first place.&lt;/p&gt;
&lt;p&gt;Traditional data lakes, built on raw files like CSV, JSON, or Parquet, were cheap and scalable, but brittle. They had no concept of &lt;strong&gt;transactions&lt;/strong&gt;, which meant if two jobs wrote data at the same time, you could easily end up with partial or corrupted results. Schema evolution was painful, renaming or reordering columns could break queries, and updating or deleting even a single row often meant rewriting entire partitions.&lt;/p&gt;
&lt;p&gt;Meanwhile, enterprises still needed &lt;strong&gt;database-like features&lt;/strong&gt;, updates, deletes, versioning, auditing, on their data lakes. That tension set the stage for open table formats. These formats layer &lt;strong&gt;metadata and transaction protocols&lt;/strong&gt; on top of files to give the data lake the brains of a database while keeping its open, flexible nature.&lt;/p&gt;
&lt;p&gt;In practice, open table formats deliver several critical capabilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ACID Transactions:&lt;/strong&gt; Ensure reliability for concurrent reads and writes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema Evolution:&lt;/strong&gt; Add, drop, or rename fields without breaking downstream consumers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time Travel:&lt;/strong&gt; Query data as it existed at a specific point in time for auditing or recovery.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Efficient Queries:&lt;/strong&gt; Push down filters and prune partitions/files using metadata rather than scanning everything.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-Level Mutations:&lt;/strong&gt; Support upserts, merges, and deletes on immutable storage layers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-Engine Interoperability:&lt;/strong&gt; Enable the same table to be queried by Spark, Flink, Trino, Presto, DuckDB, warehouses, and more.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In other words, table formats solve the “wild west of files” problem, turning data lakes into &lt;strong&gt;lakehouses&lt;/strong&gt; that balance scalability with structure. The differences among Iceberg, Delta, Hudi, Paimon, and DuckLake lie in &lt;em&gt;how&lt;/em&gt; they achieve this and &lt;em&gt;what trade-offs&lt;/em&gt; they make to optimize for batch, streaming, or simplicity.&lt;/p&gt;
&lt;p&gt;Next, we’ll walk through the &lt;strong&gt;history and evolution&lt;/strong&gt; of each format to see how these ideas took shape.&lt;/p&gt;
&lt;h2&gt;The Evolution of Open Table Formats&lt;/h2&gt;
&lt;p&gt;The journey of open table formats reflects the challenges companies faced as data lakes scaled from terabytes to petabytes. Each format emerged to solve specific pain points:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Apache Hudi (2016)&lt;/strong&gt; – Created at Uber to solve &lt;em&gt;freshness&lt;/em&gt; and &lt;em&gt;incremental ingestion&lt;/em&gt;. Hudi pioneered row-level upserts and deletes on data lakes, enabling near real-time pipelines on Hadoop-sized datasets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Delta Lake (2017–2018)&lt;/strong&gt; – Developed by Databricks to unify &lt;em&gt;batch and streaming&lt;/em&gt; in Spark. Its transaction log design (_delta_log) gave data lakes database-like commits and time-travel capabilities, making it a cornerstone of the “lakehouse” concept.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Apache Iceberg (2018)&lt;/strong&gt; – Born at Netflix to overcome Hive’s scalability and schema evolution limitations. Its snapshot/manifest-based metadata model provided atomic commits, partition evolution, and reliable time-travel at massive scale, quickly becoming an industry favorite.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Apache Paimon (2022)&lt;/strong&gt; – Emerging from Alibaba’s Flink ecosystem, Paimon was built &lt;em&gt;streaming-first&lt;/em&gt;. Its LSM-tree design optimized for high-throughput upserts and continuous compaction, positioning it as a bridge between real-time CDC ingestion and analytics.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;DuckLake (2025)&lt;/strong&gt; – The newest entrant, introduced by the DuckDB/MotherDuck team. Instead of managing JSON or Avro metadata files, DuckLake stores all table metadata in a relational database. This catalog-centric design aims to simplify consistency, enable multi-table transactions, and drastically speed up query planning.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These formats represent &lt;strong&gt;waves of innovation&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;First wave (Hudi, Delta): Improving upon the concept of tables on the data lake.&lt;/li&gt;
&lt;li&gt;Second wave (Iceberg): focusing on batch reliability, schema evolution, and interoperability.&lt;/li&gt;
&lt;li&gt;Third wave (Paimon, DuckLake): rethinking the architecture for real-time data and metadata simplicity.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Next, we’ll dive into &lt;strong&gt;Apache Iceberg&lt;/strong&gt; in detail, its metadata structure, features, and why it has become the default choice for many modern lakehouse deployments.&lt;/p&gt;
&lt;h2&gt;Apache Iceberg: The Batch-First Powerhouse&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Background &amp;amp; Origins&lt;/strong&gt;&lt;br&gt;
Apache Iceberg was born at Netflix in 2018 and donated to the Apache Software Foundation in 2019. Its mission was clear: fix the long-standing problems of Hive tables, unreliable schema changes, expensive directory scans, and lack of true atomicity. Iceberg introduced a clean-slate design that scaled to petabytes while guaranteeing &lt;strong&gt;ACID transactions&lt;/strong&gt;, &lt;strong&gt;schema evolution&lt;/strong&gt;, and &lt;strong&gt;time-travel queries&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Metadata Structure&lt;/strong&gt;&lt;br&gt;
Iceberg’s metadata model is built on a hierarchy of files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Table metadata file (JSON):&lt;/strong&gt; tracks schema versions, partition specs, snapshots, and properties.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snapshots:&lt;/strong&gt; each commit creates a new snapshot, representing the table’s full state at that point in time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manifest lists &amp;amp; manifests (Avro):&lt;/strong&gt; hierarchical indexes of data files, enabling partition pruning and column-level stats without scanning entire directories.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This design avoids reliance on directory listings, making planning queries over millions of files feasible.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Core Features&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Schema Evolution:&lt;/strong&gt; Add, drop, or rename columns without breaking queries, thanks to internal column IDs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition Evolution:&lt;/strong&gt; Change partitioning strategies (e.g., switch from daily to hourly partitions) without rewriting historical data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time Travel:&lt;/strong&gt; Query the table as of a specific snapshot ID or timestamp.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hidden Partitioning:&lt;/strong&gt; Abstracts partition logic from users while still enabling efficient pruning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optimistic Concurrency:&lt;/strong&gt; Writers atomically commit new snapshots, with conflict detection to prevent corruption.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Row-Level Changes&lt;/strong&gt;&lt;br&gt;
Initially copy-on-write, Iceberg now also supports &lt;strong&gt;delete files&lt;/strong&gt; for merge-on-read semantics. Deletes can be tracked separately and applied at read time, reducing write amplification for frequent updates. Background compaction later consolidates these into optimized Parquet files.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ecosystem &amp;amp; Adoption&lt;/strong&gt;&lt;br&gt;
Iceberg’s neutrality and technical strengths have driven broad adoption. It is supported in:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Engines:&lt;/strong&gt; Spark, Flink, Trino, Presto, Hive, Impala, DuckDB.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cloud platforms:&lt;/strong&gt; AWS Athena, AWS Glue, Snowflake, BigQuery, Dremio, and more.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Catalogs:&lt;/strong&gt; Hive Metastore, AWS Glue, Apache Nessie, Polaris.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By late 2024, Iceberg had become the &lt;strong&gt;de facto industry standard&lt;/strong&gt; for open table formats, with adoption by Netflix, Apple, LinkedIn, Adobe, and major cloud vendors. Its community-driven governance and rapid innovation ensure it continues to evolve, recent features like &lt;strong&gt;row-level delete vectors&lt;/strong&gt; and &lt;strong&gt;REST catalogs&lt;/strong&gt; are making it even more capable.&lt;/p&gt;
&lt;p&gt;Next, we’ll look at &lt;strong&gt;Delta Lake&lt;/strong&gt;, the transaction-log–driven format that became the backbone of Databricks’ lakehouse vision.&lt;/p&gt;
&lt;h2&gt;Delta Lake: The Transaction-Log&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Background &amp;amp; Origins&lt;/strong&gt;&lt;br&gt;
Delta Lake was introduced by Databricks around 2017–2018 to address Spark’s biggest gap: reliable transactions on cloud object storage. Open-sourced in 2019 under the Linux Foundation, Delta Lake became the backbone of Databricks’ &lt;strong&gt;lakehouse&lt;/strong&gt; pitch, combining data warehouse reliability with the scalability of data lakes. Its design centered on a simple but powerful idea: use a &lt;strong&gt;transaction log&lt;/strong&gt; to coordinate all changes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Metadata Structure&lt;/strong&gt;&lt;br&gt;
At the core of every Delta table is the &lt;code&gt;_delta_log&lt;/code&gt; directory:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;JSON transaction files:&lt;/strong&gt; Each commit appends a JSON file describing added/removed data files, schema changes, and table properties.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Checkpoints (Parquet):&lt;/strong&gt; Periodic checkpoints compact the log for faster reads, storing the authoritative list of active files at a given version.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Versioning:&lt;/strong&gt; Every commit is versioned sequentially, making time-travel queries straightforward (&lt;code&gt;VERSION AS OF&lt;/code&gt; or &lt;code&gt;TIMESTAMP AS OF&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This log-based design is simple and easy to reconstruct: replay JSON logs from the last checkpoint to reach the latest state.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Core Features&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ACID Transactions:&lt;/strong&gt; Ensures consistent reads and writes, even under concurrent Spark jobs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema Enforcement &amp;amp; Evolution:&lt;/strong&gt; Protects against incompatible writes while allowing schema growth.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time Travel:&lt;/strong&gt; Query historical versions for auditing or rollback.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified Batch &amp;amp; Streaming:&lt;/strong&gt; Spark Structured Streaming and batch jobs can read/write the same Delta table, reducing architectural complexity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Performance Optimizations:&lt;/strong&gt; Features like Z-order clustering, data skipping, and caching improve query speed (especially in Databricks’ runtime).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Change Data Feed (CDF):&lt;/strong&gt; Exposes row-level changes between versions, useful for downstream syncs and CDC pipelines.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Row-Level Changes&lt;/strong&gt;&lt;br&gt;
Delta primarily uses &lt;strong&gt;copy-on-write&lt;/strong&gt;: updates and deletes rewrite entire Parquet files while marking old ones as removed in the log. This guarantees atomicity but can be expensive at scale. To mitigate, Delta introduced &lt;strong&gt;deletion vectors&lt;/strong&gt; (in newer releases), which track row deletions without rewriting whole files, closer to merge-on-read semantics. Upserts are supported via SQL &lt;code&gt;MERGE INTO&lt;/code&gt;, commonly used for database change capture workloads.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ecosystem &amp;amp; Adoption&lt;/strong&gt;&lt;br&gt;
Delta Lake is strongest in the &lt;strong&gt;Spark ecosystem&lt;/strong&gt; and is the default format in Databricks. It’s also supported by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Engines:&lt;/strong&gt; Spark (native), Flink, Trino/Presto (via connectors).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Clouds:&lt;/strong&gt; AWS EMR, Azure Synapse, and some GCP services.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Libraries:&lt;/strong&gt; Delta Standalone (Java), Delta Rust, and integrations for Python beyond Spark.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While its openness has improved since Delta 2.0, much of its adoption remains tied to Databricks. Still, Delta Lake is one of the most widely used formats in production, powering pipelines at thousands of organizations.&lt;/p&gt;
&lt;p&gt;Next, we’ll explore &lt;strong&gt;Apache Hudi&lt;/strong&gt;, the pioneer of incremental processing and near-real-time data lake ingestion.&lt;/p&gt;
&lt;h2&gt;Apache Hudi: The Incremental Pioneer&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Background &amp;amp; Origins&lt;/strong&gt;&lt;br&gt;
Apache Hudi (short for &lt;em&gt;Hadoop Upserts Deletes and Incrementals&lt;/em&gt;) was created at Uber in 2016 to solve a pressing challenge: keeping Hive tables up to date with fresh, continuously changing data. Uber needed a way to ingest ride updates, user changes, and event streams into their Hadoop data lake without waiting hours for batch jobs. Open-sourced in 2017 and donated to Apache in 2019, Hudi became the first widely adopted table format to support &lt;strong&gt;row-level upserts and deletes&lt;/strong&gt; directly on data lakes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Metadata Structure&lt;/strong&gt;&lt;br&gt;
Hudi organizes tables around a &lt;strong&gt;commit timeline&lt;/strong&gt; stored in a &lt;code&gt;.hoodie&lt;/code&gt; directory:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Commit files:&lt;/strong&gt; Metadata describing which data files were added/removed at each commit.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;COW vs MOR modes:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Copy-on-Write (COW):&lt;/em&gt; Updates replace entire Parquet files, similar to Iceberg/Delta.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Merge-on-Read (MOR):&lt;/em&gt; Updates land in small Avro &lt;strong&gt;delta log files&lt;/strong&gt;, merged with base Parquet files at read time.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Indexes:&lt;/strong&gt; Bloom filters or hash indexes help locate records by primary key, making upserts efficient.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This dual-mode design gives engineers control over the trade-off between &lt;strong&gt;write latency&lt;/strong&gt; and &lt;strong&gt;read latency&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Core Features&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Upserts &amp;amp; Deletes by Key:&lt;/strong&gt; Guarantees a single latest record per primary key, ideal for CDC ingestion.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Incremental Pulls:&lt;/strong&gt; Query only the rows changed since a given commit, enabling efficient downstream pipelines.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compaction:&lt;/strong&gt; Background jobs merge log files into larger Parquet files for query efficiency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Savepoints &amp;amp; Rollbacks:&lt;/strong&gt; Manage table states explicitly, ensuring recovery from bad data loads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flexible Indexing:&lt;/strong&gt; Choose partitioned, global, or custom indexes to balance performance with storage cost.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Row-Level Changes&lt;/strong&gt;&lt;br&gt;
Hudi was designed for this problem. In COW mode, updates rewrite files. In MOR mode, updates are appended as &lt;strong&gt;log blocks&lt;/strong&gt;, making them queryable almost immediately. Readers can choose:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Snapshot mode&lt;/em&gt; (base + logs for freshest data).&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Read-optimized mode&lt;/em&gt; (compacted base files for speed).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Deletes are handled similarly, either as soft deletes in logs or hard deletes during compaction.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ecosystem &amp;amp; Adoption&lt;/strong&gt;&lt;br&gt;
Hudi integrates tightly with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Engines:&lt;/strong&gt; Spark (native datasource), Flink (growing support), Hive, Trino/Presto.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Clouds:&lt;/strong&gt; AWS EMR and AWS Glue have built-in Hudi support, making it popular on S3.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streaming:&lt;/strong&gt; Confluent Kafka, Debezium, and Flink CDC can stream directly into Hudi tables.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While Iceberg and Delta now dominate conversations, Hudi remains a strong choice for &lt;strong&gt;near real-time ingestion and CDC use cases&lt;/strong&gt;, particularly in AWS-centric stacks. Its flexibility (COW vs MOR) and incremental consumption features make it especially valuable for pipelines that need &lt;strong&gt;fast data freshness without sacrificing reliability&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Next, we’ll examine &lt;strong&gt;Apache Paimon&lt;/strong&gt;, the streaming-first format that extends Hudi’s incremental vision with an LSM-tree architecture.&lt;/p&gt;
&lt;h2&gt;Apache Paimon: Streaming-First by Design&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Background &amp;amp; Origins&lt;/strong&gt;&lt;br&gt;
Apache Paimon began life as &lt;strong&gt;Flink Table Store&lt;/strong&gt; at Alibaba in 2022, targeting the need for continuous, real-time data ingestion directly into data lakes. It entered the Apache Incubator in 2023 under the name &lt;em&gt;Paimon&lt;/em&gt;. Unlike Iceberg or Delta, which started with batch analytics and later added streaming features, Paimon was &lt;em&gt;streaming-first&lt;/em&gt;. Its mission: make data lakes act like a materialized view that is always up to date.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Metadata &amp;amp; Architecture&lt;/strong&gt;&lt;br&gt;
Paimon uses a &lt;strong&gt;Log-Structured Merge-tree (LSM) design&lt;/strong&gt; inspired by database internals:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;MemTables and flushes:&lt;/strong&gt; Incoming data is written to in-memory buffers, then flushed to small immutable files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-level compaction:&lt;/strong&gt; Files are continuously merged into larger sorted files in the background.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snapshots:&lt;/strong&gt; Each compaction or commit produces a new snapshot, allowing both &lt;em&gt;batch queries&lt;/em&gt; and &lt;em&gt;streaming reads&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Primary-key awareness:&lt;/strong&gt; Tables can enforce keys and apply merge rules (e.g., last-write-wins or aggregate merges).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This architecture makes &lt;strong&gt;frequent row-level changes cheap&lt;/strong&gt; (append-only writes) while deferring heavy merges to compaction tasks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Core Features&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Upserts &amp;amp; Deletes:&lt;/strong&gt; Native support for continuous CDC ingestion with efficient row-level operations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Merge Engines:&lt;/strong&gt; Configurable rules for handling key collisions (e.g., overwrite, aggregate, or log-append).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dual Read Modes:&lt;/strong&gt; Query as a static snapshot (batch) or as a change stream (streaming).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streaming/Batch Unification:&lt;/strong&gt; The same table can power batch analytics and real-time dashboards.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deletion Vectors:&lt;/strong&gt; Efficiently tracks row deletions without rewriting base files.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Row-Level Changes&lt;/strong&gt;&lt;br&gt;
Unlike Iceberg (COW with delete files) or Delta (COW with deletion vectors), Paimon is natively &lt;strong&gt;merge-on-read&lt;/strong&gt;. Updates and deletes are appended as small log segments, queryable immediately. Background compaction gradually merges them into optimized columnar files. This makes Paimon highly efficient for &lt;strong&gt;high-velocity workloads&lt;/strong&gt; like IoT streams, CDC pipelines, or real-time leaderboards.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ecosystem &amp;amp; Adoption&lt;/strong&gt;&lt;br&gt;
Paimon integrates tightly with &lt;strong&gt;Apache Flink&lt;/strong&gt;, where it feels like a natural extension of Flink SQL. It also has growing support for Spark, Hive, Trino/Presto, and OLAP systems like StarRocks and Doris. Adoption is strongest among teams building &lt;strong&gt;streaming lakehouses&lt;/strong&gt;, particularly those already invested in Flink. While younger than Iceberg or Delta, Paimon is rapidly attracting attention as organizations push for sub-minute data freshness.&lt;/p&gt;
&lt;p&gt;Next, we’ll turn to &lt;strong&gt;DuckLake&lt;/strong&gt;, the newest entrant that rethinks table metadata management by moving it entirely into SQL databases.&lt;/p&gt;
&lt;h2&gt;DuckLake: Metadata Reimagined with SQL&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Background &amp;amp; Origins&lt;/strong&gt;&lt;br&gt;
DuckLake is the newest table format, introduced in 2025 by the DuckDB and MotherDuck teams. Unlike earlier formats that manage metadata with JSON logs or Avro manifests, DuckLake flips the script: it stores &lt;strong&gt;all table metadata in a relational SQL database&lt;/strong&gt;. This approach is inspired by how cloud warehouses like Snowflake and BigQuery already manage metadata internally, but DuckLake makes it open and interoperable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Metadata &amp;amp; Architecture&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SQL Catalog:&lt;/strong&gt; Metadata such as snapshots, schemas, file lists, and statistics are persisted as ordinary relational tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Transactions:&lt;/strong&gt; Updates to metadata happen through standard SQL transactions, ensuring strong ACID guarantees without relying on object-store semantics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-Table Transactions:&lt;/strong&gt; Because it’s database-backed, DuckLake supports atomic operations across multiple tables, something file-based formats struggle with.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File Storage:&lt;/strong&gt; Data remains in Parquet files on cloud or local storage, DuckLake just replaces the metadata layer with SQL.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This design dramatically reduces the complexity of planning queries (no manifest scanning), makes commits faster, and enables features like &lt;strong&gt;cross-table consistency&lt;/strong&gt; (possible in Apache Iceberg if using the Nessie catalog).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Core Features&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SQL-Native Metadata:&lt;/strong&gt; Easy to query, debug, or extend using plain SQL.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fast Commits &amp;amp; Planning:&lt;/strong&gt; Small updates don’t require writing multiple manifest files, just SQL inserts/updates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cross-Table Atomicity:&lt;/strong&gt; Multi-table changes commit together, a unique strength.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Familiar Deployment:&lt;/strong&gt; The catalog can run on DuckDB, PostgreSQL, or any transactional SQL database.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Row-Level Changes&lt;/strong&gt;&lt;br&gt;
DuckLake handles updates and deletes via &lt;strong&gt;copy-on-write&lt;/strong&gt; on Parquet files, but the metadata transaction is nearly instantaneous. Row-level changes are coordinated by the SQL catalog, avoiding the latency and eventual consistency pitfalls of cloud storage–based logs. In effect, DuckLake behaves like Iceberg for data files but with much faster commit cycles.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ecosystem &amp;amp; Adoption&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Primary Engine:&lt;/strong&gt; DuckDB, via a DuckLake extension.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Potential Integrations:&lt;/strong&gt; Any SQL-aware engine could adopt DuckLake, since the catalog is just relational tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt; Analytics sandboxes, developer-friendly data apps, and teams seeking simplicity without deploying heavy metadata services.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As of 2025, DuckLake is still young but has sparked excitement by simplifying lakehouse architecture. It’s best seen as a complement to more mature formats, with particular appeal to DuckDB users and teams tired of managing complex metadata stacks.&lt;/p&gt;
&lt;p&gt;Next, we’ll step back and &lt;strong&gt;compare all five formats side by side&lt;/strong&gt;, looking at metadata design, row-level update strategies, ecosystem support, and adoption trends.&lt;/p&gt;
&lt;h2&gt;Comparing the Open Table Formats&lt;/h2&gt;
&lt;p&gt;Now that we’ve walked through each format individually, let’s compare them across the dimensions that matter most to data engineers and architects.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Metadata Architecture&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Iceberg:&lt;/strong&gt; Hierarchical &lt;em&gt;snapshots + manifests&lt;/em&gt;. Excellent for pruning large datasets but metadata can be complex.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta Lake:&lt;/strong&gt; Sequential &lt;em&gt;transaction log&lt;/em&gt; (&lt;code&gt;_delta_log&lt;/code&gt;). Simple and efficient for versioning, but logs can grow large without checkpoints.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hudi:&lt;/strong&gt; &lt;em&gt;Commit timeline&lt;/em&gt; with optional delta logs. Flexible but more operational overhead.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Paimon:&lt;/strong&gt; &lt;em&gt;LSM-tree style&lt;/em&gt; compaction with snapshots. Streaming-friendly and highly write-efficient.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DuckLake:&lt;/strong&gt; Metadata in a &lt;em&gt;SQL database&lt;/em&gt;. Simplifies commits and query planning, enables multi-table transactions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;2. Row-Level Changes&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Iceberg:&lt;/strong&gt; Copy-on-write by default, with &lt;em&gt;delete files&lt;/em&gt; for merge-on-read.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta Lake:&lt;/strong&gt; Copy-on-write, plus &lt;em&gt;deletion vectors&lt;/em&gt; in newer versions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hudi:&lt;/strong&gt; Dual modes: COW for read-optimized, MOR for low-latency upserts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Paimon:&lt;/strong&gt; Always merge-on-read via &lt;em&gt;LSM-tree segments&lt;/em&gt;, optimized for frequent updates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DuckLake:&lt;/strong&gt; Copy-on-write, but with faster commit cycles thanks to SQL-backed metadata.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;3. Ecosystem Support&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Iceberg:&lt;/strong&gt; Widest engine support (Spark, Flink, Trino, Presto, Hive, Snowflake, Athena, BigQuery, Dremio, DuckDB).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta Lake:&lt;/strong&gt; Deep Spark and Databricks integration; expanding connectors for Flink and Trino.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hudi:&lt;/strong&gt; Strong in Spark, Hive, Presto, and AWS (Glue, EMR). Flink support growing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Paimon:&lt;/strong&gt; Native to Flink; Spark and Trino integration improving; also ties to OLAP systems like Doris/StarRocks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DuckLake:&lt;/strong&gt; Early-stage, centered on DuckDB; potential for other SQL engines to adopt.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;4. Adoption Trends&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Iceberg:&lt;/strong&gt; Emerging as the &lt;em&gt;industry standard&lt;/em&gt; for open table formats, with broad vendor alignment.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta Lake:&lt;/strong&gt; Dominant within Databricks/Spark ecosystems; adoption tied to Databricks customers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hudi:&lt;/strong&gt; Niche but strong in CDC and near real-time use cases; proven at scale in companies like Uber.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Paimon:&lt;/strong&gt; Rising fast in the Flink/streaming community; positioned as the “streaming lakehouse” format.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DuckLake:&lt;/strong&gt; Newest entrant, appealing for simplicity and developer-friendliness; adoption still experimental.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Next, we’ll step back and examine &lt;strong&gt;industry trends&lt;/strong&gt; shaping the adoption of these formats and what they signal for the future of the lakehouse ecosystem.&lt;/p&gt;
&lt;h2&gt;Industry Trends in Table Format Adoption&lt;/h2&gt;
&lt;p&gt;The “table format wars” of the past few years are starting to settle into clear patterns of adoption. While no single format dominates every use case, the industry is coalescing around certain choices based on scale, latency, and ecosystem needs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Iceberg as the Default Standard&lt;/strong&gt;&lt;br&gt;
Iceberg has emerged as the most widely supported and vendor-neutral choice. Cloud providers like AWS, Google, and Snowflake have all added native support, and query engines like Trino, Presto, Hive, and Flink integrate with it out-of-the-box. Its Apache governance and cross-engine compatibility make it the safe long-term bet for enterprises standardizing on a single open format.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Delta Lake in the Spark/Databricks World&lt;/strong&gt;&lt;br&gt;
Delta Lake remains the default in Spark- and Databricks-heavy shops. Its simplicity (transaction logs) and seamless batch/stream integration continue to attract teams already invested in Spark. While its ecosystem is narrower than Iceberg’s, Delta Lake’s deep integration with Databricks runtime and machine learning workflows ensures strong adoption in that ecosystem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hudi in CDC and Incremental Ingestion&lt;/strong&gt;&lt;br&gt;
Hudi carved out a niche in &lt;strong&gt;change data capture (CDC)&lt;/strong&gt; and &lt;strong&gt;near real-time ingestion&lt;/strong&gt;. Telecom, fintech, and e-commerce companies still rely on Hudi for incremental pipelines, especially on AWS where Glue and EMR make it easy to deploy. While Iceberg and Delta have added incremental features, Hudi’s head start and MOR tables keep it relevant for low-latency ingestion scenarios.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Paimon and the Rise of Streaming Lakehouses&lt;/strong&gt;&lt;br&gt;
As real-time analytics demand grows, Paimon is gaining momentum in the &lt;strong&gt;Flink community&lt;/strong&gt; and among companies building streaming-first pipelines. Its LSM-tree design positions it as the go-to choice for high-velocity data, IoT streams, and CDC-heavy architectures. Although young, its momentum signals a broader shift: the next wave of lakehouse innovation is about &lt;strong&gt;sub-minute freshness&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DuckLake and Metadata Simplification&lt;/strong&gt;&lt;br&gt;
DuckLake reflects a newer trend: &lt;strong&gt;rethinking metadata management&lt;/strong&gt;. By moving metadata into SQL databases, it dramatically simplifies operations and enables cross-table transactions. Adoption is still experimental, but DuckLake has sparked interest among teams who want lakehouse features without managing complex catalogs or metastores. Its trajectory will likely influence how future formats handle metadata.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Convergence and Interoperability&lt;/strong&gt;&lt;br&gt;
One notable trend: features are converging. Iceberg now supports row-level deletes via delete files; Delta added deletion vectors; Hudi and Paimon both emphasize streaming upserts. Tooling is also evolving toward interoperability, catalog services like Apache Nessie and Polaris aim to support multiple formats, and BI engines increasingly connect to all.&lt;/p&gt;
&lt;p&gt;In short:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Iceberg&lt;/strong&gt; is becoming the industry’s lingua franca.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta&lt;/strong&gt; thrives in Databricks-first stacks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hudi&lt;/strong&gt; holds ground in CDC and incremental ingestion.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Paimon&lt;/strong&gt; is rising with real-time streaming needs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DuckLake&lt;/strong&gt; challenges conventions with SQL-backed simplicity.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Next, we’ll wrap up with &lt;strong&gt;guidance on how to choose the right format&lt;/strong&gt; based on your workloads, ecosystem, and data engineering priorities.&lt;/p&gt;
&lt;h2&gt;Choosing the Right Open Table Format&lt;/h2&gt;
&lt;p&gt;With five strong options on the table, Iceberg, Delta Lake, Hudi, Paimon, and DuckLake, the choice depends less on “which is best” and more on &lt;strong&gt;which aligns with your workloads, ecosystem, and priorities&lt;/strong&gt;. Here’s how to think about it:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When to Choose Apache Iceberg&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You want the broadest &lt;strong&gt;engine and vendor support&lt;/strong&gt; (Spark, Flink, Trino, Presto, Hive, Dremio, Snowflake, BigQuery, etc.).&lt;/li&gt;
&lt;li&gt;Your workloads are &lt;strong&gt;batch-heavy&lt;/strong&gt; and prioritize consistent snapshots, schema evolution, and large-scale analytics.&lt;/li&gt;
&lt;li&gt;You want to standardize on the &lt;strong&gt;emerging industry default&lt;/strong&gt; with the widest community and neutral Apache governance.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;When to Choose Delta Lake&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your data stack is &lt;strong&gt;Databricks-first&lt;/strong&gt; or heavily Spark-centric.&lt;/li&gt;
&lt;li&gt;You need seamless &lt;strong&gt;batch + streaming unification&lt;/strong&gt; with Spark Structured Streaming.&lt;/li&gt;
&lt;li&gt;You value Databricks’ ecosystem of optimizations (e.g., Z-order, caching, machine learning integrations).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;When to Choose Apache Hudi&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You need &lt;strong&gt;frequent upserts and deletes&lt;/strong&gt; on data lakes.&lt;/li&gt;
&lt;li&gt;Your pipelines depend on &lt;strong&gt;incremental consumption&lt;/strong&gt; of data (only new/changed rows since the last commit).&lt;/li&gt;
&lt;li&gt;You want a proven option for &lt;strong&gt;CDC ingestion&lt;/strong&gt; and near real-time pipelines, especially on &lt;strong&gt;AWS Glue/EMR&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;When to Choose Apache Paimon&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your workloads are &lt;strong&gt;streaming-first&lt;/strong&gt;, with high-velocity CDC or IoT data.&lt;/li&gt;
&lt;li&gt;You want to unify &lt;strong&gt;real-time and batch&lt;/strong&gt; processing within the same table.&lt;/li&gt;
&lt;li&gt;You’re already invested in &lt;strong&gt;Apache Flink&lt;/strong&gt; and want a table format purpose-built for it.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;When to Choose DuckLake&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You want &lt;strong&gt;simplicity&lt;/strong&gt; in metadata management (SQL instead of JSON/Avro manifests).&lt;/li&gt;
&lt;li&gt;You’re working in &lt;strong&gt;DuckDB/MotherDuck&lt;/strong&gt; environments or need lightweight lakehouse capabilities.&lt;/li&gt;
&lt;li&gt;You value &lt;strong&gt;fast commits, easy debugging, and multi-table atomicity&lt;/strong&gt;, even if the format is newer and less battle-tested.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Final Takeaway&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Iceberg&lt;/strong&gt;: the &lt;em&gt;universal standard&lt;/em&gt; for long-term interoperability.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta&lt;/strong&gt;: the &lt;em&gt;Databricks/Spark-native&lt;/em&gt; option.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hudi&lt;/strong&gt;: the &lt;em&gt;incremental/CDC pioneer&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Paimon&lt;/strong&gt;: the &lt;em&gt;streaming-first disruptor&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DuckLake&lt;/strong&gt;: the &lt;em&gt;metadata simplifier&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;No matter which you choose, adopting an open table format is the key to turning your data lake into a true &lt;strong&gt;lakehouse&lt;/strong&gt;: reliable, flexible, and future-proof.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Open table formats are no longer niche, they’re the foundation of the modern data stack. Whether your challenge is batch analytics, real-time ingestion, or simplifying metadata, there’s a format designed to meet your needs. The smart path forward isn’t just picking one blindly, but aligning your choice with your &lt;strong&gt;data velocity, tooling ecosystem, and long-term governance strategy&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In practice, many organizations run more than one format side by side. The good news: as open standards mature, interoperability and ecosystem support are expanding, making it easier to evolve over time without locking yourself into a dead end.&lt;/p&gt;
&lt;p&gt;The lakehouse era is here, and open table formats are its backbone.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Defintive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/strong&gt;
&lt;strong&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Roll&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The 2025 &amp; 2026 Ultimate Guide to the Data Lakehouse and the Data Lakehouse Ecosystem</title><link>https://iceberglakehouse.com/posts/2025-09-2026-guide-to-data-lakehouses/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-09-2026-guide-to-data-lakehouses/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2025-09-2026-guide-to-data-lakehouses/)....</description><pubDate>Tue, 23 Sep 2025 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2025-09-2026-guide-to-data-lakehouses/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Listings&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Year-end 2025 reflections, looking ahead to 2026&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Over the past few years, data platforms have crossed a tipping point. Rigid, centralized warehouses proved great at trustworthy BI but struggled with diverse data and elastic scale. Open data lakes delivered low-cost storage and freedom of choice, yet lacked the transactional rigor and performance guarantees analytics teams rely on. In 2025, the data &lt;strong&gt;lakehouse&lt;/strong&gt; matured from an idea into an operating model: open table formats, transactional metadata, and multi-engine access over a single, governed body of data.&lt;/p&gt;
&lt;p&gt;This guide distills what changed, why it matters, and how to put it to work. We’ll start by clarifying &lt;em&gt;where warehouses shine and crack&lt;/em&gt;, &lt;em&gt;where lakes empower and swamp&lt;/em&gt;, and why older directory-based table designs (think classic Hive tables) hit scaling and consistency limits. From there, we’ll show how modern table formats, Apache Iceberg, Delta Lake, Apache Hudi, and Apache Paimon, solved those limits by tracking &lt;strong&gt;files and snapshots&lt;/strong&gt; instead of folders, enabling ACID transactions, time travel, and intelligent pruning at petabyte scale.&lt;/p&gt;
&lt;p&gt;2025 also cemented a practical reference architecture. A successful lakehouse now looks less like a monolith and more like a &lt;strong&gt;layered system&lt;/strong&gt;: cloud object storage for durability and cost, an open table format for transactions and evolution, ingestion that blends batch and streaming, a catalog for governance and discoverability, and a flexible consumption layer that serves SQL, BI, notebooks, and AI agents with consistent semantics.&lt;/p&gt;
&lt;p&gt;Why now? Three forces converged this year:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Streaming-by-default&lt;/strong&gt; workloads turned “daily batch” into “continuous micro-batch,” demanding exactly-once commits and small-file management.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI and agentic workflows&lt;/strong&gt; moved from proofs of concept to production, generating highly variable, ad-hoc queries that require low-latency acceleration without brittle hand-tuning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open interoperability&lt;/strong&gt; became table stakes, organizations want one source of truth read by many engines, not many copies of truth managed by many teams.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This guide is an accessible deep dive for Data Engineers and Data Architects. You’ll get a clear mental model of the formats, their metadata structures, and the operational playbook: compaction, snapshot expiration, partition evolution, and reflection/materialization strategies for speed at scale. We’ll also survey the ingestion and streaming ecosystem (connectors, CDC, stream processors), Python-native options for lakehouse workloads (Polars, DuckDB, DataFusion, Daft, Dask), and emerging edge patterns where inference runs close to the data.&lt;/p&gt;
&lt;p&gt;Finally, we’ll close with a curated reading list, books and long-form resources that stood out in 2025, and pragmatic guidance on choosing components in 2026 without locking yourself in. If your mandate is to deliver trustworthy, performant, and AI-ready analytics on open data, this guide is your map.&lt;/p&gt;
&lt;h2&gt;The Challenges in Modern Data Architecture&lt;/h2&gt;
&lt;p&gt;The rise of the lakehouse didn’t happen in a vacuum. It emerged as a response to the very real challenges of &lt;em&gt;yesterday’s dominant architectures&lt;/em&gt;, data warehouses and data lakes. Understanding their strengths and weaknesses sets the stage for why the lakehouse model became inevitable.&lt;/p&gt;
&lt;h3&gt;Data Warehouses: Strength in Structure, Weakness in Flexibility&lt;/h3&gt;
&lt;p&gt;Data warehouses provided the first true enterprise-scale analytics platforms. They enforced &lt;strong&gt;schema-on-write&lt;/strong&gt;, ensuring data quality and making business intelligence consistent across the organization. For years, this was invaluable: clean, curated, trusted dashboards.&lt;/p&gt;
&lt;p&gt;But the cracks widened in the 2010s and 2020s:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Rigid schemas:&lt;/strong&gt; Every change to a source system meant heavy ETL work to keep the warehouse schema in sync. New data types, JSON, images, sensor streams, didn’t fit neatly into tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;High costs:&lt;/strong&gt; Warehouses couple compute and storage. Scaling for more data or users often meant overpaying for resources you didn’t fully use.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency in data freshness:&lt;/strong&gt; The ETL pipelines that fed warehouses ran daily or hourly, leaving decision-makers working with stale data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Limited AI/ML support:&lt;/strong&gt; Warehouses excel at structured SQL queries but aren’t designed to handle the diverse, unstructured, and large-scale data needed for machine learning.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Warehouses solved consistency but at the price of agility.&lt;/p&gt;
&lt;h3&gt;Data Lakes: Flexibility Meets the “Data Swamp”&lt;/h3&gt;
&lt;p&gt;Enter data lakes. By shifting to &lt;strong&gt;schema-on-read&lt;/strong&gt;, organizations gained the freedom to store &lt;em&gt;anything&lt;/em&gt;: logs, media, documents, semi-structured JSON, raw database dumps. Storage costs plummeted thanks to cloud object stores like S3 and ADLS, and data scientists loved having raw, unmodeled data at their fingertips.&lt;/p&gt;
&lt;p&gt;But flexibility introduced new pain:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data quality issues:&lt;/strong&gt; With no enforced schema, data lakes quickly devolved into “data swamps”, vast, uncurated collections of files that few trusted.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Poor governance:&lt;/strong&gt; Security, lineage, and access controls were bolted on, often inconsistently. Teams struggled to know what data existed and whether it was safe to use.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Performance bottlenecks:&lt;/strong&gt; Query engines like Hive, Spark, or Presto had to scan massive directories of files. Without transactional guarantees, concurrent writes could corrupt datasets or leave analysts with incomplete results.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;High operational overhead:&lt;/strong&gt; Managing partitions, small files, and manual compactions became part of daily operations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Lakes solved agility but at the price of trust.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The gap was clear:&lt;/strong&gt; warehouses offered &lt;strong&gt;trust but no flexibility&lt;/strong&gt;, while lakes offered &lt;strong&gt;flexibility but no trust&lt;/strong&gt;. By the early 2020s, organizations wanted the best of both, structured reliability &lt;em&gt;and&lt;/em&gt; open flexibility, laying the foundation for the modern data lakehouse.&lt;/p&gt;
&lt;h2&gt;What Is Hive and the Challenges of Hive Tables&lt;/h2&gt;
&lt;p&gt;Before the lakehouse era, &lt;strong&gt;Apache Hive&lt;/strong&gt; was the workhorse that made large-scale data in Hadoop clusters queryable with SQL. Hive introduced the &lt;em&gt;Hive Metastore&lt;/em&gt;, which stored table definitions (schemas, partitions, and locations), enabling analysts to run SQL-like queries over files sitting in HDFS or cloud storage. It was one of the first major attempts to give a data-lake-like environment a relational feel.&lt;/p&gt;
&lt;p&gt;But Hive’s approach, tracking &lt;strong&gt;directories of files&lt;/strong&gt; as tables, brought structural limitations that became bottlenecks as datasets and expectations grew.&lt;/p&gt;
&lt;h3&gt;Directory-Centric Table Management&lt;/h3&gt;
&lt;p&gt;In Hive, each table maps to a folder, and each partition to a subfolder. Query engines scan these directories at runtime to discover files. While this worked when data volumes were modest, modern cloud object stores made directory scans painfully slow. Listing millions of files before executing a query often dominated total query time.&lt;/p&gt;
&lt;h3&gt;Lack of ACID Transactions&lt;/h3&gt;
&lt;p&gt;Hive tables were essentially &lt;strong&gt;append-only&lt;/strong&gt;. Without built-in transactions, concurrent writers risked corrupting tables, and readers could encounter partial data during an update. Later ACID extensions attempted to patch this with delta files and compaction, but these added complexity and overhead, and weren’t consistently supported across engines.&lt;/p&gt;
&lt;h3&gt;Painful Updates and Schema Evolution&lt;/h3&gt;
&lt;p&gt;Modifying data in Hive tables was inefficient:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Updates and deletes&lt;/strong&gt; required rewriting entire partitions or entire tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema changes&lt;/strong&gt; (like renaming a column) often broke downstream jobs or forced costly rewrites.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The result: rigid datasets that were expensive to maintain and slow to evolve with business needs.&lt;/p&gt;
&lt;h3&gt;The Small Files Problem&lt;/h3&gt;
&lt;p&gt;Hive ingestion pipelines, especially those running frequently, created floods of small files. Query performance degraded because engines had to open and read from thousands of tiny files. Without built-in small-file management, engineers had to implement periodic compaction jobs to maintain performance.&lt;/p&gt;
&lt;p&gt;Hive was a critical stepping stone: it proved the value of SQL on big data and inspired the metadata-driven approach all lakehouse formats now follow. But its reliance on directory-based tracking and limited support for transactional, evolving workloads ultimately constrained its ability to power the next generation of data platforms.&lt;/p&gt;
&lt;h2&gt;The Innovation of Tracking Tables by Tracking Files vs. Tracking Directories&lt;/h2&gt;
&lt;p&gt;The turning point from Hive-style tables to modern lakehouse formats came with a deceptively simple idea:&lt;br&gt;
&lt;strong&gt;stop tracking directories of files, and start tracking individual files in metadata.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;Why Directories Fall Short&lt;/h3&gt;
&lt;p&gt;Directory-based tracking (as in Hive) meant that the engine had to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;List every file in a partition directory before running a query.&lt;/li&gt;
&lt;li&gt;Infer table state from the file system at query time.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This created major problems in cloud storage, where operations like &lt;code&gt;LIST&lt;/code&gt; are slow and expensive. It also made concurrency hard, two jobs writing to the same folder could overwrite each other’s files without the catalog knowing until it was too late.&lt;/p&gt;
&lt;h3&gt;File-Level Tracking&lt;/h3&gt;
&lt;p&gt;Modern table formats introduced &lt;strong&gt;file-level manifests&lt;/strong&gt;: structured metadata that explicitly records every file that belongs to a table, along with statistics about its contents. Instead of scanning folders, engines read this compact metadata to know exactly which files to use.&lt;/p&gt;
&lt;p&gt;Benefits include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Faster planning:&lt;/strong&gt; Queries skip expensive directory listings, instead reading a few manifest files that describe thousands of data files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Atomic commits:&lt;/strong&gt; Updates create a new manifest (or snapshot) in a single operation. Readers either see the old version or the new one, never a half-written state.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema evolution:&lt;/strong&gt; Metadata can track multiple schema versions, allowing columns to be added, renamed, or dropped without rewriting entire datasets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition independence:&lt;/strong&gt; Partitioning is recorded in metadata, not folder structures, enabling &lt;em&gt;hidden partitions&lt;/em&gt; and even partition evolution over time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fine-grained deletes and upserts:&lt;/strong&gt; Since every file is individually tracked, formats can support row-level operations by marking old files as deleted and adding new ones.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Snapshots and Time Travel&lt;/h3&gt;
&lt;p&gt;By treating metadata itself as a versioned object, table formats unlocked &lt;strong&gt;time travel&lt;/strong&gt;: the ability to query data as it existed at any point in time. Each snapshot references a specific set of files, creating a complete, immutable view of the table at that moment.&lt;/p&gt;
&lt;p&gt;This shift, from directories to files, from implicit state to explicit metadata, transformed raw data lakes into reliable, database-like systems. It’s the foundation that made the lakehouse architecture possible and paved the way for the new generation of table formats.&lt;/p&gt;
&lt;h2&gt;The New Generation of Data Lake Tables: Iceberg, Delta, Hudi, and Paimon&lt;/h2&gt;
&lt;p&gt;With file-level tracking as the breakthrough, several open-source projects emerged to redefine how data lakes operate. These &lt;strong&gt;table formats&lt;/strong&gt; provide the transactional, metadata-rich foundation that transforms a raw data lake into a full-fledged lakehouse. Each project shares core principles, ACID transactions, schema evolution, and time travel, but emphasizes different strengths.&lt;/p&gt;
&lt;h3&gt;Apache Iceberg&lt;/h3&gt;
&lt;p&gt;Born at Netflix and now a top-level Apache project, &lt;strong&gt;Iceberg&lt;/strong&gt; is designed for &lt;strong&gt;engine-agnostic interoperability&lt;/strong&gt;. Its hierarchical metadata structure (table metadata → manifest lists → manifest files) allows scaling to billions of files while still enabling fast query planning.&lt;br&gt;
Key features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hidden partitioning and partition evolution.&lt;/li&gt;
&lt;li&gt;Broad engine support (Spark, Flink, Trino, Presto, Dremio, and more).&lt;/li&gt;
&lt;li&gt;Strong focus on openness through the &lt;strong&gt;REST catalog API&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Rich schema evolution, including column renames and type promotions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Iceberg has become a de facto standard for enterprises seeking an open, future-proof lakehouse.&lt;/p&gt;
&lt;h3&gt;Delta Lake&lt;/h3&gt;
&lt;p&gt;Originally created by Databricks, &lt;strong&gt;Delta Lake&lt;/strong&gt; popularized the concept of a transactional log for data lakes. It uses an &lt;strong&gt;append-only transaction log&lt;/strong&gt; (&lt;code&gt;_delta_log&lt;/code&gt;) with JSON entries and Parquet checkpoints to track file state.&lt;br&gt;
Key features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ACID transactions tightly integrated with Apache Spark.&lt;/li&gt;
&lt;li&gt;Time travel and schema evolution.&lt;/li&gt;
&lt;li&gt;Optimizations like &lt;strong&gt;Z-Ordering&lt;/strong&gt; for clustering.&lt;/li&gt;
&lt;li&gt;Deep integration with the Databricks ecosystem, though community adoption beyond Spark is growing.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Delta remains particularly strong for teams standardized on Databricks or Spark-centric workflows.&lt;/p&gt;
&lt;h3&gt;Apache Hudi&lt;/h3&gt;
&lt;p&gt;Developed at Uber, &lt;strong&gt;Hudi&lt;/strong&gt; was one of the earliest attempts to bring database-like capabilities to data lakes. It excels at &lt;strong&gt;incremental processing&lt;/strong&gt; and &lt;strong&gt;change data capture (CDC)&lt;/strong&gt;.&lt;br&gt;
Key features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Two storage modes: &lt;strong&gt;Copy-on-Write (CoW)&lt;/strong&gt; for read-optimized workloads, and &lt;strong&gt;Merge-on-Read (MoR)&lt;/strong&gt; for write-heavy, near-real-time use cases.&lt;/li&gt;
&lt;li&gt;Native upserts and deletes.&lt;/li&gt;
&lt;li&gt;Built-in indexing for record-level operations.&lt;/li&gt;
&lt;li&gt;Tight integrations with Spark, Flink, and Hive.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Hudi is especially attractive for pipelines that demand frequent updates and streaming ingestion.&lt;/p&gt;
&lt;h3&gt;Apache Paimon&lt;/h3&gt;
&lt;p&gt;A newer entrant, &lt;strong&gt;Paimon&lt;/strong&gt; (formerly Flink Table Store) emphasizes &lt;strong&gt;streaming-first lakehouse design&lt;/strong&gt;. It uses an &lt;strong&gt;LSM-tree style file organization&lt;/strong&gt; to unify batch and stream processing.&lt;br&gt;
Key features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Native CDC and incremental queries.&lt;/li&gt;
&lt;li&gt;Deep integration with Apache Flink.&lt;/li&gt;
&lt;li&gt;Snapshot isolation with continuous compaction.&lt;/li&gt;
&lt;li&gt;Growing ecosystem to support Spark, Hive, and beyond.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Paimon fills a niche where &lt;strong&gt;real-time data ingestion and analytics converge&lt;/strong&gt;, making it compelling for event-driven architectures.&lt;/p&gt;
&lt;p&gt;Together, these formats represent the evolution from &lt;strong&gt;directory-based tables&lt;/strong&gt; to &lt;strong&gt;transactional, metadata-driven lakehouse systems&lt;/strong&gt;. Each brings a unique philosophy: Iceberg for openness, Delta for Spark-native simplicity, Hudi for streaming updates, and Paimon for unified batch-stream processing. Understanding their trade-offs is critical when designing a modern data platform.&lt;/p&gt;
&lt;h2&gt;Fundamental Architecture of the Data Lakehouse&lt;/h2&gt;
&lt;p&gt;At its core, the &lt;strong&gt;data lakehouse&lt;/strong&gt; is not a single product but an architectural pattern. It blends the scalability and openness of data lakes with the transactional reliability and governance of data warehouses. By 2025, a consensus emerged: a lakehouse succeeds when it clearly defines &lt;strong&gt;layers&lt;/strong&gt;, each with its own role but working together as a cohesive whole.&lt;/p&gt;
&lt;h3&gt;Storage as the Foundation&lt;/h3&gt;
&lt;p&gt;The lakehouse begins with &lt;strong&gt;cloud object storage&lt;/strong&gt; (e.g., Amazon S3, Azure Data Lake Storage, Google Cloud Storage). This layer offers low-cost, durable, infinitely scalable storage for all file types. Unlike warehouses, it decouples compute from storage, multiple engines can read from the same data without duplicating it.&lt;/p&gt;
&lt;h3&gt;Metadata and Table Formats&lt;/h3&gt;
&lt;p&gt;On top of storage sits a &lt;strong&gt;table format&lt;/strong&gt;, the metadata layer that turns a set of files into a logical table. Formats like Iceberg, Delta, Hudi, and Paimon bring:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ACID transactions&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema enforcement and evolution&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition pruning and statistics&lt;/strong&gt; for efficient queries&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time travel&lt;/strong&gt; through snapshot-based metadata&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This layer is what transforms a “data swamp” into structured, queryable datasets.&lt;/p&gt;
&lt;h3&gt;Catalog and Governance&lt;/h3&gt;
&lt;p&gt;A &lt;strong&gt;catalog&lt;/strong&gt; connects metadata to the outside world. It tracks what tables exist and their locations, while enforcing access policies and governance rules. Think of it as the bridge between storage and consumption. Examples include Hive Metastore, AWS Glue, Unity Catalog, Dremio Catalog and open-source options like Nessie or Apache Polaris.&lt;/p&gt;
&lt;h3&gt;Compute and Federation&lt;/h3&gt;
&lt;p&gt;Query engines like &lt;strong&gt;Dremio, Trino, Spark, and Flink&lt;/strong&gt; sit on top, accessing tables via the catalog. These engines provide federation, joining and querying data from multiple systems, and execute transformations, BI queries, or machine learning pipelines. The lakehouse architecture allows multiple engines to share the same data without conflict.&lt;/p&gt;
&lt;h3&gt;Consumption and Semantics&lt;/h3&gt;
&lt;p&gt;Finally, end users connect through &lt;strong&gt;BI dashboards, notebooks, or AI systems&lt;/strong&gt;. A semantic layer often sits here, defining consistent metrics and business concepts across tools. This ensures a “single version of truth” for everyone consuming data.&lt;/p&gt;
&lt;p&gt;This layered design, storage, table format, catalog, compute, and consumption, has become the reference architecture for the modern data lakehouse. It solves the warehouse vs. lake tradeoff by delivering &lt;strong&gt;flexibility, trust, and performance in one unified stack&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Metadata Structures Across Modern Table Formats&lt;/h2&gt;
&lt;p&gt;While all lakehouse table formats share the principle of tracking files rather than directories, each one implements its own &lt;strong&gt;metadata architecture&lt;/strong&gt;. Understanding these differences is crucial for choosing the right format and for operating them at scale.&lt;/p&gt;
&lt;h3&gt;Apache Iceberg&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Snapshots:&lt;/strong&gt; Every commit creates a new snapshot that references a set of files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manifests:&lt;/strong&gt; Each snapshot points to &lt;em&gt;manifest lists&lt;/em&gt;, which then point to &lt;em&gt;manifest files&lt;/em&gt;. These manifest files contain the actual list of data files, along with stats like min/max values for columns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Table Metadata File:&lt;/strong&gt; A JSON/Avro file storing schema versions, partition specs, snapshot history, and pointers to the current snapshot.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Hierarchical design scales to billions of files, supports hidden partitioning, and makes time travel lightweight.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Delta Lake&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Transaction Log:&lt;/strong&gt; All operations are recorded in an append-only &lt;code&gt;_delta_log&lt;/code&gt; directory as JSON files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Checkpoints:&lt;/strong&gt; Periodically, the log is compacted into Parquet checkpoint files for faster reads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Table State:&lt;/strong&gt; Current table state is reconstructed by combining the latest checkpoint with newer JSON entries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Simple, linear log model tightly integrated with Spark; efficient for workloads within the Databricks ecosystem.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Apache Hudi&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Timeline:&lt;/strong&gt; A series of commit, deltacommit, and compaction files in a &lt;code&gt;.hoodie&lt;/code&gt; directory describe changes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage Modes:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Copy-on-Write (CoW):&lt;/em&gt; rewrites files on update.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Merge-on-Read (MoR):&lt;/em&gt; writes delta logs and later compacts them with base files.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Indexing:&lt;/strong&gt; Optional record-level indexes accelerate upserts and deletes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Optimized for streaming ingestion and CDC use cases, with incremental pull queries.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Apache Paimon&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;LSM-Tree Inspired:&lt;/strong&gt; Uses log segments and compaction levels, optimized for high-frequency updates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snapshots:&lt;/strong&gt; Metadata tracks current file sets and supports branching for consistent queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Changelog Streams:&lt;/strong&gt; Natively emits row-level changes for downstream streaming consumers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Built for unified batch and streaming, with strong Flink integration.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;In summary:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Iceberg emphasizes &lt;strong&gt;scalability and cross-engine interoperability&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Delta Lake focuses on &lt;strong&gt;simplicity and Spark-native performance&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Hudi delivers &lt;strong&gt;real-time upserts and incremental views&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Paimon pioneers &lt;strong&gt;streaming-first design with changelogs&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These metadata designs reflect each project’s philosophy, and they form the backbone of how the modern lakehouse balances flexibility, consistency, and speed.&lt;/p&gt;
&lt;h2&gt;Implementing a Lakehouse: The Five Core Layers&lt;/h2&gt;
&lt;p&gt;Designing a modern lakehouse isn’t about choosing a single tool, it’s about assembling the right components across &lt;strong&gt;five architectural layers&lt;/strong&gt;. Each layer has its own responsibilities, and together they create a system that is scalable, governed, and usable for analytics and AI.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Deep Dive into the 5 layers is the core of the book &amp;quot;Architecting an Apache Iceberg Lakehouse&amp;quot;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;1. Storage Layer&lt;/h3&gt;
&lt;p&gt;This is the foundation: &lt;strong&gt;low-cost, durable storage&lt;/strong&gt; capable of holding structured, semi-structured, and unstructured data.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Common choices: Amazon S3, Azure Data Lake Storage, Google Cloud Storage, or on-prem HDFS/MinIO.&lt;/li&gt;
&lt;li&gt;Data is stored in open formats such as Parquet, ORC, or Avro.&lt;/li&gt;
&lt;li&gt;Separation of storage from compute allows multiple engines to share the same data without duplication.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. Table Format Layer&lt;/h3&gt;
&lt;p&gt;Here, the &lt;strong&gt;metadata format&lt;/strong&gt; gives structure and reliability to raw files.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Options: Apache Iceberg, Delta Lake, Apache Hudi, Apache Paimon.&lt;/li&gt;
&lt;li&gt;Capabilities include ACID transactions, schema evolution, partition pruning, and time travel.&lt;/li&gt;
&lt;li&gt;This layer transforms the lake from a “data swamp” into a transactional system of record.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Ingestion Layer&lt;/h3&gt;
&lt;p&gt;The ingestion layer handles &lt;strong&gt;data movement into the lakehouse&lt;/strong&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Batch ingestion:&lt;/strong&gt; Tools like Fivetran, Airbyte, Estuary, Hevo or custom ETL jobs land data periodically.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streaming ingestion:&lt;/strong&gt; Systems like Confluent, Aiven, StreamNative, RisingWave, Kafka, Redpanda, Pulsar, or Flink push events into table formats in near real-time.&lt;/li&gt;
&lt;li&gt;Goal: balance freshness, cost, and reliability while avoiding problems like excessive small files.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Catalog &amp;amp; Governance Layer&lt;/h3&gt;
&lt;p&gt;The catalog is the &lt;strong&gt;central registry&lt;/strong&gt; of your tables, schemas, and access rules.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Examples: Hive Metastore, AWS Glue, Unity Catalog, Dremio Catalog, open-source catalogs like Nessie or Apache Polaris.&lt;/li&gt;
&lt;li&gt;Responsibilities: discovery, schema validation, access control, lineage, and auditability.&lt;/li&gt;
&lt;li&gt;Acts as the bridge between storage and compute, ensuring data is both secure and discoverable.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. Federation &amp;amp; Consumption Layer&lt;/h3&gt;
&lt;p&gt;At the top, query engines and semantic layers make data consumable.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Query federation engines&lt;/strong&gt; like Dremio or Trino can join lakehouse tables with other sources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consumption tools&lt;/strong&gt; include BI platforms (Tableau, Power BI, Looker), notebooks, and AI agents.&lt;/li&gt;
&lt;li&gt;A semantic layer ensures consistency by defining metrics and business terms across all tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;In practice:&lt;/strong&gt; these five layers form the blueprint of every successful lakehouse. They separate concerns, storage, metadata, movement, governance, and consumption, while enabling interoperability. The result is a unified platform that scales with data growth, adapts to new workloads, and keeps analytics both flexible and trustworthy.&lt;/p&gt;
&lt;h2&gt;Lakehouse Ingestion&lt;/h2&gt;
&lt;p&gt;Once the foundational layers are in place, the next challenge is &lt;strong&gt;getting data into the lakehouse&lt;/strong&gt; efficiently and reliably. Ingestion strategies determine not only data freshness but also table health, file organization, and downstream usability.&lt;/p&gt;
&lt;h3&gt;Batch Ingestion&lt;/h3&gt;
&lt;p&gt;Batch remains the most common entry point:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ETL/ELT Services:&lt;/strong&gt; Tools like Fivetran and Airbyte extract data from SaaS applications, relational databases, and APIs, then land it in cloud object storage. Many now write directly into open table formats (Iceberg, Delta, Hudi) rather than dumping raw CSVs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom Jobs:&lt;/strong&gt; Python, Spark, or dbt pipelines often transform and load data on schedules, nightly, hourly, or in micro-batches.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Advantages:&lt;/strong&gt; Predictable loads, simpler monitoring, and often easier cost control.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Challenges:&lt;/strong&gt; Data freshness is limited by schedule, and frequent batches can generate lots of small files if not managed.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Streaming Ingestion&lt;/h3&gt;
&lt;p&gt;Real-time data is no longer a luxury, it’s an expectation in 2026:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Event Streams:&lt;/strong&gt; Platforms like Apache Kafka, Redpanda, Aiven, Confluent, RisingWave, StreamNative, and Apache Pulsar capture streams of events (e.g., clickstream, IoT data) and push them into the lakehouse using connectors or stream processors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CDC Pipelines:&lt;/strong&gt; Change data capture tools (Debezium, Estuary Flow) replicate updates from operational databases into Iceberg or Delta tables with low latency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stream Processing Engines:&lt;/strong&gt; Apache Flink and Spark Structured Streaming can apply transformations inline, then commit results directly to lakehouse tables.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Small File Management&lt;/h3&gt;
&lt;p&gt;One critical concern in ingestion is avoiding a &lt;strong&gt;small files problem&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each tiny file adds overhead to query planning.&lt;/li&gt;
&lt;li&gt;Solutions include writer-side batching, file-size thresholds, and downstream compaction jobs.&lt;/li&gt;
&lt;li&gt;Modern ingestion platforms often integrate with the table format’s APIs to commit larger, optimized files.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Reliability and Governance&lt;/h3&gt;
&lt;p&gt;Ingestion isn’t just about moving bytes, it’s about ensuring &lt;strong&gt;trustworthy pipelines&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Idempotency:&lt;/strong&gt; Re-runs shouldn’t create duplicates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema Drift Handling:&lt;/strong&gt; New source columns should be gracefully added to lakehouse tables with metadata updates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitoring:&lt;/strong&gt; Data observability platforms (Monte Carlo, Bigeye) can alert when loads fail or data volumes deviate unexpectedly.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In short, ingestion is where &lt;strong&gt;data quality meets data freshness&lt;/strong&gt;. A strong strategy combines &lt;strong&gt;batch tools for breadth&lt;/strong&gt; (ingesting from many SaaS and DB sources) with &lt;strong&gt;streaming pipelines for depth&lt;/strong&gt; (real-time operational data), all while keeping file sizes healthy and metadata consistent.&lt;/p&gt;
&lt;h2&gt;Lakehouse Streaming&lt;/h2&gt;
&lt;p&gt;If 2025 was the year of “batch meets real-time,” then 2026 is the year of &lt;strong&gt;streaming-first lakehouses&lt;/strong&gt;. Instead of treating streaming as an afterthought, the modern lakehouse expects ingestion, processing, and query serving to happen continuously. This shift is powered by both table format features (incremental commits, changelogs) and by the streaming ecosystem maturing around open lakehouse standards.&lt;/p&gt;
&lt;h3&gt;Confluent&lt;/h3&gt;
&lt;p&gt;As the commercial steward of Apache Kafka, &lt;strong&gt;Confluent&lt;/strong&gt; has led in making streams and tables converge. Their &lt;strong&gt;Tableflow and Stream Designer&lt;/strong&gt; products now write directly to Iceberg and Delta Lake, providing exactly-once guarantees and seamless CDC ingestion. This reduces the need for custom Flink or Spark jobs, Kafka topics become queryable lakehouse tables in real time.&lt;/p&gt;
&lt;h3&gt;Aiven&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Aiven&lt;/strong&gt;, a managed open-source data platform provider, has expanded its Kafka, Flink, and Postgres services with native &lt;strong&gt;Iceberg integrations&lt;/strong&gt;. Their goal: give teams a turnkey way to capture events, run stream transformations, and land results directly into a governed lakehouse, without stitching together multiple vendors.&lt;/p&gt;
&lt;h3&gt;Redpanda&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Redpanda&lt;/strong&gt; brings Kafka-API-compatible streaming with higher throughput and lower latency, and in 2025 introduced &lt;strong&gt;Iceberg Topics&lt;/strong&gt;. With this feature, every topic can materialize into an Iceberg table automatically, combining log storage with table metadata. This means developers can treat the same data as both a stream and a table, depending on the workload.&lt;/p&gt;
&lt;h3&gt;StreamNative&lt;/h3&gt;
&lt;p&gt;Built around Apache Pulsar, &lt;strong&gt;StreamNative&lt;/strong&gt; pushes the lakehouse deeper into event-driven architectures. Pulsar’s tiered storage, combined with integrations for Iceberg and Delta, means historical message backlogs can be instantly queryable as tables. Their work on unifying messaging and lakehouse storage blurs the boundary between stream broker and data platform.&lt;/p&gt;
&lt;h3&gt;RisingWave&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;RisingWave&lt;/strong&gt; focuses on &lt;strong&gt;streaming databases&lt;/strong&gt;: continuously maintaining materialized views over streams. Its integration with Iceberg allows those real-time views to be published directly into the lakehouse, governed alongside batch data. This bridges operational analytics (e.g., monitoring metrics in near real time) with historical analytics in the same architecture.&lt;/p&gt;
&lt;h3&gt;Other Notables&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Materialize:&lt;/strong&gt; A streaming database that outputs real-time materialized views, often targeting data lakes and warehouses as sinks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ksqlDB:&lt;/strong&gt; Kafka-native SQL for defining streaming transformations, which can also materialize tables into downstream lakehouse storage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Flink:&lt;/strong&gt; Still the backbone of many custom streaming-to-lakehouse pipelines, powering advanced transformations before committing results to Iceberg, Hudi, or Delta.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Takeaway:&lt;/strong&gt; Streaming is no longer bolted onto the lakehouse, it is &lt;strong&gt;embedded&lt;/strong&gt;. Whether through Kafka, Redpanda, Pulsar, Flink, or streaming databases like RisingWave and Materialize, streams now flow directly into transactional tables. The result is a lakehouse where batch and real-time are not two separate worlds but a single, unified system delivering always-fresh data.&lt;/p&gt;
&lt;h2&gt;Lakehouse Catalogs: Architecture, Compatibility &amp;amp; When to Use Which&lt;/h2&gt;
&lt;p&gt;A &lt;strong&gt;lakehouse catalog&lt;/strong&gt; is the control plane for your open tables, it tracks metadata locations, permissions, and exposes standard APIs to every engine. Below is a concise, practitioner-focused map of today’s major options and how they fit into a multi-engine, multi-cloud lakehouse.&lt;/p&gt;
&lt;h3&gt;Apache Polaris (Incubating)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; An open-source, fully featured &lt;strong&gt;Apache Iceberg REST&lt;/strong&gt; catalog designed for vendor-neutral, multi-engine interoperability. Backed by multiple vendors and born from a cross-industry push to standardize the Iceberg catalog layer.&lt;br&gt;
&lt;strong&gt;Where it shines:&lt;/strong&gt; Teams standardizing on &lt;strong&gt;Iceberg&lt;/strong&gt; who want a portable, community-governed catalog that Spark, Flink, Trino, Dremio, StarRocks/Doris can all use via the Iceberg REST API.&lt;br&gt;
&lt;strong&gt;Notable:&lt;/strong&gt; Open governance and REST-by-default avoid lock-in and simplify multi-engine access. Also has the feature to federate other catalogs and soon other table sources.&lt;/p&gt;
&lt;h3&gt;Apache Gravitino (Incubating)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; A &lt;strong&gt;geo-distributed, federated metadata lake&lt;/strong&gt; that manages metadata &lt;strong&gt;in place&lt;/strong&gt; across heterogeneous sources (file stores, RDBMS, streams) and exposes a unified view to engines like Spark/Trino/Flink.
&lt;strong&gt;Where it shines:&lt;/strong&gt; Hybrid/multi-cloud estates with multiple catalogs and sources that need one governance and discovery layer without migrations.
&lt;strong&gt;Notable:&lt;/strong&gt; “Catalog of catalogs” approach; can present Iceberg/Hive/Paimon/Hudi catalogs under one umbrella.&lt;/p&gt;
&lt;h3&gt;AWS Glue Data Catalog (+ Lake Formation)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; AWS’s managed Hive-compatible catalog with &lt;strong&gt;first-party governance&lt;/strong&gt; (Lake Formation) and native support for &lt;strong&gt;Iceberg/Delta/Hudi&lt;/strong&gt; tables in S3, consumed by Athena, EMR, Redshift Spectrum, and Glue jobs.
&lt;strong&gt;Where it shines:&lt;/strong&gt; All-in AWS lakehouses needing centralized metadata and fine-grained access control enforced across AWS analytics services.
&lt;strong&gt;Notable:&lt;/strong&gt; Managed, integrated, and convenient, cloud-specific by design.&lt;/p&gt;
&lt;h3&gt;Microsoft OneLake Catalog (Fabric)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; The &lt;strong&gt;central catalog&lt;/strong&gt; for Microsoft Fabric’s “OneLake”, a tenant-wide, Delta-native lake with unified discovery (“Explore”) and governance (“Govern”) experiences.
&lt;strong&gt;Where it shines:&lt;/strong&gt; Fabric-centric stacks that want a single catalog for Spark, SQL, Power BI, and Real-Time Analytics over &lt;strong&gt;Delta&lt;/strong&gt; tables in ADLS/OneLake.
&lt;strong&gt;Notable:&lt;/strong&gt; Deeply integrated SaaS experience; shortcuts/mirroring help connect external sources, but it’s Azure/Fabric-scoped.&lt;/p&gt;
&lt;h3&gt;Google BigLake (Metastore + Iceberg)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; Google’s open lakehouse layer: &lt;strong&gt;BigLake Metastore&lt;/strong&gt; catalogs &lt;strong&gt;Iceberg&lt;/strong&gt; tables on GCS; BigQuery reads them natively while Spark/Flink and other engines use the &lt;strong&gt;Iceberg REST&lt;/strong&gt; interface.
&lt;strong&gt;Where it shines:&lt;/strong&gt; GCP stacks wanting warehouse-grade operations (BigQuery) over &lt;strong&gt;open Iceberg tables&lt;/strong&gt; stored in customer buckets with multi-engine access.
&lt;strong&gt;Notable:&lt;/strong&gt; Managed table maintenance and unified governance via Dataplex/BigLake; Iceberg-first approach.&lt;/p&gt;
&lt;h3&gt;Project Nessie&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; A &lt;strong&gt;Git-like, transactional catalog&lt;/strong&gt; for data lakes that adds &lt;strong&gt;branches, tags, time-travel, and cross-table commits&lt;/strong&gt; on top of Iceberg.
&lt;strong&gt;Where it shines:&lt;/strong&gt; Teams needing dev/test isolation, reproducibility, or multi-table atomic commits in an &lt;strong&gt;Iceberg&lt;/strong&gt; lakehouse.
&lt;strong&gt;Notable:&lt;/strong&gt; Works with Spark, Flink, Trino, Dremio; deploy anywhere (K8s/containers). Complements standard catalogs with versioning semantics.&lt;/p&gt;
&lt;h3&gt;Unity Catalog (Open Source)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; An &lt;strong&gt;open-sourced&lt;/strong&gt; universal catalog for data &amp;amp; AI assets with multi-format (Delta, &lt;strong&gt;Iceberg&lt;/strong&gt; via REST/UniForm, files) and multi-engine ambitions; compatible with &lt;strong&gt;Hive Metastore API&lt;/strong&gt; and &lt;strong&gt;Iceberg REST&lt;/strong&gt;.&lt;br&gt;
&lt;strong&gt;Where it shines:&lt;/strong&gt; Enterprises seeking &lt;strong&gt;broad governance&lt;/strong&gt; (tables, files, functions, ML models) and consistent policies across engines/clouds.
&lt;strong&gt;Notable:&lt;/strong&gt; Recent updates added external engine read GA and write preview for Iceberg via REST, expanding interoperability.&lt;/p&gt;
&lt;h3&gt;Lakekeeper&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; A lightweight, &lt;strong&gt;Rust-based Apache Iceberg REST catalog&lt;/strong&gt; focused on speed, security (OIDC/OPA), and simplicity; Apache-licensed.
&lt;strong&gt;Where it shines:&lt;/strong&gt; Teams wanting a small, fast &lt;strong&gt;Iceberg&lt;/strong&gt; catalog they can self-host, integrate with Trino/Spark, and plug into modern authz.
&lt;strong&gt;Notable:&lt;/strong&gt; Ecosystem-first design; good fit for DIY open lakehouses and CICD-style deployments.&lt;/p&gt;
&lt;h3&gt;Quick Guide: Picking the Right Catalog&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Open, vendor-neutral Iceberg core:&lt;/strong&gt; &lt;em&gt;Polaris&lt;/em&gt; (add &lt;em&gt;Nessie&lt;/em&gt; if you need Git-style branching &amp;amp; multi-table commits).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Federate many sources/catalogs across regions/clouds:&lt;/strong&gt; &lt;em&gt;Gravitino&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deep cloud integration:&lt;/strong&gt; &lt;em&gt;Glue/Lake Formation&lt;/em&gt; (AWS), &lt;em&gt;OneLake Catalog&lt;/em&gt; (Azure/Fabric), &lt;em&gt;BigLake Metastore&lt;/em&gt; (GCP) or Dremio Catalog (Managed Polaris Service from Dremio).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Broad data &amp;amp; AI governance (tables + files + models):&lt;/strong&gt; &lt;em&gt;Unity Catalog (OSS)&lt;/em&gt;; growing multi-engine support including Iceberg REST.&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; For multi-engine &lt;strong&gt;Iceberg&lt;/strong&gt; lakehouses, a common pattern is: &lt;strong&gt;Polaris as the primary REST catalog&lt;/strong&gt; for engines, with &lt;strong&gt;Nessie&lt;/strong&gt; layered in when you need branches/isolated environments. Cloud-native teams may still register those tables in their cloud catalogs for service-level features (e.g., Athena/BigQuery/Power BI), but keep the &lt;strong&gt;source of truth open&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Lakehouse Optimization&lt;/h2&gt;
&lt;p&gt;Once a lakehouse is in production, the focus shifts from building to &lt;strong&gt;sustaining performance and efficiency at scale&lt;/strong&gt;. Without ongoing optimization, query times creep up, storage costs balloon, and data reliability weakens. The key is to manage both &lt;strong&gt;physical data layout&lt;/strong&gt; and &lt;strong&gt;metadata growth&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;Compaction and Small File Management&lt;/h3&gt;
&lt;p&gt;Frequent batch loads and streaming pipelines often generate thousands of small Parquet or ORC files.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Query engines spend more time opening files than scanning data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Table formats support compaction actions, rewriting many small files into fewer large ones (hundreds of MBs to 1GB).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Examples:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Iceberg’s &lt;code&gt;rewriteDataFiles&lt;/code&gt; action merges small files efficiently (Dremio also has an &lt;code&gt;OPTIMIZE&lt;/code&gt; command for Iceberg tables).&lt;/li&gt;
&lt;li&gt;Delta Lake offers the &lt;code&gt;OPTIMIZE&lt;/code&gt; command (with Z-Ordering for clustering).&lt;/li&gt;
&lt;li&gt;Hudi provides asynchronous background compaction for MoR tables.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Snapshot Expiration and Metadata Cleanup&lt;/h3&gt;
&lt;p&gt;Modern formats keep snapshots for time travel, but unchecked, these create &lt;strong&gt;metadata bloat&lt;/strong&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Iceberg:&lt;/strong&gt; &lt;code&gt;expireSnapshots&lt;/code&gt; safely removes old snapshots and associated data files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta Lake:&lt;/strong&gt; &lt;code&gt;VACUUM&lt;/code&gt; cleans up unreferenced files after a retention period.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hudi:&lt;/strong&gt; Timeline service supports configurable retention of commits and delta logs.&lt;br&gt;
Regular cleanup keeps both storage and query planning efficient.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Partitioning and Clustering&lt;/h3&gt;
&lt;p&gt;Good partition design reduces data scanned per query.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Iceberg:&lt;/strong&gt; Hidden partitions abstract complexity away from end users.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta:&lt;/strong&gt; Z-Ordering clusters data across dimensions for multidimensional pruning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hudi:&lt;/strong&gt; Can cluster records within files to optimize MoR query performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Partition evolution, changing partition strategy over time without breaking old data, is now supported by most formats and prevents schema rigidity.&lt;/p&gt;
&lt;h3&gt;Query Acceleration&lt;/h3&gt;
&lt;p&gt;Beyond storage optimization, &lt;strong&gt;query acceleration&lt;/strong&gt; techniques deliver speed at scale.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio&apos;s Reflections and materialized views&lt;/strong&gt; in platforms like Dremio provide always-fresh, cache-like performance boosts without manual tuning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Column stats and bloom filters&lt;/strong&gt; stored in metadata allow engines to skip files entirely when filters exclude them.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vectorized execution and Arrow-based memory models&lt;/strong&gt; reduce CPU costs across query engines.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Format-Specific Optimizations&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Iceberg:&lt;/strong&gt; Manifest merging, hidden partitioning, and metadata caching.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta:&lt;/strong&gt; Frequent checkpoints and file skipping using data skipping indexes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hudi:&lt;/strong&gt; Incremental queries for consuming only new changes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Paimon:&lt;/strong&gt; Continuous compaction to reconcile streaming write amplification.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; Lakehouse optimization is not a one-off task, it’s an ongoing discipline. By managing file sizes, pruning metadata, evolving partitions, and using acceleration features, organizations keep performance predictable and costs controlled, even as data volumes and workloads scale into 2026.&lt;/p&gt;
&lt;h2&gt;The Intelligent Data Lakehouse Built for Agentic AI with Dremio&lt;/h2&gt;
&lt;p&gt;By the end of 2025, the conversation about data platforms shifted from “how do we manage data?” to “how do we make data &lt;strong&gt;intelligent and AI-ready&lt;/strong&gt;?” This is where the &lt;strong&gt;intelligent lakehouse&lt;/strong&gt; comes in, and where Dremio stands out as the reference implementation.&lt;/p&gt;
&lt;h3&gt;From Static Analytics to Agentic Workloads&lt;/h3&gt;
&lt;p&gt;Traditional BI queries are predictable: weekly reports, dashboards, and KPIs. AI-driven workloads are not. &lt;strong&gt;Agentic AI systems&lt;/strong&gt;, large language models and autonomous agents, generate dynamic, ad-hoc queries that span datasets in unpredictable ways. This requires:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Consistent low-latency responses.&lt;/li&gt;
&lt;li&gt;A platform that can optimize itself without human intervention.&lt;/li&gt;
&lt;li&gt;Seamless integration between structured data, semantic meaning, and AI agents.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Dremio as the Intelligent Lakehouse&lt;/h3&gt;
&lt;p&gt;Dremio is more than a query engine; it’s a &lt;strong&gt;self-optimizing lakehouse platform&lt;/strong&gt; built natively on Apache Iceberg and Arrow. Key capabilities include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reflections:&lt;/strong&gt; Always-fresh materializations that accelerate queries automatically. Unlike traditional materialized views, reflections are invisible to end users, the optimizer decides when to use them, making acceleration adaptive to changing workloads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Semantic Layer:&lt;/strong&gt; A unified place to define datasets, metrics, and business concepts. This ensures that whether it’s an analyst writing SQL or an AI agent generating queries, results remain consistent and governed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI-Ready APIs:&lt;/strong&gt; Through Arrow Flight and REST endpoints, Dremio streams data directly into Python, notebooks, or AI frameworks with zero-copy efficiency. This bridges the gap between analytics and machine learning pipelines.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open Standards:&lt;/strong&gt; By embracing Iceberg, Polaris (for catalogs), and Arrow, Dremio ensures interoperability, your AI agents or external engines can interact with the same governed data without lock-in.&lt;/li&gt;
&lt;li&gt;All the above allow Agentic AI applications connecting to Dremio through Dremio&apos;s MCP server successful in enabling Agentic Analytics.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Why This Matters for Agentic AI&lt;/h3&gt;
&lt;p&gt;AI agents thrive on &lt;strong&gt;autonomy and adaptability&lt;/strong&gt;. They need a platform that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Handles &lt;strong&gt;ever-changing queries&lt;/strong&gt; without brittle pre-optimizations.&lt;/li&gt;
&lt;li&gt;Keeps acceleration aligned with shifting patterns (autonomous reflections).&lt;/li&gt;
&lt;li&gt;Provides &lt;strong&gt;governed access&lt;/strong&gt; so that AI doesn’t hallucinate unauthorized or inconsistent definitions of metrics.&lt;/li&gt;
&lt;li&gt;Scales seamlessly from small exploratory prompts to massive training-data extractions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;In short:&lt;/strong&gt; Dremio delivers the intelligent lakehouse, a platform that not only stores and serves data but actively &lt;strong&gt;adapts to how humans and AI consume it&lt;/strong&gt;. As agentic AI moves from hype to everyday practice in 2026, this intelligence layer will be the key to transforming raw data into reliable, actionable, and AI-ready insights.&lt;/p&gt;
&lt;h2&gt;Python for the Lakehouse&lt;/h2&gt;
&lt;p&gt;Python has become the lingua franca of modern data engineering and data science, and the lakehouse ecosystem is no exception. By 2026, a rich set of Python-first tools and frameworks have emerged that make it easier to ingest, process, analyze, and serve data directly from open table formats like Apache Iceberg, Delta, and Hudi. These tools not only enable lightweight experimentation but also power production-grade pipelines that rival traditional big data stacks.&lt;/p&gt;
&lt;h3&gt;DuckDB&lt;/h3&gt;
&lt;p&gt;Often described as the “SQLite for analytics,” &lt;strong&gt;DuckDB&lt;/strong&gt; is an in-process analytical database that excels at local workloads:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Direct Parquet &amp;amp; Iceberg Reads:&lt;/strong&gt; DuckDB can query Parquet files and integrate with Iceberg catalogs, making it a natural fit for small-to-medium lakehouse use cases.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Speed:&lt;/strong&gt; Its vectorized execution engine makes it extremely fast for analytical queries on a single machine.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python Integration:&lt;/strong&gt; Native bindings allow seamless use within notebooks or Python apps.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;DuckDB has become the go-to for prototyping, ad hoc exploration, and embedding analytics directly into applications.&lt;/p&gt;
&lt;h3&gt;Dask&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Dask&lt;/strong&gt; is a parallel computing framework for Python that scales workflows from laptops to clusters.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Flexible API:&lt;/strong&gt; Works with familiar NumPy, pandas, and scikit-learn APIs while distributing workloads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lakehouse Integration:&lt;/strong&gt; Reads and writes Parquet, and combined with Iceberg connectors, it enables scalable transformations on lakehouse data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ecosystem Fit:&lt;/strong&gt; Useful for machine learning preprocessing and large-scale data transformations where Spark might be overkill.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dask democratizes distributed compute for teams already invested in Python.&lt;/p&gt;
&lt;h3&gt;Daft&lt;/h3&gt;
&lt;p&gt;A newer entrant, &lt;strong&gt;Daft&lt;/strong&gt; positions itself as a distributed data processing engine optimized for AI and ML workloads.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Arrow-Native:&lt;/strong&gt; Built on Apache Arrow for fast columnar in-memory processing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flexible Backends:&lt;/strong&gt; Runs locally or on clusters, supporting both CPUs and GPUs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lakehouse Ready:&lt;/strong&gt; Reads directly from Parquet and Iceberg sources, enabling high-performance pipelines that integrate analytics and ML training.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Daft is gaining traction for teams that want a modern, Pythonic alternative to Spark for big data and AI-centric workflows.&lt;/p&gt;
&lt;h3&gt;Bauplan&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Bauplan Labs&lt;/strong&gt; brings a &lt;em&gt;serverless, Python-first lakehouse&lt;/em&gt; approach.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pipeline-as-Code:&lt;/strong&gt; Data pipelines are written in Python and executed in a serverless runtime that scales automatically.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Version Control for Data:&lt;/strong&gt; Bauplan integrates Iceberg tables with Git-like branching via catalogs like Nessie, making schema and data versioning first-class features.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Developer Experience:&lt;/strong&gt; With Arrow under the hood, Bauplan emphasizes reproducibility, modular pipelines, and minimal infrastructure overhead.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Bauplan is designed for teams that want the power of the lakehouse without the complexity of managing heavy infrastructure.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;In practice:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;DuckDB&lt;/strong&gt; is the Swiss Army knife for local analytics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dask&lt;/strong&gt; scales familiar Python workflows across clusters.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Daft&lt;/strong&gt; brings Arrow-native distributed compute optimized for AI.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bauplan&lt;/strong&gt; simplifies pipeline execution with a serverless lakehouse model.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Together, these tools give Python along other libraries like Polaris, Ibis, SQLFrame and others developers an end-to-end toolkit for building, maintaining, and consuming modern data lakehouses.&lt;/p&gt;
&lt;h2&gt;Graphs in the Data Lakehouse with PuppyGraph&lt;/h2&gt;
&lt;p&gt;While the data lakehouse excels at tabular and relational analytics, many real-world problems are &lt;strong&gt;graph-shaped&lt;/strong&gt;: fraud rings, identity networks, supply chains, lineage tracking, and recommendation systems. Traditionally, these problems required loading data into a &lt;strong&gt;specialized graph database&lt;/strong&gt;, an extra layer of ETL and storage that added cost and complexity. &lt;strong&gt;PuppyGraph&lt;/strong&gt; changes this equation by bringing graph analytics &lt;strong&gt;directly into the lakehouse.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;What is PuppyGraph?&lt;/h3&gt;
&lt;p&gt;PuppyGraph is a &lt;strong&gt;cloud-native graph engine&lt;/strong&gt; designed to run on top of existing data in your lakehouse. Instead of requiring a proprietary graph database, PuppyGraph lets you &lt;strong&gt;query your Iceberg, Delta, Hudi, or Hive tables as a graph&lt;/strong&gt;. It connects directly to open table formats, relational databases, and warehouses, automatically sharding and scaling queries without duplicating data. This means you can turn your existing datasets into a &lt;strong&gt;graph model in minutes&lt;/strong&gt;, with no ETL.&lt;/p&gt;
&lt;h3&gt;Integration with the Lakehouse&lt;/h3&gt;
&lt;p&gt;PuppyGraph integrates seamlessly with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Apache Iceberg&lt;/strong&gt; (including REST catalogs like Tabular or Polaris)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta Lake and Apache Hudi&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hive Metastore and AWS Glue&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Databases and Warehouses&lt;/strong&gt; such as PostgreSQL, MySQL, Redshift, BigQuery, and DuckDB&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each source is treated as a &lt;em&gt;catalog&lt;/em&gt;. PuppyGraph lets you define a &lt;strong&gt;graph schema&lt;/strong&gt; across one or many catalogs, effectively federating multiple data sources into a single graph. For example, you can link customer nodes in PostgreSQL with transaction edges in Iceberg, &lt;strong&gt;all without moving the data.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;Querying Graphs at Scale&lt;/h3&gt;
&lt;p&gt;Because PuppyGraph queries &lt;strong&gt;directly against Parquet-backed tables&lt;/strong&gt;, you can run multi-hop traversals and graph algorithms over your lakehouse data. It supports popular graph query languages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Gremlin (Apache TinkerPop)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;openCypher&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This ensures compatibility with existing graph tooling and reduces the learning curve. Performance is optimized for &lt;strong&gt;large, complex traversals&lt;/strong&gt;: PuppyGraph has demonstrated &lt;strong&gt;6-hop traversals over hundreds of millions of edges in under a second&lt;/strong&gt;. Cached mode allows even faster repeated queries, often surpassing the performance of traditional graph databases.&lt;/p&gt;
&lt;h3&gt;Use Cases&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Fraud Detection:&lt;/strong&gt; Traverse transaction graphs in real time to uncover hidden fraud rings.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cybersecurity:&lt;/strong&gt; Model logins, access patterns, and network flows as a graph to detect threats.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Supply Chain Optimization:&lt;/strong&gt; Connect suppliers, shipments, and logistics into a graph for bottleneck analysis.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Customer 360:&lt;/strong&gt; Combine relational and behavioral data into a graph to better understand customer journeys.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Graph + AI:&lt;/strong&gt; PuppyGraph supports &lt;strong&gt;Graph RAG (Retrieval Augmented Generation)&lt;/strong&gt;, enabling LLMs and agents to query structured relationships for better context and reasoning.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Why It Matters&lt;/h3&gt;
&lt;p&gt;By &lt;strong&gt;plugging directly into the lakehouse&lt;/strong&gt;, PuppyGraph removes the wall between tabular and graph analytics. Data engineers and architects can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Avoid &lt;strong&gt;data duplication and ETL pipelines&lt;/strong&gt; into separate graph stores.&lt;/li&gt;
&lt;li&gt;Keep governance and security consistent via existing catalogs.&lt;/li&gt;
&lt;li&gt;Support &lt;strong&gt;SQL, BI, and graph queries side-by-side&lt;/strong&gt; on the same data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In essence, PuppyGraph makes the lakehouse not just the foundation for relational analytics and AI, but also a &lt;strong&gt;native home for graph workloads&lt;/strong&gt;, all with the same open formats and scalable storage.&lt;/p&gt;
&lt;h2&gt;Edge Inference for the Lakehouse: Spice AI&lt;/h2&gt;
&lt;p&gt;As organizations embrace AI-driven applications, the &lt;strong&gt;edge&lt;/strong&gt; has become a critical deployment target. Instead of sending all data to centralized clusters, inference can increasingly happen &lt;strong&gt;close to where data is generated&lt;/strong&gt;, IoT devices, factories, mobile applications, or regional data centers. The lakehouse, traditionally viewed as a central hub, is now extending outward. Platforms like &lt;strong&gt;Spice AI&lt;/strong&gt; make this possible.&lt;/p&gt;
&lt;h3&gt;Why Edge Inference Matters&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Inference needs to happen in milliseconds, not seconds. Shipping every query to the cloud adds unacceptable delays for use cases like predictive maintenance or fraud detection.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost Efficiency:&lt;/strong&gt; Processing locally reduces bandwidth and cloud compute costs, especially when dealing with high-volume sensor or event data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Resilience:&lt;/strong&gt; Edge inference continues to function even with intermittent network connectivity, syncing back to the lakehouse when available.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Privacy &amp;amp; Compliance:&lt;/strong&gt; Processing data locally helps meet regulatory requirements by minimizing the movement of sensitive information.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Spice AI at the Edge&lt;/h3&gt;
&lt;p&gt;Spice AI positions itself as an &lt;strong&gt;operational data lakehouse&lt;/strong&gt; tailored for real-time and AI workloads. At the edge, this means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Federated Querying with DataFusion:&lt;/strong&gt; Spice uses the Rust-based DataFusion engine (part of the Arrow ecosystem) to execute high-performance queries locally. This allows lightweight nodes to join, filter, and aggregate data directly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vector + Relational Search:&lt;/strong&gt; Spice combines vector search (for embeddings) with SQL-style queries. This means an edge application can run both semantic AI lookups and structured analytics in one step.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lightweight Runtimes:&lt;/strong&gt; Spice can run in containers or edge environments, consuming a small footprint while still supporting open table formats like Iceberg, Delta, and Hudi.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hybrid Sync:&lt;/strong&gt; Results and inferences can be materialized locally, then synchronized back to the central lakehouse when connectivity is restored, ensuring global consistency without sacrificing local responsiveness.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Example Use Cases&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Manufacturing IoT:&lt;/strong&gt; Edge devices monitor sensor streams, detect anomalies with on-device inference, and sync flagged events to the lakehouse for broader analysis.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retail:&lt;/strong&gt; In-store applications recommend products in real time based on customer behavior while syncing aggregated insights centrally.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Telecom/5G:&lt;/strong&gt; Local edge inference supports real-time network optimization while global models are trained and governed in the lakehouse.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;In summary:&lt;/strong&gt; Edge inference extends the reach of the lakehouse from the cloud to the edge, enabling AI applications to be both &lt;strong&gt;real-time and governed&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;DuckLake: Simplifying Lakehouse Metadata with SQL&lt;/h2&gt;
&lt;p&gt;While formats like Iceberg, Delta, and Hudi advanced the lakehouse by bringing ACID transactions to data lakes, they also introduced operational complexity: JSON manifests, Avro metadata files, separate catalog services, and eventual consistency challenges. &lt;strong&gt;DuckLake&lt;/strong&gt; takes a fresh approach by asking a simple question: &lt;em&gt;what if the entire metadata layer was just stored in a relational database?&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;What is DuckLake?&lt;/h3&gt;
&lt;p&gt;DuckLake is a new open table format developed by the DuckDB team. Its core idea is to &lt;strong&gt;move all catalog and table metadata into a SQL database&lt;/strong&gt;, while keeping table data as Parquet files in object storage or local filesystems. This means no manifest lists, no Hive Metastore, and no extra catalog API services - just SQL tables that track schemas, snapshots, and file pointers.&lt;/p&gt;
&lt;h3&gt;Architecture&lt;/h3&gt;
&lt;p&gt;DuckLake splits the lakehouse into two layers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Catalog Database:&lt;/strong&gt; Any ACID-compliant database (DuckDB, SQLite, Postgres, MySQL, or even MotherDuck) stores all metadata - schemas, table versions, statistics, and transactions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Layer:&lt;/strong&gt; Standard Parquet files (and optional delete files) stored in directories or S3 buckets hold the actual table data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This design yields &lt;strong&gt;fast commits&lt;/strong&gt; (a single SQL transaction to update metadata), &lt;strong&gt;strong consistency&lt;/strong&gt; (no reliance on eventually consistent file stores), and &lt;strong&gt;simpler operations&lt;/strong&gt; (just back up or replicate the metadata DB). It also enables advanced features like &lt;strong&gt;multi-table transactions&lt;/strong&gt;, &lt;strong&gt;time travel&lt;/strong&gt;, and &lt;strong&gt;transactional schema changes&lt;/strong&gt; without a complex stack.&lt;/p&gt;
&lt;h3&gt;Integration with DuckDB&lt;/h3&gt;
&lt;p&gt;DuckLake ships as a DuckDB extension. Once installed, users can:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSTALL ducklake;
LOAD ducklake;
ATTACH &apos;ducklake:mycatalog.ducklake&apos; AS lakehouse;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;From there, you can create tables, insert, update, delete, and query with full ACID guarantees. Multiple DuckDB instances can share the same DuckLake if the catalog is in a multi-user database like Postgres, effectively making DuckDB “multiplayer” with a shared lakehouse.&lt;/p&gt;
&lt;h3&gt;Interoperability&lt;/h3&gt;
&lt;p&gt;DuckLake is its own format, but it’s designed to interoperate:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Iceberg:&lt;/strong&gt; Parquet and delete files are compatible, and DuckLake can import Iceberg metadata directly, even preserving snapshot history.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Delta:&lt;/strong&gt; DuckDB continues to support Delta Lake separately; data can be copied between Delta and DuckLake when needed.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Hudi:&lt;/strong&gt; Not natively supported yet, but Hudi’s Parquet files can be queried as plain Parquet.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This makes DuckLake a flexible companion in mixed-format environments and a potential bridge for migrating or experimenting.&lt;/p&gt;
&lt;h3&gt;Use Cases&lt;/h3&gt;
&lt;p&gt;Local &amp;amp; Embedded Lakehouses: Run a mini data warehouse on your laptop with DuckDB + DuckLake, no heavy services required.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Small Team Data Warehouses:&lt;/strong&gt; Share a DuckLake catalog in Postgres for concurrent analytics across a team.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Streaming &amp;amp; CDC:&lt;/strong&gt; Handle high-frequency small writes efficiently without metadata file bloat.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;CI/CD Pipelines:&lt;/strong&gt; Spin up ephemeral lakehouses in tests with time travel and rollback for validation.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Limitations &amp;amp; Roadmap&lt;/h3&gt;
&lt;p&gt;DuckLake is still young (v0.3 as of late 2025). At present:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Ecosystem support is centered on DuckDB/MotherDuck.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;No built-in fine-grained governance; relies on the underlying DB’s permissions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;No branching/merge semantics like Project Nessie, though time travel is supported..&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Books on the Data Lakehouse and Open Table Formats&lt;/h2&gt;
&lt;p&gt;For data engineers and architects who want to go beyond blogs and documentation, books provide the depth and structured learning needed to master the lakehouse paradigm. Between 2023 and early 2026, O’Reilly, Manning, and Packt have released (or announced) a range of titles that cover the architecture, theory, and practice of the data lakehouse, including the major open table formats, Apache Iceberg, Delta Lake, and Apache Hudi.&lt;/p&gt;
&lt;h3&gt;O’Reilly Media&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.oreilly.com/library/view/apache-iceberg-the/9781098148614/&quot;&gt;&lt;strong&gt;Apache Iceberg: The Definitive Guide – Data Lakehouse Functionality, Performance, and Scalability on the Data Lake&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Tomer Shiran, Jason Hughes, and Alex Merced&lt;/em&gt; (Jun 2024)&lt;br&gt;
Comprehensive deep dive into Apache Iceberg’s architecture, metadata model, features like partition evolution and time travel, and integrations across engines such as Spark, Flink, Trino, and Dremio.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.oreilly.com/library/view/apache-polaris-the/9798341608139/&quot;&gt;&lt;strong&gt;Apache Polaris: The Definitive Guide – Enriching Apache Iceberg Lakehouse with a robust open-source catalog&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Alex Merced, and Andrew Madson&lt;/em&gt; (Jun 2024)&lt;br&gt;
Revolutionize your understanding of modern data management with Apache Polaris (incubating), the open source catalog designed for data lakehouse industry standard Apache Iceberg. This comprehensive guide takes you on a journey through the intricacies of Apache Iceberg data lakehouses, highlighting the pivotal role of Iceberg catalogs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.oreilly.com/library/view/delta-lake-up/9781098139711/&quot;&gt;&lt;strong&gt;Delta Lake: Up and Running – Modern Data Lakehouse Architectures with Delta Lake&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Bennie Haelen and Dan Davis&lt;/em&gt; (Oct 2023)&lt;br&gt;
Introductory and practical guide to Delta Lake, covering ACID transactions, schema enforcement, time travel, and how to build reliable data pipelines that unify batch and streaming.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.oreilly.com/library/view/delta-lake-the/9781098151010/&quot;&gt;&lt;strong&gt;Delta Lake: The Definitive Guide – Modern Data Lakehouse Architectures with Data Lakes&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Denny Lee, Tristen Wentling, Scott Haines, and Prashanth Babu&lt;/em&gt; (Dec 2024)&lt;br&gt;
Written by core Delta Lake contributors, this book explores Delta’s transaction log, medallion architecture, deletion vectors, and advanced optimization strategies for enterprise-scale workloads.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.oreilly.com/library/view/practical-lakehouse-architecture/9781098156145/&quot;&gt;&lt;strong&gt;Practical Lakehouse Architecture: Designing and Implementing Modern Data Platforms at Scale&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Gaurav Ashok Thalpati&lt;/em&gt; (Aug 2024)&lt;br&gt;
A broad architectural guide to designing, implementing, and migrating to lakehouse platforms. Covers design layers, governance, catalogs, and security with a practical step-by-step framework.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.oreilly.com/library/view/apache-hudi-the/9781098173821/&quot;&gt;&lt;strong&gt;Apache Hudi: The Definitive Guide – Building Robust, Open, and High-Performance Lakehouses&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Shiyan Xu, Prashant Wason, Sudha Saktheeswaran, and Rebecca Bilbro&lt;/em&gt; (Forthcoming Dec 2025)&lt;br&gt;
Focuses on Hudi’s approach to incremental processing, upserts/deletes, clustering, and indexing. Demonstrates how to run production-ready lakehouses with streaming data ingestion.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Manning Publications&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;&lt;strong&gt;Architecting an Apache Iceberg Lakehouse&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Alex Merced&lt;/em&gt; (MEAP, 2025 – Forthcoming 2026)&lt;br&gt;
A hands-on, architecture-first guide to designing scalable Iceberg-based lakehouses. Covers all five layers (storage, table formats, ingestion, catalog, consumption) with exercises and real-world design trade-offs.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Packt Publishing&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.packtpub.com/en-us/product/building-modern-data-applications-using-databricks-lakehouse-9781804617205&quot;&gt;&lt;strong&gt;Building Modern Data Applications Using Databricks Lakehouse&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Will Girten&lt;/em&gt; (Oct 2024)&lt;br&gt;
Practical guide to deploying end-to-end pipelines on Databricks Lakehouse using Delta Lake and Unity Catalog, including batch and streaming workflows, governance, and CI/CD.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.packtpub.com/en-us/product/engineering-lakehouses-with-open-table-formats-9781836207221&quot;&gt;&lt;strong&gt;Engineering Lakehouses with Open Table Formats&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Dipankar Mazumdar and Vinoth Govindarajan&lt;/em&gt; (Dec 2025)&lt;br&gt;
Covers Iceberg, Hudi, and Delta Lake together, focusing on how to choose between them, optimize tables, and build interoperable, vendor-agnostic architectures. Includes hands-on examples with Spark, Flink, and Trino.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.packtpub.com/en-us/product/data-engineering-with-databricks-cookbook-9781803246147&quot;&gt;&lt;strong&gt;Data Engineering with Databricks Cookbook&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Pulkit Chadha&lt;/em&gt; (May 2024)&lt;br&gt;
Recipe-based approach to building data pipelines on Databricks, with step-by-step instructions for managing Delta Lake tables, handling streaming ingestion, orchestrating workflows, and applying Unity Catalog governance.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Takeaway&lt;/h3&gt;
&lt;p&gt;Whether you’re looking for a &lt;strong&gt;deep dive into a specific format&lt;/strong&gt; (Iceberg, Delta, or Hudi) or a &lt;strong&gt;broader perspective on lakehouse architecture&lt;/strong&gt;, these titles form the essential reading list for data engineers and architects in 2026. They not only document the current state of the technology but also provide practical frameworks and best practices to implement reliable, scalable, and open lakehouses.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The journey from &lt;strong&gt;data warehouses&lt;/strong&gt; to &lt;strong&gt;data lakes&lt;/strong&gt; and finally to the &lt;strong&gt;data lakehouse&lt;/strong&gt; reflects one constant: organizations need a platform that balances &lt;strong&gt;trust, flexibility, and performance&lt;/strong&gt;. Warehouses gave us governance but lacked agility. Lakes gave us scale and freedom but sacrificed reliability. The lakehouse unites these worlds by layering open table formats, catalogs, and intelligent query engines on top of low-cost object storage.&lt;/p&gt;
&lt;p&gt;By 2025, this model matured from a promise into a proven architecture. With formats like &lt;strong&gt;Apache Iceberg, Delta Lake, Hudi, and Paimon&lt;/strong&gt;, data teams now have open standards for transactional data at scale. Streaming-first ingestion, autonomous optimization, and catalog-driven governance have become baseline requirements. Looking ahead to 2026, the lakehouse is no longer just a central repository, it extends outward to power &lt;strong&gt;real-time analytics, agentic AI, and even edge inference&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;For data engineers and architects, the message is clear:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Adopt &lt;strong&gt;open table formats&lt;/strong&gt; to avoid lock-in and ensure interoperability.&lt;/li&gt;
&lt;li&gt;Embrace a &lt;strong&gt;layered architecture&lt;/strong&gt; that separates storage, metadata, ingestion, catalog, and consumption.&lt;/li&gt;
&lt;li&gt;Optimize continuously, through compaction, snapshot expiration, and acceleration features, so performance scales with data.&lt;/li&gt;
&lt;li&gt;Prepare for the future where &lt;strong&gt;AI workloads are not occasional but constant&lt;/strong&gt;, demanding a platform that is both intelligent and adaptive.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The lakehouse has become the backbone of modern data platforms. As you step into 2026, building on this foundation isn’t just a best practice, it’s the path to delivering data that is truly &lt;strong&gt;trusted, governed, and AI-ready&lt;/strong&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Composable Analytics with Agents -  Leveraging Virtual Datasets and the Semantic Layer</title><link>https://iceberglakehouse.com/posts/2025-09-composable-analytics-with-agents/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-09-composable-analytics-with-agents/</guid><description>
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;utm_m...</description><pubDate>Wed, 17 Sep 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=semantic_layer&amp;amp;utm_content=alexmerced&amp;amp;utm_term=semantic_layer&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=semantic_layer&amp;amp;utm_content=alexmerced&amp;amp;utm_term=semantic_layer&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=semantic_layer&amp;amp;utm_content=alexmerced&amp;amp;utm_term=semantic_layer&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0?utm_source=merced&amp;amp;utm_medium=affiliate&amp;amp;utm_campaign=book_merced&amp;amp;a_aid=merced&amp;amp;a_bid=7eac4151&quot;&gt;Purchase &amp;quot;Architecting an Apache Iceberg Lakehouse&amp;quot;&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The promise of AI in analytics isn’t just faster answers, it’s &lt;strong&gt;smarter, more flexible insights&lt;/strong&gt;. For that to happen, AI agents need not only access to data but also the ability to compose, extend, and recombine datasets on the fly. This is where Dremio’s &lt;strong&gt;semantic layer&lt;/strong&gt; and &lt;strong&gt;virtual datasets&lt;/strong&gt; come into play, providing the foundation for what AtScale calls &lt;em&gt;composable analytics&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;The Challenge: Static Models in a Dynamic World&lt;/h2&gt;
&lt;p&gt;Traditional analytics models are rigid. Business intelligence teams define metrics in dashboards or cubes, and changing them often requires IT involvement. This creates bottlenecks when business needs evolve, leaving AI agents with limited flexibility to adjust their workflows.&lt;/p&gt;
&lt;p&gt;For agentic AI, which thrives on &lt;strong&gt;iterative reasoning and adaptive workflows&lt;/strong&gt;, rigid models are a barrier.&lt;/p&gt;
&lt;h2&gt;Virtual Datasets: Building Blocks for Composable Analytics&lt;/h2&gt;
&lt;p&gt;Dremio addresses this challenge with &lt;strong&gt;virtual datasets (VDSs)&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No physical copies&lt;/strong&gt;: VDSs are views defined in the semantic layer, not duplicated data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Composable&lt;/strong&gt;: VDSs can be combined, extended, or refined into new virtual models.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Governed&lt;/strong&gt;: Every dataset inherits security and lineage from the semantic layer.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Agents interacting through Dremio’s MCP server can query these VDSs directly, creating new analytic combinations without breaking governance or requiring new pipelines.&lt;/p&gt;
&lt;h2&gt;Agents + MCP: Extending Models on Demand&lt;/h2&gt;
&lt;p&gt;With MCP exposing tools like &lt;em&gt;Run SQL Query&lt;/em&gt; and &lt;em&gt;Run Semantic Search&lt;/em&gt;, agents can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Discover governed VDSs in &lt;strong&gt;plain business language&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Combine datasets to answer multi-dimensional questions.&lt;/li&gt;
&lt;li&gt;Extend existing models with new calculations or filters.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example, an agent could take a “Customer Revenue” VDS and extend it with a churn prediction metric, producing a new analytic model for marketing, all governed by Dremio’s semantic layer.&lt;/p&gt;
&lt;h2&gt;Composable Analytics Meets Composable Modeling&lt;/h2&gt;
&lt;p&gt;The AtScale community describes &lt;em&gt;composable analytics&lt;/em&gt; as the ability to assemble insights from modular building blocks. Dremio’s semantic layer aligns perfectly with this vision:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reusability&lt;/strong&gt;: Metrics and datasets defined once can be reused everywhere.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cross-functional consistency&lt;/strong&gt;: Finance, marketing, and operations share the same definitions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agent empowerment&lt;/strong&gt;: AI systems don’t just query data : they can compose new insights dynamically.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This brings composability from the human analyst’s world into the AI agent’s world.&lt;/p&gt;
&lt;h2&gt;Real-World Benefits&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Faster iteration&lt;/strong&gt;: Agents adapt models to new questions without waiting for IT.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Democratized insights&lt;/strong&gt;: Business teams get answers in language they understand, grounded in governed metrics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cross-functional alignment&lt;/strong&gt;: Everyone :  human or agent ,  works from the same semantic foundation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The result is analytics that are not only AI-ready but also &lt;strong&gt;flexible, governed, and consistent across the enterprise&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Composable analytics is the future of data-driven decision-making. By leveraging &lt;strong&gt;virtual datasets&lt;/strong&gt; and the &lt;strong&gt;semantic layer&lt;/strong&gt;, Dremio makes it possible for both humans and AI agents to build and extend insights in real time.&lt;/p&gt;
&lt;p&gt;With MCP providing the bridge and the semantic layer ensuring governance, enterprises can embrace a world where &lt;strong&gt;analytics are adaptive, modular, and truly agentic&lt;/strong&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The Endgame  – Building an Autonomous Optimization Pipeline for Apache Iceberg</title><link>https://iceberglakehouse.com/posts/iceberg-autonomous-optimization-pipeline/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-autonomous-optimization-pipeline/</guid><description>
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;utm_m...</description><pubDate>Tue, 16 Sep 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;The Endgame : Building an Autonomous Optimization Pipeline for Apache Iceberg&lt;/h1&gt;
&lt;p&gt;Over the past nine posts, we’ve walked through the strategies, techniques, and tools you can use to keep your Apache Iceberg tables optimized for performance, cost, and reliability. Now, it’s time to put it all together.&lt;/p&gt;
&lt;p&gt;In this final post of the series, we’ll explore how to build an &lt;strong&gt;autonomous optimization pipeline&lt;/strong&gt;: a system that intelligently monitors your Iceberg tables and triggers the right actions automatically, without manual intervention.&lt;/p&gt;
&lt;h2&gt;What Does Autonomous Optimization Look Like?&lt;/h2&gt;
&lt;p&gt;An autonomous pipeline for Iceberg optimization should:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Continuously monitor table metadata&lt;/li&gt;
&lt;li&gt;Detect symptoms of degradation (e.g., small files, bloated manifests)&lt;/li&gt;
&lt;li&gt;Dynamically trigger the right optimization actions&lt;/li&gt;
&lt;li&gt;Recover gracefully from failure&lt;/li&gt;
&lt;li&gt;Integrate seamlessly with ingestion and query operations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This makes your lakehouse &lt;strong&gt;self-healing&lt;/strong&gt;, scalable, and easier to maintain - especially across many datasets.&lt;/p&gt;
&lt;h2&gt;Core Components of the Pipeline&lt;/h2&gt;
&lt;h3&gt;1. &lt;strong&gt;Metadata Intelligence Layer&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Leverage Iceberg’s built-in metadata tables to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Analyze file sizes and counts&lt;/li&gt;
&lt;li&gt;Track snapshot growth&lt;/li&gt;
&lt;li&gt;Monitor partition health&lt;/li&gt;
&lt;li&gt;Flag layout drift (e.g., outdated sort orders or clustering)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example diagnostic query:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT partition, COUNT(*) AS file_count, AVG(file_size_in_bytes) AS avg_file_size
FROM my_table.files
GROUP BY partition
HAVING COUNT(*) &amp;gt; 20 AND AVG(file_size_in_bytes) &amp;lt; 128000000;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This layer becomes the decision-maker for whether compaction or cleanup is needed.&lt;/p&gt;
&lt;h3&gt;2. Orchestration Layer&lt;/h3&gt;
&lt;p&gt;Use a scheduling tool like Airflow, Dagster, or dbt Cloud to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Run diagnostic checks on a schedule&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Execute Spark/Flink optimization jobs conditionally&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Log and track outcomes&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Handle retries and alerting&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A sample DAG might include:&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;check_small_files task&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;trigger_compaction task&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;expire_snapshots task&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rewrite_manifests task&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each can be run only if certain thresholds are met.&lt;/p&gt;
&lt;h3&gt;3. Execution Layer&lt;/h3&gt;
&lt;p&gt;Trigger physical optimizations using:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Spark actions (RewriteDataFiles, ExpireSnapshots, RewriteManifests)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Flink background jobs (especially for streaming pipelines)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Dremio OPTIMIZE and VACUUM&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All actions should be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Scoped to affected partitions&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Tuned for parallelism&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Capable of partial progress&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Observability and Logging&lt;/h3&gt;
&lt;p&gt;Feed metrics into dashboards and alerts using tools like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Prometheus + Grafana&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Datadog&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;CloudWatch&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Track:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Number of files compacted&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Snapshots expired&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Runtime per job&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Failed vs succeeded partitions&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This allows you to adjust thresholds and tuning parameters over time.&lt;/p&gt;
&lt;h3&gt;5. Storage Cleanup (GC)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;After snapshots are expired, unreferenced files need to be deleted.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Ensure cleanup happens after expiration jobs, not in parallel.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Benefits of an Autonomous Pipeline&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Consistent Performance:&lt;/strong&gt; Tables stay fast without manual tuning&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Operational Efficiency:&lt;/strong&gt; No more ad hoc optimization jobs&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Scalability:&lt;/strong&gt; Works across 10 tables or 10,000 tables&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Governance-Ready:&lt;/strong&gt; All changes are tracked, repeatable, and policy-driven&lt;/p&gt;
&lt;h2&gt;Final Thoughts&lt;/h2&gt;
&lt;p&gt;Iceberg&apos;s flexibility and rich metadata layer make it uniquely suited to autonomous data management. By combining:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Real-time metadata insight&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Targeted optimization strategies&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Smart orchestration&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Catalog-aware execution&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can build a lakehouse that optimizes itself - freeing your data team to focus on innovation, not maintenance.&lt;/p&gt;
&lt;h2&gt;Where to Go from Here&lt;/h2&gt;
&lt;p&gt;If you’ve followed this series from the beginning, you now have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A deep understanding of how Iceberg tables degrade&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Tools to address compaction, clustering, and metadata bloat&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The blueprint for a modern, self-tuning optimization pipeline&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Thanks for reading - and keep building faster, cleaner, and smarter Iceberg lakehouses.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Managing Large-Scale Optimizations  – Parallelism, Checkpointing, and Fail Recovery</title><link>https://iceberglakehouse.com/posts/iceberg-large-scale-optimization/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-large-scale-optimization/</guid><description>
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;utm_m...</description><pubDate>Tue, 09 Sep 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Managing Large-Scale Optimizations : Parallelism, Checkpointing, and Fail Recovery&lt;/h1&gt;
&lt;p&gt;When working with Apache Iceberg at scale, optimization jobs can become heavy and time-consuming. Rewriting thousands of files, scanning massive partitions, and coordinating metadata updates requires careful execution planning - especially in environments with limited compute or strict SLAs.&lt;/p&gt;
&lt;p&gt;In this post, we’ll look at strategies for making compaction and metadata cleanup operations &lt;strong&gt;scalable, resilient, and efficient&lt;/strong&gt;, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tuning parallelism&lt;/li&gt;
&lt;li&gt;Using partition pruning&lt;/li&gt;
&lt;li&gt;Applying checkpointing for long-running jobs&lt;/li&gt;
&lt;li&gt;Handling failures safely and automatically&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Why Scaling Optimization Matters&lt;/h2&gt;
&lt;p&gt;As your Iceberg tables grow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;File counts increase&lt;/li&gt;
&lt;li&gt;Partition cardinality rises&lt;/li&gt;
&lt;li&gt;Manifest files balloon&lt;/li&gt;
&lt;li&gt;Compaction jobs touch terabytes of data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Without scaling strategies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Jobs may fail due to timeouts or memory errors&lt;/li&gt;
&lt;li&gt;Optimization may lag behind ingestion&lt;/li&gt;
&lt;li&gt;Query performance continues to degrade despite efforts&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;1. Leveraging Partition Pruning&lt;/h2&gt;
&lt;p&gt;Partition pruning ensures that only the parts of the table that need compaction are touched.&lt;/p&gt;
&lt;p&gt;Use metadata tables to target only problem areas:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT partition
FROM my_table.files
GROUP BY partition
HAVING COUNT(*) &amp;gt; 20 AND AVG(file_size_in_bytes) &amp;lt; 100000000;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can then pass this list to a compaction job to limit the scope of the rewrite.&lt;/p&gt;
&lt;h2&gt;2. Tuning Parallelism in Spark or Flink&lt;/h2&gt;
&lt;p&gt;Large optimization jobs should run with enough parallel tasks to distribute I/O and computation.&lt;/p&gt;
&lt;p&gt;In Spark:
Use &lt;code&gt;spark.sql.shuffle.partitions&lt;/code&gt; to increase default parallelism.&lt;/p&gt;
&lt;p&gt;Tune executor memory and cores to handle larger partitions.&lt;/p&gt;
&lt;p&gt;Use &lt;code&gt;.option(&amp;quot;partial-progress.enabled&amp;quot;, true)&lt;/code&gt; for better resilience in Iceberg actions.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;spark.conf.set(&amp;quot;spark.sql.shuffle.partitions&amp;quot;, &amp;quot;200&amp;quot;)

Actions.forTable(spark, table)
  .rewriteDataFiles()
  .option(&amp;quot;min-input-files&amp;quot;, &amp;quot;5&amp;quot;)
  .option(&amp;quot;partial-progress.enabled&amp;quot;, &amp;quot;true&amp;quot;)
  .execute()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In Flink:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Use fine-grained task managers&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Enable incremental compaction and checkpointing&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;3. Incremental and Windowed Compaction&lt;/h2&gt;
&lt;p&gt;Don’t try to compact the entire table at once. Instead:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Group partitions into batches&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use rolling windows (e.g., compact N partitions per hour)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Resume from the last successfully compacted partition on failure&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You can build this logic into orchestration tools like Airflow or Dagster.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;4. Checkpointing and Partial Progress&lt;/h2&gt;
&lt;p&gt;Iceberg supports partial progress mode in Spark:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;.option(&amp;quot;partial-progress.enabled&amp;quot;, &amp;quot;true&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This allows successfully compacted partitions to commit, even if others fail, making retries cheaper and safer.&lt;/p&gt;
&lt;p&gt;In Flink, this is handled more granularly via stateful streaming checkpointing.&lt;/p&gt;
&lt;h2&gt;5. Retry and Failover Strategies&lt;/h2&gt;
&lt;p&gt;Wrap compaction logic in robust retry mechanisms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Use exponential backoff&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Separate retries by partition&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Alert on repeated failures for human intervention&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example, in Airflow:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;PythonOperator(
    task_id=&amp;quot;compact_partition&amp;quot;,
    python_callable=run_compaction,
    retries=3,
    retry_delay=timedelta(minutes=5)
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Also consider:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Writing logs to object storage for audit&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Emitting metrics to Prometheus/Grafana for observability&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;6. Monitoring Job Health&lt;/h2&gt;
&lt;p&gt;Track:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Job duration&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Files rewritten vs skipped&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Failed partitions&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Number of manifests reduced&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Snapshot size pre- and post-job&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;These metrics help tune parameters and detect regressions over time.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;Scaling Iceberg optimization jobs requires thoughtful execution planning:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Use metadata to limit scope&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Tune parallelism to avoid resource waste&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use partial progress and checkpointing to survive failure&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Automate retries and monitor outcomes&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the final post of this series, we’ll bring it all together - showing how to build a fully autonomous optimization pipeline using orchestration, metadata triggers, and smart defaults.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Unlocking the Power of Agentic AI with Apache Iceberg and Dremio</title><link>https://iceberglakehouse.com/posts/2025-09-agentic-ai-dremio-apache-iceberg/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-09-agentic-ai-dremio-apache-iceberg/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Fri, 05 Sep 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse-meetups&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse-meetups&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse-meetups&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0?utm_source=merced&amp;amp;utm_medium=affiliate&amp;amp;utm_campaign=book_merced&amp;amp;a_aid=merced&amp;amp;a_bid=7eac4151&quot;&gt;Purchase &amp;quot;Architecting an Apache Iceberg Lakehouse&amp;quot; (50% Off with Code MLMerced)&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Agentic AI is quickly moving from the whiteboard to production. These aren’t just smarter chatbots - they&apos;re intelligent systems that reason, learn, and act with autonomy. They summarize research, manage operations, and even coordinate complex workflows. But while models have become more capable, they still hit a wall without the right data infrastructure.&lt;/p&gt;
&lt;p&gt;That wall? It&apos;s not just about storage - it&apos;s about access, performance, and context.&lt;/p&gt;
&lt;p&gt;Many organizations building AI agents find themselves struggling with data silos, unpredictable performance, and a lack of clarity around what the data actually means. The result? Agents that stall, generate shallow results, or make the wrong decisions altogether.&lt;/p&gt;
&lt;p&gt;To unlock the full potential of Agentic AI, we need to rethink how our data platforms are designed. This is where Apache Iceberg and Dremio come in. Together, they provide a modern, open lakehouse architecture that solves the three core bottlenecks to AI success:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Frictionless access to enterprise data (without data wrangling or replication)&lt;/li&gt;
&lt;li&gt;Autonomous, high-performance query acceleration (built for dynamic workloads)&lt;/li&gt;
&lt;li&gt;A semantic layer that gives agents the context they need to understand and act&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this post, we’ll break down each of these challenges - and show how Iceberg and Dremio together build the intelligent data backbone your AI agents need to thrive.&lt;/p&gt;
&lt;h2&gt;The 3 Bottlenecks Blocking Agentic AI from Delivering Real Impact&lt;/h2&gt;
&lt;p&gt;As promising as Agentic AI is, most organizations hit the same three roadblocks on the path to real-world success. These aren&apos;t just technical hurdles - they&apos;re architectural challenges that undermine the speed, accuracy, and reliability of intelligent agents.&lt;/p&gt;
&lt;p&gt;Let’s break them down:&lt;/p&gt;
&lt;h3&gt;1. Access to Data: Silos, Bottlenecks, and Delays&lt;/h3&gt;
&lt;p&gt;AI agents need a holistic view of your enterprise to operate effectively - marketing data, operational logs, customer records, product telemetry, and more. But in most environments, that data is scattered across:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Cloud storage systems&lt;/li&gt;
&lt;li&gt;Operational databases&lt;/li&gt;
&lt;li&gt;SaaS platforms&lt;/li&gt;
&lt;li&gt;Departmental data warehouses&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each of these systems may have different governance rules, inconsistent formats, or delayed ETL pipelines. Worse, getting access often requires waiting on central data teams or replicating data manually. This slows down experimentation and limits what your agents can “see.”&lt;/p&gt;
&lt;h3&gt;2. Performant Access: When Every Millisecond Counts&lt;/h3&gt;
&lt;p&gt;Even when agents can access data, they still need it fast. AI workflows: especially agentic ones, are dynamic and unpredictable. One minute it’s a lookup query; the next it’s a multi-join aggregation across several sources. Traditional performance tuning: manual partitioning, index maintenance, and query tuning, can’t keep up.&lt;/p&gt;
&lt;p&gt;Agents can’t wait minutes for answers. They need sub-second response times to chain actions together effectively. Without autonomous performance management, latency becomes a dealbreaker.&lt;/p&gt;
&lt;h3&gt;3. Semantic Meaning: Knowing What the Data &lt;em&gt;Actually&lt;/em&gt; Means&lt;/h3&gt;
&lt;p&gt;Access and speed are critical - but so is &lt;strong&gt;understanding&lt;/strong&gt;. AI agents need context to interpret data correctly. What does &lt;code&gt;customer_type = 2&lt;/code&gt; actually mean? Is “margin” defined the same way in marketing and finance? Without a shared semantic layer, agents operate on guesswork.&lt;/p&gt;
&lt;p&gt;This is where many AI initiatives fail quietly. Outputs look correct on the surface but are misaligned with how the business actually thinks about its data.&lt;/p&gt;
&lt;p&gt;Solving these challenges requires more than patchwork fixes. It demands a new kind of data architecture - one that is open, intelligent, and built for automation. And that’s where Apache Iceberg and Dremio make all the difference.&lt;/p&gt;
&lt;h2&gt;Apache Iceberg: The Open Foundation for AI-Ready Data&lt;/h2&gt;
&lt;p&gt;When it comes to building a scalable, AI-optimized data platform, Apache Iceberg is the backbone that holds it all together. It’s not just another table format - it’s the evolution of how data is organized, versioned, and accessed in modern analytics and AI environments.&lt;/p&gt;
&lt;p&gt;Think of Iceberg like the index in a giant filing cabinet. It doesn’t just store your data - it brings order, consistency, and flexibility to your data lake, making it feel like a fully featured data warehouse without giving up the openness of object storage.&lt;/p&gt;
&lt;h3&gt;Why Apache Iceberg Matters for Agentic AI&lt;/h3&gt;
&lt;p&gt;Agentic AI requires access to data that is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Consistent&lt;/strong&gt;: So the same query always returns the same answer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evolvable&lt;/strong&gt;: So schema changes don’t break downstream pipelines.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Portable&lt;/strong&gt;: So any tool: Spark, Flink, Dremio, or even your AI agents, can access it without vendor lock-in.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apache Iceberg delivers all of this with features like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Schema evolution&lt;/strong&gt;: Add, drop, rename columns without rewriting data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time travel&lt;/strong&gt;: Query data “as of” any point in time, ideal for audits or AI state comparisons.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hidden partitioning&lt;/strong&gt;: Optimize performance without complicating your SQL.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ACID transactions&lt;/strong&gt;: Ensure atomic, consistent updates in multi-writer environments.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By standardizing on Iceberg, your organization can avoid the “tool wars” between departments. Everyone works from the same data foundation, using the tools they prefer - whether it’s SQL notebooks, BI dashboards, or LLM-powered agents.&lt;/p&gt;
&lt;h3&gt;The Lakehouse Advantage&lt;/h3&gt;
&lt;p&gt;Iceberg unlocks the full potential of the &lt;strong&gt;lakehouse&lt;/strong&gt; model: combining the flexibility of a data lake with the performance and structure of a data warehouse. This modular approach means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Teams aren’t forced to centralize around one compute engine.&lt;/li&gt;
&lt;li&gt;You avoid redundant data copies and ETL pipelines.&lt;/li&gt;
&lt;li&gt;AI agents can query directly from the lakehouse with open standards.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In short, Apache Iceberg makes your data open, unified, and production-grade - everything your intelligent agents need to act with confidence.&lt;/p&gt;
&lt;h2&gt;Dremio: The Intelligent Data Interface for Agentic AI&lt;/h2&gt;
&lt;p&gt;Apache Iceberg gives you the open foundation - but Dremio turns that foundation into an intelligent, AI-ready platform. Think of Dremio as the &lt;strong&gt;control plane&lt;/strong&gt; that gives both humans and AI agents seamless access to the data they need, with speed, security, and semantic understanding built in.&lt;/p&gt;
&lt;p&gt;Let’s explore how Dremio removes the remaining friction across access, performance, and context.&lt;/p&gt;
&lt;h3&gt;Unified Access Across All Data (Federation + Simplified Governance)&lt;/h3&gt;
&lt;p&gt;Even in the best-case scenario, not all your data will live in Iceberg tables. You still have data in relational databases, SaaS tools, cloud data warehouses, and more.&lt;/p&gt;
&lt;p&gt;This is where Dremio’s &lt;strong&gt;Zero-ETL Federation&lt;/strong&gt; shines. Dremio connects directly to all your sources: whether it’s Amazon S3, PostgreSQL, Salesforce, or MongoDB, and lets you query them &lt;strong&gt;in place&lt;/strong&gt;, without copying data or building fragile pipelines.&lt;/p&gt;
&lt;p&gt;Benefits for Agentic AI:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Agents can query the full landscape of enterprise data through a &lt;strong&gt;single interface&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Centralized access control means fewer credentials to manage or expose.&lt;/li&gt;
&lt;li&gt;Real-time insights from operational systems without waiting on ingestion jobs.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Autonomous Performance for Unpredictable Workloads&lt;/h3&gt;
&lt;p&gt;Agentic AI is dynamic by nature - queries change based on real-time decisions. You can&apos;t rely on hand-tuned optimizations or static dashboards.&lt;/p&gt;
&lt;p&gt;Dremio solves this with &lt;strong&gt;autonomous performance management&lt;/strong&gt;, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Automatic Iceberg Table Optimization&lt;/strong&gt;: Dremio continuously compacts small files, sorts data, and maintains metadata health to reduce I/O and boost query speed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reflections&lt;/strong&gt;: Dremio’s version of intelligent materialized views. They’re automatically created, updated incrementally, and substituted at query time - so your agents get faster results without changing their SQL.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-layered Caching&lt;/strong&gt;: From query plans to result sets to object storage blocks, Dremio caches intelligently to accelerate repeat workloads and reduce cloud costs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This means whether your AI agent is summarizing a dashboard or crunching through user logs, it gets fast, consistent results - without human intervention.&lt;/p&gt;
&lt;h3&gt;Built-in Semantic Layer for Shared Understanding&lt;/h3&gt;
&lt;p&gt;To generate meaningful insights, agents need to understand not just what data &lt;em&gt;is&lt;/em&gt;, but what it &lt;em&gt;means&lt;/em&gt;. Dremio provides a native &lt;strong&gt;semantic layer&lt;/strong&gt; that bridges that gap.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Semantic Search&lt;/strong&gt;: Agents and users can discover datasets using natural language.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Modeling&lt;/strong&gt;: Define reusable business logic, KPIs, and metrics as views.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Auto-Generated Wikis&lt;/strong&gt;: Every dataset can include human-readable descriptions - great for onboarding both analysts and AI systems.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fine-Grained Access Control&lt;/strong&gt;: Row- and column-level security ensures agents see only what they’re authorized to.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And with Dremio’s &lt;strong&gt;MCP server&lt;/strong&gt;, your agents can programmatically explore metadata, access semantic context, and generate more accurate queries.&lt;/p&gt;
&lt;p&gt;Dremio doesn’t just connect to your data - it understands it, optimizes it, and makes it consumable by anyone (or anything) that needs it. For Agentic AI, this is the difference between guesswork and precision.&lt;/p&gt;
&lt;h2&gt;Closing the Loop: Iceberg + Dremio = AI-Optimized Lakehouse&lt;/h2&gt;
&lt;p&gt;When you bring Apache Iceberg and Dremio together, you don’t just get a modern data stack - you get a foundation built for the realities of Agentic AI.&lt;/p&gt;
&lt;p&gt;Let’s recap how these technologies align to eliminate the core bottlenecks we explored earlier:&lt;/p&gt;
&lt;h3&gt;✅ Unlocking Access&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Apache Iceberg&lt;/strong&gt; standardizes how data is stored, making it accessible across tools and teams.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio&lt;/strong&gt; federates access to all your data sources: cloud, on-prem, SaaS, and more, without the overhead of ETL or manual integration.&lt;/li&gt;
&lt;li&gt;AI agents can now query the full enterprise landscape through a single interface, using a single set of credentials, securely and efficiently.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;✅ Delivering Performance Autonomously&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Iceberg enables high-performance table management (partitioning, file pruning, metadata tracking).&lt;/li&gt;
&lt;li&gt;Dremio automates this further - handling compaction, caching, and query acceleration behind the scenes.&lt;/li&gt;
&lt;li&gt;Reflections, smart caching, and autonomous query optimization ensure agents get sub-second responses, no matter how complex or spontaneous the query.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;✅ Embedding Context Through Semantics&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Iceberg brings structure to your data lake, but Dremio gives it &lt;strong&gt;meaning&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Through Dremio’s built-in semantic layer and MCP server, your AI agents can interpret, navigate, and reason about data the way your business does.&lt;/li&gt;
&lt;li&gt;Whether it’s knowing what “active customer” means or filtering by business unit, Dremio gives your agents the vocabulary to deliver trusted outcomes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The result is a truly intelligent lakehouse - open, unified, performant, and semantically rich. One that doesn’t just serve humans, but empowers agents to act, adapt, and deliver real business value.&lt;/p&gt;
&lt;p&gt;If Agentic AI is your destination, Apache Iceberg and Dremio are the road and the vehicle that will take you there.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=agentic-ai&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Get hands-on with Dremio and Apache Iceberg today&lt;/a&gt; and start building the intelligent data foundation your AI agents need to thrive.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Hidden Pitfalls  – Compaction and Partition Evolution in Apache Iceberg</title><link>https://iceberglakehouse.com/posts/iceberg-partition-evolution-compaction/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-partition-evolution-compaction/</guid><description>
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;utm_m...</description><pubDate>Tue, 02 Sep 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Hidden Pitfalls : Compaction and Partition Evolution in Apache Iceberg&lt;/h1&gt;
&lt;p&gt;Apache Iceberg offers &lt;strong&gt;partition evolution&lt;/strong&gt;, allowing you to change how your data is partitioned over time without rewriting historical files. This is a major advantage over legacy file formats, but it also introduces new challenges - especially when it comes to &lt;strong&gt;compaction and query optimization&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In this post, we’ll explore how partition evolution can impact compaction, metadata management, and query performance - and how to avoid the most common pitfalls.&lt;/p&gt;
&lt;h2&gt;What Is Partition Evolution?&lt;/h2&gt;
&lt;p&gt;Partition evolution allows you to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Add new partition fields&lt;/li&gt;
&lt;li&gt;Drop old partition fields&lt;/li&gt;
&lt;li&gt;Change partition transforms (e.g., from &lt;code&gt;day(ts)&lt;/code&gt; to &lt;code&gt;hour(ts)&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Unlike traditional systems that enforce a single static layout, Iceberg lets you evolve the partitioning strategy without rewriting or invalidating historical data.&lt;/p&gt;
&lt;h3&gt;Example:&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Original partitioning
ALTER TABLE sales ADD PARTITION FIELD day(order_date);

-- Later evolve to hourly
ALTER TABLE sales DROP PARTITION FIELD day(order_date);
ALTER TABLE sales ADD PARTITION FIELD hour(order_date);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each snapshot will respect the partition spec that was active at the time the data was written.&lt;/p&gt;
&lt;h2&gt;The Pitfall: Compaction Across Partition Specs&lt;/h2&gt;
&lt;p&gt;When compaction jobs span files written under different partition specs, several challenges arise:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;File Layout Inconsistency
Compaction may combine files that don’t share a common layout, resulting in mixed partition values that reduce query pruning efficiency.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Reduced Predicate Pushdown
Query engines rely on partition columns for efficient pruning. If files are mixed across specs, pruning may be incomplete, increasing scan cost.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Compaction Failures or Misbehavior
Some engines may fail to rewrite or rewrite files improperly when specs conflict, especially in older versions of Iceberg libraries or poorly configured environments.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Best Practices to Manage Partition Evolution Safely&lt;/h2&gt;
&lt;h3&gt;1. Compact Within Partition Spec Versions&lt;/h3&gt;
&lt;p&gt;Query the files metadata table to identify which files belong to which spec:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;Copy
Edit
SELECT spec_id, COUNT(*) AS file_count
FROM my_table.files
GROUP BY spec_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run compaction per spec_id to preserve consistency and avoid mixing files.&lt;/p&gt;
&lt;h3&gt;2. Track and Align Sorting and Clustering&lt;/h3&gt;
&lt;p&gt;When evolving partitions, ensure that sort orders are also updated. Mismatched sort and partition strategies can undermine clustering efforts.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT spec_id, sort_order_id, COUNT(*)
FROM my_table.files
GROUP BY spec_id, sort_order_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Repartition Carefully and Gradually&lt;/h3&gt;
&lt;p&gt;Avoid abrupt changes like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Switching from coarse to fine partitioning (e.g., day to minute)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Dropping too many partition fields at once&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;These can lead to over-fragmentation and more small files unless paired with compaction and sort order realignment.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Use Metadata Tables to Guide Evolution&lt;/h3&gt;
&lt;p&gt;Before evolving a partition spec:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Inspect query patterns (e.g., WHERE clauses)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Evaluate partition sizes and access frequencies&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use tools like Dremio’s catalog lineage and query analyzer if available&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. Communicate Changes Across Teams&lt;/h3&gt;
&lt;p&gt;If your tables are used across multiple teams or tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Document changes to partitioning logic&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Include schema and partition spec history in data documentation&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Coordinate compaction jobs after major partition changes&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;Partition evolution is one of Iceberg’s superpowers - but like all powerful features, it must be used wisely. To avoid performance and optimization issues:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Don’t mix files with different partition specs in compaction jobs&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Update sort orders and clustering with partition changes&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Monitor partition usage and access patterns continuously&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the next post, we’ll move from structural design to execution tuning - exploring how to scale compaction operations efficiently using parallelism, checkpointing, and fault tolerance.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Using Iceberg Metadata Tables to Determine When Compaction Is Needed</title><link>https://iceberglakehouse.com/posts/iceberg-metadata-triggered-compaction/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-metadata-triggered-compaction/</guid><description>
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;utm_m...</description><pubDate>Tue, 26 Aug 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Using Iceberg Metadata Tables to Determine When Compaction Is Needed&lt;/h1&gt;
&lt;p&gt;Scheduling compaction at fixed intervals is better than not optimizing at all - but it can still lead to unnecessary compute spend or delayed maintenance. A smarter approach is to &lt;strong&gt;dynamically trigger compaction&lt;/strong&gt; based on &lt;strong&gt;real-time metadata signals&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Apache Iceberg makes this possible with its powerful system of &lt;strong&gt;metadata tables&lt;/strong&gt;, which expose granular details about files, snapshots, and manifests.&lt;/p&gt;
&lt;p&gt;In this post, we&apos;ll explore how to query these tables to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Detect small files&lt;/li&gt;
&lt;li&gt;Identify bloated partitions&lt;/li&gt;
&lt;li&gt;Spot manifest inefficiencies&lt;/li&gt;
&lt;li&gt;Automate event-driven compaction workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What Are Iceberg Metadata Tables?&lt;/h2&gt;
&lt;p&gt;Every Iceberg table automatically maintains a set of virtual tables that expose its internals. The most relevant for optimization include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;files&lt;/code&gt; – List of all data files in the table, including size, partition, and metrics&lt;/li&gt;
&lt;li&gt;&lt;code&gt;manifests&lt;/code&gt; – List of manifest files and the data files they reference&lt;/li&gt;
&lt;li&gt;&lt;code&gt;snapshots&lt;/code&gt; – History of table changes and snapshot metadata&lt;/li&gt;
&lt;li&gt;&lt;code&gt;history&lt;/code&gt; – Timeline of snapshot commits and their lineage&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These tables can be queried like any other SQL table, making it easy to introspect your table’s health.&lt;/p&gt;
&lt;h2&gt;1. Detecting Small Files with the &lt;code&gt;files&lt;/code&gt; Table&lt;/h2&gt;
&lt;p&gt;To identify partitions suffering from small file syndrome:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  partition,
  COUNT(*) AS file_count,
  AVG(file_size_in_bytes) AS avg_size_bytes
FROM my_table.files
GROUP BY partition
HAVING COUNT(*) &amp;gt; 10 AND AVG(file_size_in_bytes) &amp;lt; 134217728; -- 128 MB
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can use this to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Trigger compaction on specific partitions&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Monitor trends in file size distribution over time&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;2. Finding Fragmented or Stale Manifests&lt;/h2&gt;
&lt;p&gt;Bloated metadata can come from too many or inefficient manifest files. Use the manifests table to explore:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  COUNT(*) AS manifest_count,
  AVG(added_data_files_count) AS avg_files_per_manifest
FROM my_table.manifests;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Low averages can indicate fragmented manifests that are good candidates for rewriting.&lt;/p&gt;
&lt;h2&gt;3. Tracking Snapshot Volume and Velocity&lt;/h2&gt;
&lt;p&gt;To see if snapshots are accumulating too fast (and increasing metadata overhead):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  COUNT(*) AS snapshot_count,
  MIN(committed_at) AS first_snapshot,
  MAX(committed_at) AS latest_snapshot
FROM my_table.snapshots;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also inspect how many files each snapshot adds or removes to identify noisy patterns from ingestion jobs.&lt;/p&gt;
&lt;h2&gt;4. Building a Health Score&lt;/h2&gt;
&lt;p&gt;By combining file count, file size, manifest count, and snapshot frequency, you can compute a &amp;quot;table health score&amp;quot;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Example: High file count + small average size = poor health
WITH file_stats AS (
  SELECT COUNT(*) AS total_files, AVG(file_size_in_bytes) AS avg_file_size
  FROM my_table.files
),
manifest_stats AS (
  SELECT COUNT(*) AS total_manifests
  FROM my_table.manifests
)
SELECT
  total_files,
  avg_file_size,
  total_manifests,
  CASE
    WHEN avg_file_size &amp;lt; 67108864 AND total_files &amp;gt; 1000 THEN &apos;Needs compaction&apos;
    ELSE &apos;Healthy&apos;
  END AS status
FROM file_stats, manifest_stats;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;5. Triggering Compaction Automatically&lt;/h2&gt;
&lt;p&gt;Once you identify problematic patterns, you can wire up your orchestration layer to act:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Use Airflow, Dagster, or dbt Cloud to run SQL-based checks&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When thresholds are breached, trigger Spark/Flink compaction jobs&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Track results and update monitoring dashboards&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;This ensures you optimize only when needed, keeping costs and latency low.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Benefits of Metadata-Driven Optimization&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Precision: Only touch affected partitions&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Efficiency: Avoid unnecessary compute jobs&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Responsiveness: React to real-time ingestion patterns&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Governance: Create audit trails for all compaction decisions&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;Apache Iceberg gives you visibility and control over your tables through metadata tables. By tapping into this metadata:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;You avoid blind scheduling of compaction&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You build smarter, more efficient optimization workflows&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You reduce both query latency and operational cost&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the next post, we’ll dive into partition evolution and layout pitfalls, and how to avoid undermining your compaction and clustering strategies when schemas or partitions change.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Designing the Ideal Cadence for Compaction and Snapshot Expiration</title><link>https://iceberglakehouse.com/posts/iceberg-optimization-cadence/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-optimization-cadence/</guid><description>
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;utm_m...</description><pubDate>Tue, 19 Aug 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Designing the Ideal Cadence for Compaction and Snapshot Expiration&lt;/h1&gt;
&lt;p&gt;In previous posts, we explored how compaction and snapshot expiration keep Apache Iceberg tables performant and lean. But these actions aren’t one-and-done - they need to be &lt;strong&gt;scheduled strategically&lt;/strong&gt; to balance compute cost, data freshness, and operational safety.&lt;/p&gt;
&lt;p&gt;In this post, we’ll look at how to design a &lt;strong&gt;cadence&lt;/strong&gt; for compaction and snapshot expiration based on your workload patterns, data criticality, and infrastructure constraints.&lt;/p&gt;
&lt;h2&gt;Why Cadence Matters&lt;/h2&gt;
&lt;p&gt;Without a thoughtful schedule:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Over-optimization&lt;/strong&gt; can waste compute and create unnecessary load&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Under-optimization&lt;/strong&gt; leads to performance degradation and metadata bloat&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Poor coordination&lt;/strong&gt; can cause clashes with ingestion or query jobs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You need a cadence that fits your data’s lifecycle and your platform’s SLAs.&lt;/p&gt;
&lt;h2&gt;Key Factors to Consider&lt;/h2&gt;
&lt;h3&gt;1. &lt;strong&gt;Ingestion Rate and Pattern&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Streaming data?&lt;/strong&gt; Expect high file churn. Compact frequently (hourly or near-real-time).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Batch jobs?&lt;/strong&gt; Compact after each large load or on a daily schedule.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hybrid?&lt;/strong&gt; Monitor ingestion metrics and trigger compaction based on thresholds.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. &lt;strong&gt;Query Frequency and Latency Expectations&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;High query volume tables&lt;/strong&gt; benefit from more frequent compaction to improve scan performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Low-usage tables&lt;/strong&gt; can tolerate more infrequent optimization.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. &lt;strong&gt;Storage Costs and File System Limits&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Cloud storage costs can balloon with small files and lingering unreferenced data.&lt;/li&gt;
&lt;li&gt;File system metadata limits may also be a concern at massive scale.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. &lt;strong&gt;Retention and Governance Requirements&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Snapshots may need to be retained longer for audit or rollback policies.&lt;/li&gt;
&lt;li&gt;Balance expiration with compliance needs.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Suggested Cadence Models&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Compaction Cadence&lt;/th&gt;
&lt;th&gt;Snapshot Expiration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;High-volume streaming pipeline&lt;/td&gt;
&lt;td&gt;Hourly or event-based&lt;/td&gt;
&lt;td&gt;Daily, keep 1–3 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Daily batch ingestion&lt;/td&gt;
&lt;td&gt;Post-batch or nightly&lt;/td&gt;
&lt;td&gt;Weekly, keep 7–14 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low-latency analytics&lt;/td&gt;
&lt;td&gt;Hourly&lt;/td&gt;
&lt;td&gt;Daily, keep 3–5 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regulatory or audited data&lt;/td&gt;
&lt;td&gt;Weekly or on-demand&lt;/td&gt;
&lt;td&gt;Monthly, retain 30–90 days&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Use metadata queries (e.g., from &lt;code&gt;files&lt;/code&gt;, &lt;code&gt;manifests&lt;/code&gt;, &lt;code&gt;snapshots&lt;/code&gt;) to drive dynamic policies.&lt;/p&gt;
&lt;h2&gt;Automating the Schedule&lt;/h2&gt;
&lt;p&gt;You can use orchestration tools like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Airflow / Dagster / Prefect&lt;/strong&gt;: Schedule and monitor compaction and expiration tasks&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt Cloud&lt;/strong&gt;: Use post-run hooks or scheduled jobs to optimize models backed by Iceberg&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flink / Spark Streaming&lt;/strong&gt;: Trigger compaction inline or via micro-batch jobs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Tip: Tag critical jobs with priorities and isolate them from ingestion workloads where needed.&lt;/p&gt;
&lt;h2&gt;Coordinating Between Compaction and Expiration&lt;/h2&gt;
&lt;p&gt;Ideally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Compact first&lt;/strong&gt;, then &lt;strong&gt;expire snapshots&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;This ensures snapshots written by compaction are retained at least temporarily&lt;/li&gt;
&lt;li&gt;Avoid expiring snapshots too soon after compaction to prevent data loss&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Example Workflow:&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Run metadata scan to detect small file bloat&lt;/li&gt;
&lt;li&gt;Trigger compaction on affected partitions&lt;/li&gt;
&lt;li&gt;Delay snapshot expiration by a few hours&lt;/li&gt;
&lt;li&gt;Run snapshot expiration with a safety buffer&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Monitoring and Adjusting Over Time&lt;/h2&gt;
&lt;p&gt;Cadence isn’t static - adjust based on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Changing ingestion rates&lt;/li&gt;
&lt;li&gt;New query patterns&lt;/li&gt;
&lt;li&gt;Storage trends&lt;/li&gt;
&lt;li&gt;Platform feedback (slow queries, GC delays, etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Use logs, metadata tables, and query performance dashboards to guide adjustments.&lt;/p&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;An effective compaction and snapshot expiration cadence keeps your Iceberg tables fast, lean, and cost-effective. Your schedule should:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Match your workload patterns&lt;/li&gt;
&lt;li&gt;Respect operational and governance needs&lt;/li&gt;
&lt;li&gt;Be flexible and monitorable&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the next post, we’ll look at how to use &lt;strong&gt;Iceberg’s metadata tables&lt;/strong&gt; to dynamically determine &lt;em&gt;when&lt;/em&gt; optimization is needed - so you can make it event-driven instead of fixed-schedule.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Avoiding Metadata Bloat with Snapshot Expiration and Rewriting Manifests</title><link>https://iceberglakehouse.com/posts/iceberg-metadata-bloat-cleanup/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-metadata-bloat-cleanup/</guid><description>
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;utm_m...</description><pubDate>Tue, 12 Aug 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Avoiding Metadata Bloat with Snapshot Expiration and Rewriting Manifests&lt;/h1&gt;
&lt;p&gt;As your Apache Iceberg tables evolve: through continuous writes, schema changes, and compaction jobs, they generate a growing amount of &lt;strong&gt;metadata&lt;/strong&gt;. While metadata is a powerful feature in Iceberg, enabling time travel and auditability, &lt;strong&gt;unchecked metadata growth&lt;/strong&gt; can lead to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Slower planning and query times&lt;/li&gt;
&lt;li&gt;Increased storage costs&lt;/li&gt;
&lt;li&gt;Longer table commit and rollback operations&lt;/li&gt;
&lt;li&gt;Excessive memory usage during scans&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this post, we’ll explore how to &lt;strong&gt;expire old snapshots&lt;/strong&gt; and &lt;strong&gt;rewrite manifests&lt;/strong&gt; to keep your Iceberg tables lean, responsive, and cost-efficient.&lt;/p&gt;
&lt;h2&gt;What Causes Metadata Bloat?&lt;/h2&gt;
&lt;p&gt;Iceberg tracks table state through a series of &lt;strong&gt;snapshots&lt;/strong&gt;. Each snapshot references a set of &lt;strong&gt;manifest lists&lt;/strong&gt;, which in turn reference &lt;strong&gt;manifest files&lt;/strong&gt; describing individual data files.&lt;/p&gt;
&lt;p&gt;Bloat occurs when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Snapshots accumulate and are not expired&lt;/li&gt;
&lt;li&gt;Manifests are duplicated across snapshots&lt;/li&gt;
&lt;li&gt;Files are replaced by compaction but older snapshots still reference them&lt;/li&gt;
&lt;li&gt;Streaming ingestion creates frequent small commits, generating excessive metadata&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Expiring Snapshots&lt;/h2&gt;
&lt;p&gt;You can safely remove older snapshots using Iceberg’s built-in expiration functionality. This deletes metadata for snapshots that are no longer needed for time travel, rollback, or audit purposes.&lt;/p&gt;
&lt;h3&gt;Example in Spark:&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;import org.apache.iceberg.actions.Actions

Actions.forTable(spark, table)
  .expireSnapshots()
  .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7)) // keep 7 days
  .retainLast(2) // keep last 2 snapshots no matter what
  .execute();
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This keeps recent snapshots while cleaning up older ones, freeing up metadata and unreferenced data files (if garbage collection is also enabled).&lt;/p&gt;
&lt;h3&gt;Guidelines:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Retain at least a few recent snapshots for rollback safety&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use a time-based and count-based retention policy&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Coordinate expiration with your data governance policies&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Rewriting Manifests&lt;/h2&gt;
&lt;p&gt;Over time, manifest files can become inefficient:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Many may reference the same files across snapshots&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Some may contain only a few files due to small writes&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Their layout may be suboptimal for query planning&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You can rewrite manifests to consolidate and reorganize them for improved performance.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Example in Spark:&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;Actions.forTable(spark, table)
  .rewriteManifests()
  .execute();
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This reduces metadata file count, organizes manifests by partition and sort order, and can improve query planning times.&lt;/p&gt;
&lt;h2&gt;When Should You Perform Metadata Cleanup?&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;After large ingestion spikes (e.g., backfills)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Following streaming workloads with high commit frequency&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Post compaction or schema evolution&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;On a scheduled basis (e.g., daily or weekly)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Bonus: Use Metadata Tables to Inspect Bloat&lt;/h2&gt;
&lt;p&gt;Iceberg’s metadata tables help you inspect how much bloat has built up.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT snapshot_id, added_files_count, total_data_files_count
FROM my_table.snapshots
ORDER BY committed_at DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT COUNT(*) FROM my_table.manifests;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These insights can help you determine when cleanup is needed.&lt;/p&gt;
&lt;h2&gt;Tradeoffs and Cautions&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Snapshot expiration is irreversible: Make sure you don’t need the old snapshots for recovery or audit.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Manifests rewrites are safe but can be compute-intensive on large tables - schedule wisely.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Storage GC may require coordination with your catalog to clean up unreferenced files.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;Metadata is a powerful part of Iceberg’s architecture, but without routine maintenance, it can weigh down your table performance. By:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Expiring stale snapshots&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Rewriting bloated manifests&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Monitoring metadata tables regularly&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You ensure that your Iceberg tables remain agile, scalable, and ready for production workloads.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the next post, we’ll explore how to design the ideal cadence for compaction and snapshot expiration so your optimizations are timely and cost-effective.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Smarter Data Layout  – Sorting and Clustering Iceberg Tables</title><link>https://iceberglakehouse.com/posts/iceberg-clustering-sorting-zorder/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-clustering-sorting-zorder/</guid><description>
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;utm_m...</description><pubDate>Tue, 05 Aug 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Smarter Data Layout : Sorting and Clustering Iceberg Tables&lt;/h1&gt;
&lt;p&gt;So far in this series, we&apos;ve focused on optimizing file sizes to reduce metadata and scan overhead. But &lt;strong&gt;how data is laid out within those files&lt;/strong&gt; can be just as important as the size of the files themselves.&lt;/p&gt;
&lt;p&gt;In this post, we&apos;ll explore &lt;strong&gt;clustering techniques in Apache Iceberg&lt;/strong&gt;, including &lt;strong&gt;sort order&lt;/strong&gt; and &lt;strong&gt;Z-ordering&lt;/strong&gt;, and how these techniques improve query performance by reducing the amount of data that needs to be read.&lt;/p&gt;
&lt;h2&gt;Why Clustering Matters&lt;/h2&gt;
&lt;p&gt;Imagine a query that filters on a &lt;code&gt;customer_id&lt;/code&gt;. If your data is randomly distributed, every file needs to be scanned. But if the data is sorted or clustered, the engine can skip over entire files or row groups , reducing I/O and speeding up execution.&lt;/p&gt;
&lt;p&gt;Clustering benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fewer files and rows scanned&lt;/li&gt;
&lt;li&gt;Better compression ratios&lt;/li&gt;
&lt;li&gt;Faster joins and aggregations&lt;/li&gt;
&lt;li&gt;More efficient pruning of partitions and row groups&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Sorting in Iceberg&lt;/h2&gt;
&lt;p&gt;Iceberg supports &lt;strong&gt;sort order evolution&lt;/strong&gt;, which lets you define how data should be physically sorted as it&apos;s written or rewritten.&lt;/p&gt;
&lt;p&gt;You can define sort orders during write or compaction:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;import org.apache.iceberg.SortOrder
import static org.apache.iceberg.expressions.Expressions.*;

table.updateSortOrder()
  .sortBy(asc(&amp;quot;customer_id&amp;quot;), desc(&amp;quot;order_date&amp;quot;))
  .commit();
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Use Cases for Sorting&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Time-series data:&lt;/strong&gt; sort by event_time to improve range queries&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Dimension filters:&lt;/strong&gt; sort by commonly filtered columns like region, user_id&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Joins:&lt;/strong&gt; sort by join keys to speed up hash joins and reduce shuffling&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Z-order Clustering&lt;/h2&gt;
&lt;p&gt;Z-ordering is a multi-dimensional clustering technique that co-locates related values across multiple columns. It&apos;s ideal for exploratory queries that filter on different combinations of columns.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;table.updateSortOrder()
  .sortBy(zorder(&amp;quot;customer_id&amp;quot;, &amp;quot;product_id&amp;quot;, &amp;quot;region&amp;quot;))
  .commit();
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Z-ordering works by interleaving bits from multiple columns to keep related rows close together. This increases the chance that queries filtering on any subset of these columns can benefit from data skipping.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Z-ordering is supported by Iceberg through integrations like Dremio&apos;s Iceberg Auto-Clustering and Spark jobs using RewriteDataFiles.&lt;/p&gt;
&lt;h2&gt;Choosing Between Sort and Z-order&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Best Technique&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Filtering on one key column&lt;/td&gt;
&lt;td&gt;Simple Sort&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Range queries on timestamps&lt;/td&gt;
&lt;td&gt;Sort on time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-column filtering&lt;/td&gt;
&lt;td&gt;Z-order&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Joins on a key column&lt;/td&gt;
&lt;td&gt;Sort on join key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex OLAP-style filters&lt;/td&gt;
&lt;td&gt;Z-order&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;When to Apply Clustering&lt;/h2&gt;
&lt;p&gt;Clustering is typically applied:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;During initial writes, if the engine supports it&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;As part of compaction jobs, using RewriteDataFiles with sort order&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In Spark, you can specify sort order in rewrite actions:&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;Actions.forTable(spark, table)
  .rewriteDataFiles()
  .sortBy(&amp;quot;region&amp;quot;, &amp;quot;event_time&amp;quot;)
  .execute();
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Make sure the sort order aligns with your most frequent query patterns.&lt;/p&gt;
&lt;h2&gt;Tradeoffs&lt;/h2&gt;
&lt;p&gt;While clustering helps query performance, it comes with tradeoffs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Sorting increases job duration: Sorting is more expensive than just rewriting files&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Clustering can become outdated: Evolving data patterns may require adjusting sort orders&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Not all engines respect sort order: Make sure your query engine leverages the layout&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;Smart data layout is essential for fast queries in Apache Iceberg. By leveraging sorting and Z-order clustering:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;You reduce the volume of data scanned&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Improve filter selectivity&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Optimize performance for a wide variety of workloads&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the next post, we’ll look at another silent performance killer: metadata bloat, and how to clean it up using snapshot expiration and manifest rewriting.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Optimizing Compaction for Streaming Workloads in Apache Iceberg</title><link>https://iceberglakehouse.com/posts/iceberg-streaming-compaction/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-streaming-compaction/</guid><description>
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;utm_m...</description><pubDate>Tue, 29 Jul 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Optimizing Compaction for Streaming Workloads in Apache Iceberg&lt;/h1&gt;
&lt;p&gt;In traditional batch pipelines, compaction jobs can run in large windows during idle periods. But in streaming workloads, data is written continuously: often in small increments, leading to rapid small file accumulation and tight freshness requirements.&lt;/p&gt;
&lt;p&gt;So how do we compact Iceberg tables without interfering with ingestion and latency-sensitive reads? This post explores how to &lt;strong&gt;design efficient, incremental compaction jobs&lt;/strong&gt; that preserve performance without disrupting your streaming pipelines.&lt;/p&gt;
&lt;h2&gt;The Challenge with Streaming + Compaction&lt;/h2&gt;
&lt;p&gt;Streaming ingestion into Apache Iceberg often uses micro-batches or event-driven triggers that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Generate many small files per partition&lt;/li&gt;
&lt;li&gt;Write new snapshots frequently&lt;/li&gt;
&lt;li&gt;Introduce high metadata churn&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A naive compaction job that rewrites entire partitions or the whole table risks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Commit contention&lt;/strong&gt; with streaming jobs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stale data&lt;/strong&gt; in read replicas or downstream queries&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency spikes&lt;/strong&gt; if compaction blocks snapshot availability&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key is to &lt;strong&gt;optimize incrementally and intelligently.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;Techniques for Streaming-Safe Compaction&lt;/h2&gt;
&lt;h3&gt;1. &lt;strong&gt;Compact Only Cold Partitions&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Don’t rewrite partitions actively being written to. Instead:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Identify &amp;quot;cold&amp;quot; partitions (e.g., older than 1 hour if partioned by hour)&lt;/li&gt;
&lt;li&gt;Compact only those to avoid conflicts with streaming writes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example query using metadata table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT partition, COUNT(*) AS file_count
FROM my_table.files
WHERE last_modified &amp;lt; current_timestamp() - INTERVAL &apos;1 hour&apos;
GROUP BY partition
HAVING COUNT(*) &amp;gt; 10;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This can drive dynamic, safe compaction logic in orchestration tools.&lt;/p&gt;
&lt;h3&gt;2. Use Incremental Compaction Windows&lt;/h3&gt;
&lt;p&gt;Instead of full rewrites:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Compact only a subset of files at a time (e.g., oldest or smallest)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Avoid reprocessing already optimized files&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Reduce job run time to minutes instead of hours&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Spark&apos;s RewriteDataFiles and Dremio&apos;s &lt;code&gt;OPTIMIZE&lt;/code&gt; features both support targeted rewrites.&lt;/p&gt;
&lt;h3&gt;3. Trigger Based on Metadata Metrics&lt;/h3&gt;
&lt;p&gt;Rather than scheduling compaction at fixed intervals, use metadata-driven triggers like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Number of files per partition &amp;gt; threshold&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Average file size &amp;lt; target&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;File age &amp;gt; threshold&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can track these via files and manifests metadata tables and use orchestration tools (e.g., Airflow, Dagster, dbt Cloud) to trigger compaction.&lt;/p&gt;
&lt;p&gt;Example: Time-Based Compaction Script (Pseudo-code)&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# For each partition older than 1 hour with many small files
for partition in get_partitions_older_than(hours=1):
    if count_small_files(partition) &amp;gt; threshold:
        run_compaction(partition)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This pattern allows incremental, scoped jobs that don’t touch fresh data.&lt;/p&gt;
&lt;h2&gt;Tuning for Performance&lt;/h2&gt;
&lt;p&gt;Parallelism: Use high parallelism for wide tables to speed up job runtime&lt;/p&gt;
&lt;p&gt;Target file size: Stick to 128MB–256MB range unless your queries benefit from larger files&lt;/p&gt;
&lt;p&gt;Retries and check-pointing: Make sure jobs are fault-tolerant in production&lt;/p&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;To maintain performance in streaming Iceberg pipelines:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Compact frequently, but narrowly&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use metadata to guide scope&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Avoid active partitions and large rewrites&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Leverage orchestration and branching when available&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With the right setup, you can keep query performance and data freshness high - without sacrificing one for the other.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The Basics of Compaction  – Bin Packing Your Data for Efficiency</title><link>https://iceberglakehouse.com/posts/iceberg-optimization-compaction-basics/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-optimization-compaction-basics/</guid><description>
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;utm_m...</description><pubDate>Tue, 22 Jul 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;The Basics of Compaction : Bin Packing Your Data for Efficiency&lt;/h1&gt;
&lt;p&gt;In the first post of this series, we explored how Apache Iceberg tables degrade when left unoptimized. Now it&apos;s time to look at the most foundational optimization technique: &lt;strong&gt;compaction&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Compaction is the process of merging small files into larger ones to reduce file system overhead and improve query performance. In Iceberg, this usually takes the form of &lt;strong&gt;bin packing&lt;/strong&gt; : grouping smaller files together so they align with an optimal size target.&lt;/p&gt;
&lt;h2&gt;Why Bin Packing Matters&lt;/h2&gt;
&lt;p&gt;Query engines like Dremio, Trino, and Spark operate more efficiently when reading a smaller number of larger files instead of a large number of tiny files. Every file adds cost:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It triggers an I/O request&lt;/li&gt;
&lt;li&gt;It needs to be tracked in metadata&lt;/li&gt;
&lt;li&gt;It increases planning and scheduling complexity&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By merging many small files into fewer large files, compaction directly addresses:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Small file problem&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metadata bloat in manifests&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inefficient scan patterns&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;How Standard Compaction Works&lt;/h2&gt;
&lt;p&gt;A typical Iceberg compaction job involves:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Scanning the table&lt;/strong&gt; to identify small files below a certain threshold.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reading and coalescing records&lt;/strong&gt; from multiple small files within a partition.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Writing out new files&lt;/strong&gt; targeting an optimal size (commonly 128MB–512MB per file).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Creating a new snapshot&lt;/strong&gt; that references the new files and drops the older ones.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This process can be orchestrated using:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Apache Spark&lt;/strong&gt; with Iceberg’s &lt;code&gt;RewriteDataFiles&lt;/code&gt; action&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio&lt;/strong&gt; with its &lt;code&gt;OPTIMIZE&lt;/code&gt; command&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Example: Spark Action&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;import org.apache.iceberg.actions.Actions

Actions.forTable(spark, table)
  .rewriteDataFiles()
  .option(&amp;quot;target-file-size-bytes&amp;quot;, 134217728) // 128 MB
  .execute()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will identify and bin-pack small files across partitions, replacing them with larger files.&lt;/p&gt;
&lt;h2&gt;Tips for Running Compaction&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Target file size:&lt;/strong&gt; Match your engine’s ideal scan size. 128MB or 256MB often work well.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Partition scope:&lt;/strong&gt; You can compact per partition to avoid touching the entire table.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Job parallelism:&lt;/strong&gt; Tune parallelism to handle large volumes efficiently.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Avoid overlap:&lt;/strong&gt; If streaming ingestion is running, compaction jobs should avoid writing to the same partitions concurrently (we’ll cover this in Part 3).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;When Should You Run It?&lt;/h2&gt;
&lt;p&gt;That depends on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ingestion frequency:&lt;/strong&gt; Frequent writes = more small files = more frequent compaction&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Query behavior:&lt;/strong&gt; If queries touch recently ingested data, compact often&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Table size and storage costs:&lt;/strong&gt; The larger the table, the more benefit from compaction&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In many cases, a daily or hourly schedule works well. Some platforms support event-driven compaction based on file count or size thresholds.&lt;/p&gt;
&lt;h2&gt;Tradeoffs&lt;/h2&gt;
&lt;p&gt;While compaction boosts performance, it also:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Consumes compute and I/O resources&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Temporarily increases storage (until old files are expired)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Can interfere with concurrent writes if not carefully scheduled&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That’s why timing and scope matter: a theme we’ll return to later in this series.&lt;/p&gt;
&lt;h2&gt;Up Next&lt;/h2&gt;
&lt;p&gt;Now that you understand standard compaction, the next challenge is applying it without interrupting streaming workloads. In Part 3, we’ll explore techniques to make compaction faster, safer, and more incremental for real-time pipelines.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The Cost of Neglect  – How Apache Iceberg Tables Degrade Without Optimization</title><link>https://iceberglakehouse.com/posts/iceberg-optimization-degradation/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-optimization-degradation/</guid><description>
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;utm_m...</description><pubDate>Tue, 15 Jul 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;The Cost of Neglect : How Apache Iceberg Tables Degrade Without Optimization&lt;/h1&gt;
&lt;p&gt;Apache Iceberg offers powerful features for managing large-scale datasets with reliability, versioning, and schema evolution. But like any robust system, Iceberg tables require care and maintenance. Without ongoing optimization, even the most well-designed Iceberg table can degrade - causing query slowdowns, ballooning metadata, and rising infrastructure costs.&lt;/p&gt;
&lt;p&gt;This post kicks off a 10-part series on &lt;strong&gt;Apache Iceberg Table Optimization&lt;/strong&gt;, beginning with a look at &lt;em&gt;what happens when you don’t optimize&lt;/em&gt; and why it matters.&lt;/p&gt;
&lt;h2&gt;Why Do Iceberg Tables Degrade?&lt;/h2&gt;
&lt;p&gt;At its core, Iceberg uses a &lt;strong&gt;table metadata layer&lt;/strong&gt; to track the location and structure of physical files (data files, manifests, and manifest lists). Over time, various ingestion patterns: batch loads, streaming micro-batches, late-arriving records, can lead to an accumulation of inefficiencies:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Small Files Problem&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Each write operation typically creates a new data file. In streaming or frequent ingestion pipelines, this can lead to thousands of tiny files that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Increase the number of file system operations during scans&lt;/li&gt;
&lt;li&gt;Reduce the effectiveness of predicate pushdown and pruning&lt;/li&gt;
&lt;li&gt;Add overhead to table metadata (larger manifest files)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. &lt;strong&gt;Fragmented Manifests&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Each new snapshot creates new manifest files. If the same files appear in many manifests or are not compacted, snapshot metadata becomes expensive to read and maintain.&lt;/p&gt;
&lt;h3&gt;3. &lt;strong&gt;Bloated Snapshots&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Iceberg maintains a full history of table snapshots unless explicitly expired. Over time, this bloats the metadata layer with obsolete entries:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Slows down time travel and rollback operations&lt;/li&gt;
&lt;li&gt;Inflates table size even if the data volume is static&lt;/li&gt;
&lt;li&gt;Consumes storage and memory unnecessarily&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. &lt;strong&gt;Unclustered or Unsorted Data&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Without explicit clustering or sort order, files may be written in a way that scatters relevant records across multiple files. This leads to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Increased scan ranges and data reads during filtering&lt;/li&gt;
&lt;li&gt;Poor locality for analytical queries&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. &lt;strong&gt;Partition Imbalance&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;When partitions grow at uneven rates, you may end up with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Some partitions containing massive files&lt;/li&gt;
&lt;li&gt;Others being overloaded with small files&lt;/li&gt;
&lt;li&gt;Query planning bottlenecks on overgrown partitions&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What Are the Consequences?&lt;/h2&gt;
&lt;p&gt;These degradations manifest as tangible issues across your data platform:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Performance Hits:&lt;/strong&gt; Query scans take longer and use more compute resources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Higher Costs:&lt;/strong&gt; More files and metadata inflate cloud storage bills and increase query processing cost in engines like Dremio, Trino, or Spark.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Longer Maintenance Windows:&lt;/strong&gt; Snapshot expiration, schema evolution, and compaction become more expensive over time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reduced Freshness and Responsiveness:&lt;/strong&gt; Particularly in streaming use cases, lag builds up if optimizations are not happening incrementally.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What Causes This Degradation?&lt;/h2&gt;
&lt;p&gt;Most of these issues stem from a lack of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Regular &lt;strong&gt;compaction&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Snapshot and metadata &lt;strong&gt;cleanup&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Monitoring table &lt;strong&gt;health metrics&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Clustering and layout optimization&lt;/strong&gt; during writes&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Looking Ahead&lt;/h2&gt;
&lt;p&gt;The good news is that Apache Iceberg provides powerful tools to fix these issues - with the right strategy. In the next posts, we’ll break down each optimization method, starting with standard compaction and how to implement it effectively.&lt;/p&gt;
&lt;p&gt;Stay tuned for Part 2: &lt;strong&gt;The Basics of Compaction : Bin Packing Your Data for Efficiency&lt;/strong&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Discover or Organize Lakehouse &amp; Apache Iceberg Meetups</title><link>https://iceberglakehouse.com/posts/2025-07-discovering-or-organizing-lakehouse-iceberg-meetups/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-07-discovering-or-organizing-lakehouse-iceberg-meetups/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2025-07-discovering-or-organizing-lakeho...</description><pubDate>Thu, 03 Jul 2025 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2025-07-discovering-or-organizing-lakehouse-iceberg-meetups/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse-meetups&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse-meetups&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse-meetups&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Planning a meetup around Apache Iceberg or modern data lakehouse architectures? Whether you&apos;re looking to host your first community event or expand your existing network, discovering and organizing meetups can be both rewarding and impactful. These gatherings offer an opportunity to connect with other data professionals, share best practices, and explore cutting-edge tools and architectures. In this blog, we&apos;ll explore how to find and collaborate with existing data communities, discover upcoming Iceberg and lakehouse-related events, and provide tips on organizing your own meetup. We&apos;ll also share links to online communities, tools, and platforms to help you build momentum around your event and grow your local or virtual data community.&lt;/p&gt;
&lt;h1&gt;Step 1: Join the Related Communities&lt;/h1&gt;
&lt;p&gt;Slack communities for different lakehouse communities are going to be one of the best places to find people to collaborate with. In certain communities there are dedicated channels for meetups that make easier to discover people looking to collaborate in your area&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://join.slack.com/t/dataeventssla-pnp1776/shared_invite/zt-38vgrooy9-U9ral_gr3NAz_Siih1QwmQ&quot;&gt;Data Events Slack Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://join.slack.com/t/thedatalakehousehub/shared_invite/zt-274yc8sza-mI2zhCW8LGkOh1uxuf8T5Q&quot;&gt;The Data Lakehouse Hub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://iceberg.apache.org/community/&quot;&gt;Apache Iceberg Slack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://polaris.apache.org/community/&quot;&gt;The Apache Polaris Slack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hudi.apache.org/community/get-involved&quot;&gt;The Apache Hudi Slack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://delta.io/community/&quot;&gt;Delta Lake Slack&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Step 2: Where to collaborate&lt;/h1&gt;
&lt;p&gt;A good pattern to use is to create a meetup channel if it doesn&apos;t already exist for your area like &lt;code&gt;#meetup-atlanta&lt;/code&gt; and then invite people to join the channel to collaborate on local meetups.&lt;/p&gt;
&lt;h3&gt;Data Events Slack Community&lt;/h3&gt;
&lt;p&gt;The Data Events Slack Community is a great place to find people to collaborate with. Here are the existing meetup channels in the Data Events Slack Community:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;meetup-argentina&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-australia&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-brazil&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-california&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-canada&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-chile&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-china&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-colombia&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-colorado&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-egypt&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-florida&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-france&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-georgia&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-germany&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-illinois&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-india&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-ireland&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-israel&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-japan&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-massachusetts&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-mexico&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-netherlands&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-newyork&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-northcarolina&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-singapore&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-southafrica&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-southkorea&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-sweden&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-texas&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-uk&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-utah&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-washington&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Iceberg Slack&lt;/h3&gt;
&lt;p&gt;Currently in the Apache Iceberg Slack Workspace the following Channels Exist:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;meetup-atlanta&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-austin&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-bayarea&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-boston&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-chicago&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-denver&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-nola&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-orlando&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-seattle&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;note:&lt;/strong&gt; There are no other channels for other cities as the ability to make channels was turned off in the Iceberg Slack, my suggestion is make the channel in the Data Lakehouse Hub slack.&lt;/p&gt;
&lt;h3&gt;Data Lakehouse Hub Slack&lt;/h3&gt;
&lt;p&gt;Here are the existing meetup channels in the Data Lakehouse Hub Slack:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;meetup-atlanta&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-austin&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-barcelona&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-boston&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-chicago&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-denver&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-london&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-miami&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-munich&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-nyc&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-nola&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-orlando&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-san-francisco&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-santa-clara&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-seattle&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Apache Polaris Slack&lt;/h1&gt;
&lt;p&gt;There is a &lt;code&gt;#meetup-attendee&lt;/code&gt; and &lt;code&gt;#meetup-organizer&lt;/code&gt; channel in the Apache Polaris Slack along with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;#meetup-nyc-austin-boston-atlanta&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;#meetup-sanfran-seattle-denver-chicago&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Step 3: Find or Propose Events&lt;/h1&gt;
&lt;p&gt;By reading these channels you should be able to discover upcoming Iceberg and lakehouse-related events in your area. If you want to organize an event you can propose an event and see who would want to collaborate in organizing the event.&lt;/p&gt;
&lt;h1&gt;Step 4: Organize the Event&lt;/h1&gt;
&lt;h3&gt;Naming Your Event&lt;/h3&gt;
&lt;p&gt;Simplest way to organize your event is under the name &lt;code&gt;X Lakehouse Meetup&lt;/code&gt; where &lt;code&gt;X&lt;/code&gt; is the city or region and you can run the meetup any way you like. For example, &lt;code&gt;Atlanta Lakehouse Meetup&lt;/code&gt;. But if you want to use a name like &lt;code&gt;Atlanta Apache Iceberg Meetup&lt;/code&gt; you can do that but need to follow &lt;a href=&quot;https://lists.apache.org/thread/ls2rg4xcwk9hnhtotor5f9xsrbdknw1s&quot;&gt;recently approved guidelines&lt;/a&gt; for doing so to avoid trademark issues with the Apache Software Foundation.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Apache Iceberg should be championed&lt;/strong&gt; in every meetup &lt;em&gt;and&lt;/em&gt; technical
session (after all, we&apos;re here to support this technology and our community)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;All talks should be vendor-neutral&lt;/strong&gt; and not sales pitches (of course
vendors can be mentioned, but that should never be the point of the talk)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Each meetup should have &lt;em&gt;at least&lt;/em&gt; two talks&lt;/strong&gt; with speakers
representing different companies/organizations (we need to champion
diversity of thought)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Planned meetups ought to be brought to the attention of the dev list&lt;/strong&gt;
(this is to promote transparency and raise awareness)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These rules include having an open call for speakers prior to the event and decided on the speakers among all event sponsors (and allow others to sponsor the event if they want to).&lt;/p&gt;
&lt;h3&gt;Organizing the Event&lt;/h3&gt;
&lt;p&gt;Essentially you have three main costs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Venue&lt;/li&gt;
&lt;li&gt;Drinks&lt;/li&gt;
&lt;li&gt;Food&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So soliciting people to co-sponsor the event either by sharing the costs of these things or having different sponsorts pay for different things is a good way to organize the event.&lt;/p&gt;
&lt;p&gt;All contriuting sponsors should have their logos on the event promotion. You&apos;ll want all these details squared away to allow at least 2 weeks of promotion before the event if not more.&lt;/p&gt;
&lt;h3&gt;Promoting the Event&lt;/h3&gt;
&lt;p&gt;You should first either create a meetup or lu.ma listing for the event. For Apache Iceberg meetups there are community run outlets to post your event.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://lu.ma/apache-iceberg?k=c&quot;&gt;Apache Iceberg Meetups Luma Calendar&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.meetup.com/na-apache-iceberg-meetups/?eventOrigin=home_groups_you_organize&quot;&gt;North America Community Run Apache Iceberg Meetups&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here are some other Luma Calendars and Meetup Groups you may want to follow for Lakehouse Events:&lt;/p&gt;
&lt;h5&gt;Meetup Groups&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.meetup.com/north-american-open-data-lakehouse-linkups/?eventOrigin=home_groups_you_organize&quot;&gt;North American Open Lakehouse Linkups&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.meetup.com/iceberg-data-lakehouse-meetups/?eventOrigin=home_groups_you_organize&quot;&gt;Open Lakehouse Meetups&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h5&gt;Luma Calendars&lt;/h5&gt;
&lt;p&gt;Message calendars@datalakehousehub.com to get your event added to these calendars, include link to Luma or Meetup event listing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://lu.ma/datalakehousemeetupsinternational?k=c&amp;amp;period=past&quot;&gt;Data Lakehouse Meetups International&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lu.ma/eastcoastuslakehousemeetups?k=c&amp;amp;period=past&quot;&gt;East Cost US Open Lakehouse Events&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lu.ma/westcoastlakehouse?k=c&amp;amp;period=past&quot;&gt;West Coast US Open Lakehouse Events&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lu.ma/Lakehouselinkups?k=c&quot;&gt;Lakehouse Linkups&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lu.ma/NYCDataLakehouse?k=c&quot;&gt;NYC Data Lakehouse Events&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lu.ma/Orlandodata?k=c&amp;amp;period=past&quot;&gt;Orlando Data Lakehouse Events&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h5&gt;Social Media&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;Make sure everyone involved is posting about the event on linkedin, twitter and blue sky.&lt;/li&gt;
&lt;/ul&gt;
&lt;h5&gt;Emails&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;Sponsors should send emails about the event to their lists if they can, use Luma to email attendees to remind them about the event 7 days, 24 hours and 2 hours before the event with any logistics details they should know. Offering each sponsor a link in these emails to a related blog or asset is a good idea.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Bringing together the Lakehouse and Apache Iceberg community through meetups is one of the most effective ways to foster collaboration, share knowledge, and build meaningful relationships across organizations and regions. Whether you&apos;re organizing your first meetup or joining an existing one, the open and welcoming nature of these communities makes it easy to get involved. By leveraging platforms like Slack, Luma, and Meetup, and by following best practices for organizing inclusive and impactful events, you can help grow the ecosystem and play a key role in advancing open data architectures. So jump into a meetup channel, connect with others, and start planning : your community is waiting.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>What is an API? And Why Data Architecture Depends on Them</title><link>https://iceberglakehouse.com/posts/2025-06-what-is-an-api/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-06-what-is-an-api/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Mon, 23 Jun 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=what-is-an-api&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=what-is-an-api&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=what-is-an-api&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Imagine walking into a restaurant in a foreign country where you don’t speak the language. You point at things, gesture wildly, maybe even draw pictures : anything to communicate what you want. But if you and the server spoke a common language like English or Spanish, things would go a lot smoother.&lt;/p&gt;
&lt;p&gt;That’s exactly what APIs do for software systems. They are shared languages that define how software components talk to each other. Without a shared API, systems can&apos;t collaborate easily, leading to miscommunication, friction, or total breakdown.&lt;/p&gt;
&lt;p&gt;In this post, we&apos;ll unpack what APIs are and why they’re critical in data architecture. We&apos;ll explore the different types of APIs, how they&apos;ve shaped modern data workflows, and the standards that have emerged in key areas like storage, data transport, and cataloging. Whether you&apos;re a developer building integrations or a data architect planning your stack, understanding these APIs is essential for navigating today&apos;s complex data ecosystem.&lt;/p&gt;
&lt;h2&gt;What is an API?&lt;/h2&gt;
&lt;p&gt;An API, or Application Programming Interface, is like a contract that defines how different software components can interact. Think of it as a language specification : if two programs speak the same API, they can communicate effectively, even if they&apos;re written in different languages or run on different platforms.&lt;/p&gt;
&lt;p&gt;Just like a language has rules for grammar and vocabulary, an API defines the rules for how requests are made, what data is expected, and how responses are structured. When software follows these rules, integration becomes smooth and predictable.&lt;/p&gt;
&lt;p&gt;It&apos;s important to recognize that the term &amp;quot;API&amp;quot; can mean different things depending on context:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In software development, an API can refer to the functions and methods exposed by a library or class. If one class implements the same method signatures as another, it can serve as a drop-in replacement.&lt;/li&gt;
&lt;li&gt;In system integration, APIs more commonly refer to how different applications or services communicate over a network, especially using HTTP. This includes how data is sent, what endpoints exist, and how authentication is handled.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In essence, APIs enable modularity and collaboration in software. They allow teams to build components independently, knowing they can connect through a well-defined interface.&lt;/p&gt;
&lt;h2&gt;The Four Horsemen of HTTP APIs&lt;/h2&gt;
&lt;p&gt;When most people talk about APIs in modern software systems, they’re usually referring to HTTP-based APIs : interfaces that allow software to communicate over the web or internal networks. Over time, four main styles of HTTP APIs have emerged, each with its own strengths and trade-offs.&lt;/p&gt;
&lt;h3&gt;1. SOAP (Simple Object Access Protocol)&lt;/h3&gt;
&lt;p&gt;SOAP is a protocol-based API style that uses XML to encode messages and enforces strict standards for how messages are structured. It includes built-in specifications for things like security and error handling. While powerful, SOAP is often seen as heavyweight and complex, which has led to a decline in its use for most new applications.&lt;/p&gt;
&lt;h3&gt;2. REST (Representational State Transfer)&lt;/h3&gt;
&lt;p&gt;REST is more lightweight and flexible. It uses standard HTTP methods like GET, POST, PUT, and DELETE to perform operations on resources, which are identified via URLs. REST APIs are stateless, meaning each request contains all the information needed to process it. REST&apos;s simplicity and widespread adoption have made it the go-to style for many web services.&lt;/p&gt;
&lt;h3&gt;3. RPC (Remote Procedure Call)&lt;/h3&gt;
&lt;p&gt;RPC is all about invoking functions remotely. Instead of thinking in terms of resources, you think in terms of actions : like calling a method named &lt;code&gt;getUserDetails&lt;/code&gt;. RPC can use different serialization formats (like JSON-RPC or gRPC) and tends to be more efficient for certain tasks, especially internal service communication.&lt;/p&gt;
&lt;h3&gt;4. GraphQL&lt;/h3&gt;
&lt;p&gt;GraphQL allows clients to request exactly the data they need and nothing more. Instead of multiple endpoints, there’s typically a single endpoint that interprets a query language. This can reduce over-fetching and under-fetching of data and provides a more dynamic interface, especially useful for frontend applications.&lt;/p&gt;
&lt;p&gt;Each of these API types has its place in the ecosystem. Understanding their differences helps you pick the right tool for the job depending on complexity, flexibility, and performance needs.&lt;/p&gt;
&lt;h2&gt;Why APIs Matter in Modern Data Architecture&lt;/h2&gt;
&lt;p&gt;The modern data stack is a vibrant and diverse ecosystem. From ingestion tools and storage layers to transformation engines and visualization platforms, each component often comes from a different vendor or open-source project. The glue that holds this ecosystem together is the API.&lt;/p&gt;
&lt;p&gt;With so many tools available, the ability to integrate them seamlessly becomes a competitive advantage. Instead of reinventing the wheel, software platforms that adopt well-known APIs can plug into existing workflows and leverage established tooling. This interoperability allows teams to mix and match components without being locked into a single vendor or technology stack.&lt;/p&gt;
&lt;p&gt;For example, if two different tools both understand the same API for reading from a data catalog or writing to object storage, they can work together out of the box. This eliminates the need for custom connectors or fragile workarounds.&lt;/p&gt;
&lt;p&gt;APIs also encourage specialization. A tool can focus on doing one thing well :  like cataloging metadata or transporting data ,  and expose an API that others can build upon. This modularity is what makes today&apos;s data architectures more flexible and scalable than ever before.&lt;/p&gt;
&lt;p&gt;In short, APIs are the foundation of composability in data systems. They allow different parts of the stack to evolve independently while still working together in harmony.&lt;/p&gt;
&lt;h2&gt;Case Study – The Ubiquity of the S3 API&lt;/h2&gt;
&lt;p&gt;Amazon S3 wasn&apos;t just a game changer because it offered scalable cloud storage. It also introduced a clean, consistent API that made storing and retrieving objects over the web straightforward. This API became so widely adopted that it evolved into a de facto standard for cloud object storage.&lt;/p&gt;
&lt;p&gt;As other cloud providers and storage platforms emerged, they faced a choice: create their own APIs or adopt the S3 API. Many chose the latter. Why? Because the S3 API already had a massive ecosystem of integrations. Backup tools, data lakes, ETL pipelines, and analytics platforms already knew how to talk to S3. By supporting the S3 API, new storage services could plug into these tools without requiring any custom development.&lt;/p&gt;
&lt;p&gt;This is a powerful example of how API adoption fuels interoperability. Instead of forcing users to learn a new interface or rebuild their workflows, S3-compatible services ride the wave of existing infrastructure. As a result, users get flexibility and choice without sacrificing compatibility.&lt;/p&gt;
&lt;p&gt;The takeaway: when an API reaches critical mass, it becomes more than a technical interface : it becomes an ecosystem enabler.&lt;/p&gt;
&lt;h2&gt;Data Transport APIs – From JDBC/ODBC to ADBC&lt;/h2&gt;
&lt;p&gt;Moving data between systems has always been a core challenge in data architecture. For decades, the standard approach involved using JDBC (Java Database Connectivity) and ODBC (Open Database Connectivity). These APIs allowed applications to connect to relational databases in a consistent way, abstracting the underlying database-specific protocols.&lt;/p&gt;
&lt;p&gt;While JDBC and ODBC have served well, they come with limitations. These APIs were designed for transactional systems and row-based data access. As analytics workloads became more complex and data volumes grew, these traditional interfaces began to show performance bottlenecks.&lt;/p&gt;
&lt;p&gt;It’s also important to note that JDBC and ODBC are not HTTP-based APIs. They operate over lower-level network protocols tailored to database drivers and client libraries. This can make them harder to integrate in cloud-native or language-agnostic environments.&lt;/p&gt;
&lt;p&gt;Enter ADBC (Arrow Database Connectivity), a modern alternative designed for analytical use cases. ADBC builds on Arrow Flight, which is a gRPC-based protocol optimized for high-throughput data transport. Instead of transferring rows one by one, Arrow Flight sends columnar batches over a persistent connection, dramatically improving efficiency for analytical queries.&lt;/p&gt;
&lt;p&gt;With ADBC, the API is designed for today’s needs: fast, language-agnostic, and cloud-friendly. It embraces open standards like Apache Arrow and gRPC to deliver performance without sacrificing interoperability.&lt;/p&gt;
&lt;p&gt;As analytics platforms grow more distributed and data-hungry, APIs like ADBC represent a forward-looking approach to data transport : one that matches the scale and speed of modern data systems.&lt;/p&gt;
&lt;h2&gt;Data Catalog APIs – Hive, Glue, and Iceberg REST&lt;/h2&gt;
&lt;p&gt;Lakehouse Data catalogs store metadata about datasets :  such as schema, location, and partitioning ,  allowing tools to discover and manage data assets consistently. But for this ecosystem to function, catalogs need APIs that other tools can understand.&lt;/p&gt;
&lt;p&gt;Three primary catalog APIs have emerged in the lakehouse and analytics space:&lt;/p&gt;
&lt;h3&gt;1. Hive Metastore API&lt;/h3&gt;
&lt;p&gt;The Hive API was one of the earliest standards for metadata management in Hadoop-based systems. Because Apache Hive gained significant adoption early on, its metastore API became widely supported. Even tools that don’t use Hive for querying often support its API for interoperability.&lt;/p&gt;
&lt;h3&gt;2. AWS Glue Catalog API&lt;/h3&gt;
&lt;p&gt;As AWS became a dominant platform for cloud-native analytics, its Glue Catalog gained traction. Glue offered a managed alternative to Hive with cloud-native scalability and tight integration with AWS services. Many tools added support for Glue to integrate seamlessly within AWS ecosystems.&lt;/p&gt;
&lt;h3&gt;3. Apache Iceberg REST Catalog API&lt;/h3&gt;
&lt;p&gt;The Iceberg project initially struggled with catalog integration due to varying implementations. To solve this, the community introduced a REST-based catalog API that standardizes how tools interact with Iceberg catalogs regardless of the underlying backend. This REST interface provides a clear contract and enables broader compatibility. Catalogs that support the Iceberg REST Catalog (IRC) API include Apache Polaris (incubating), Apache Gravitino, Dremio Catalog, Open Catalog, AWS Glue Catalog, Lakekeeper, Nessie, Unity Catalog and many more. Most specialized Iceberg tooling uses this as the main catalog API for discovering your Apache Iceberg datasets while catalogs like Polaris, Gravitino and Unity also adopt other APIs to make additional datasets discoverable.&lt;/p&gt;
&lt;p&gt;Today, most lakehouse tools support one or more of these APIs to ensure compatibility across different environments. Whether you&apos;re working with on-prem systems using Hive, cloud-native stacks using Glue, or modern lakehouse engines built around Iceberg, API adoption remains the key to ecosystem integration.&lt;/p&gt;
&lt;p&gt;Choosing catalog tools that support these APIs ensures you&apos;re building on a foundation that promotes interoperability, flexibility, and future-proofing.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;APIs are more than just technical interfaces : they are the connective tissue of modern software. In data architecture, where tools span a wide range of functions and vendors, APIs enable these components to work together smoothly.&lt;/p&gt;
&lt;p&gt;We’ve seen how APIs act like shared languages, allowing software to communicate efficiently. From foundational HTTP-based APIs like REST and GraphQL, to specialized data interfaces like the S3 API, JDBC, ADBC, and various catalog APIs, each plays a role in shaping the data landscape.&lt;/p&gt;
&lt;p&gt;By adopting established APIs, tools become more compatible, easier to integrate, and more valuable within the broader ecosystem. And for data teams, aligning on common APIs means less time wrestling with custom connectors and more time delivering insights.&lt;/p&gt;
&lt;p&gt;As the data world continues to evolve, understanding and leveraging key APIs is essential. They’re not just part of the plumbing : they’re a strategic asset for building robust, scalable, and flexible data systems.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Decoding AWS EC2 Instance Type Names</title><link>https://iceberglakehouse.com/posts/2025-06-AWS-Instance-Types/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-06-AWS-Instance-Types/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Wed, 18 Jun 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;If you&apos;ve ever browsed AWS EC2 instance types and found yourself staring blankly at names like &lt;code&gt;m5.large&lt;/code&gt;, &lt;code&gt;c6g.xlarge&lt;/code&gt;, or &lt;code&gt;r7a.2xlarge&lt;/code&gt;, you&apos;re not alone. At first glance, these names can feel cryptic - like trying to decode a secret code.&lt;/p&gt;
&lt;p&gt;But here&apos;s the good news: there&apos;s a method to the madness. Each part of an instance type name tells you something important about the underlying hardware, performance characteristics, and intended use case.&lt;/p&gt;
&lt;p&gt;In this blog post, we&apos;ll break down the structure of AWS instance type names and show you how to read them like a pro. Once you understand how to interpret each component, you&apos;ll be able to confidently choose the right instance for your workload - and maybe even impress your colleagues with your cloud fluency.&lt;/p&gt;
&lt;h2&gt;The Anatomy of an Instance Type&lt;/h2&gt;
&lt;p&gt;Every AWS EC2 instance type name is composed of distinct parts that reveal critical details about the instance&apos;s capabilities. The general structure looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[family][generation][optional suffix].[size]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s take the instance type &lt;code&gt;c6g.large&lt;/code&gt; as an example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;c&lt;/code&gt; → &lt;strong&gt;Compute optimized&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;6&lt;/code&gt; → &lt;strong&gt;6th generation hardware&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;g&lt;/code&gt; → &lt;strong&gt;Powered by AWS Graviton (ARM-based processor)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;large&lt;/code&gt; → &lt;strong&gt;Medium-sized instance (typically 2 vCPUs and 4 GB RAM)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By understanding what each segment means, you can quickly assess whether an instance is optimized for compute, memory, storage, or GPU, and how big or powerful it is.&lt;/p&gt;
&lt;p&gt;In the sections below, we’ll walk through each part of the name in more detail.&lt;/p&gt;
&lt;h2&gt;Family – What Is the Instance Optimized For?&lt;/h2&gt;
&lt;p&gt;The first letter (or set of letters) in an instance type indicates the &lt;strong&gt;instance family&lt;/strong&gt;, which tells you what the instance is optimized for. This helps guide your choice based on the nature of your workload - whether you need general-purpose performance, high CPU, large memory, or GPU acceleration.&lt;/p&gt;
&lt;p&gt;Here’s a quick overview of the most common instance families:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Family&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Common Use Cases&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;t&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Burstable general purpose&lt;/td&gt;
&lt;td&gt;Development, low-traffic websites&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;m&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;General purpose&lt;/td&gt;
&lt;td&gt;Balanced CPU and memory workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;c&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Compute optimized&lt;/td&gt;
&lt;td&gt;High-performance computing, batch processing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;r&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Memory optimized&lt;/td&gt;
&lt;td&gt;In-memory databases, real-time analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Extra memory optimized&lt;/td&gt;
&lt;td&gt;SAP HANA, memory-intensive enterprise apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;i&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Storage optimized (high IOPS)&lt;/td&gt;
&lt;td&gt;NoSQL databases, large transactional systems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;g&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;GPU instances&lt;/td&gt;
&lt;td&gt;Machine learning, video rendering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;p&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;High-performance GPU&lt;/td&gt;
&lt;td&gt;Deep learning training, scientific modeling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;h&lt;/code&gt;, &lt;code&gt;d&lt;/code&gt;, &lt;code&gt;z&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Specialized families&lt;/td&gt;
&lt;td&gt;Varies (HPC, local storage, high-frequency)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Understanding the family is the first step in selecting the right instance. For example, if your application is CPU-bound, a &lt;code&gt;c&lt;/code&gt; family instance will typically deliver better performance per dollar than an &lt;code&gt;m&lt;/code&gt; or &lt;code&gt;t&lt;/code&gt; instance.&lt;/p&gt;
&lt;h2&gt;Generation – How New Is the Hardware?&lt;/h2&gt;
&lt;p&gt;The number immediately following the family letter represents the &lt;strong&gt;generation&lt;/strong&gt; of the instance. AWS continuously improves its infrastructure, and newer generations typically offer better performance, energy efficiency, and cost-effectiveness compared to older ones.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;m4&lt;/code&gt; → 4th generation general-purpose instance&lt;/li&gt;
&lt;li&gt;&lt;code&gt;m5&lt;/code&gt; → Newer 5th generation version&lt;/li&gt;
&lt;li&gt;&lt;code&gt;m6g&lt;/code&gt; → 6th generation with Graviton (ARM-based processor)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Why It Matters:&lt;/h3&gt;
&lt;p&gt;Choosing a newer generation instance usually means access to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Improved CPUs (e.g., Intel Ice Lake, AMD EPYC, or AWS Graviton)&lt;/li&gt;
&lt;li&gt;Better network and storage throughput&lt;/li&gt;
&lt;li&gt;Lower cost for similar or better performance&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That said, not all regions have the latest generation available. Always check your region’s instance offerings and benchmark critical workloads if performance is a top priority.&lt;/p&gt;
&lt;h2&gt;Suffix – Special Chips or Capabilities&lt;/h2&gt;
&lt;p&gt;Some instance types include an optional &lt;strong&gt;suffix&lt;/strong&gt;: a letter (or combination of letters) that provides additional detail about the instance’s hardware or features. These suffixes appear immediately after the generation number and can help you identify special variants optimized for particular use cases.&lt;/p&gt;
&lt;h3&gt;Common Suffixes and What They Mean:&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Suffix&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;a&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;AMD EPYC processor&lt;/td&gt;
&lt;td&gt;Cost-effective alternative to Intel-based instances&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;g&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;AWS Graviton processor (ARM-based)&lt;/td&gt;
&lt;td&gt;Energy-efficient, high performance, lower cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;n&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Network-optimized&lt;/td&gt;
&lt;td&gt;Enhanced network bandwidth and performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;d&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Includes local NVMe storage&lt;/td&gt;
&lt;td&gt;Fast local instance storage for low-latency workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;e&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Extended memory or enhanced features&lt;/td&gt;
&lt;td&gt;More memory or improved capabilities per vCPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;z&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;High-frequency Intel CPUs&lt;/td&gt;
&lt;td&gt;For workloads that need very high clock speed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Example:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;r6a&lt;/code&gt; → Memory optimized (r), 6th generation, AMD processor (a)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;m6g&lt;/code&gt; → General purpose (m), 6th generation, Graviton processor (g)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;i3d&lt;/code&gt; → Storage optimized (i), 3rd generation, with NVMe instance store (d)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These suffixes allow you to fine-tune your instance selection based on price, performance, or architecture preferences - especially important if your software is architecture-sensitive (e.g., x86 vs ARM).&lt;/p&gt;
&lt;h2&gt;Size – How Big Is the Instance?&lt;/h2&gt;
&lt;p&gt;The part of the instance type that comes &lt;strong&gt;after the period (&lt;code&gt;.&lt;/code&gt;)&lt;/strong&gt; defines the &lt;strong&gt;size&lt;/strong&gt; of the instance. This determines how many vCPUs, how much memory, and sometimes how much networking or storage bandwidth is allocated.&lt;/p&gt;
&lt;p&gt;AWS uses consistent naming for sizes across instance families:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Typical vCPUs&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.nano&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Very small&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;For ultra-light workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.micro&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Small&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Entry-level, burstable performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.small&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Modest&lt;/td&gt;
&lt;td&gt;1–2&lt;/td&gt;
&lt;td&gt;Slightly more consistent CPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.medium&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Standard&lt;/td&gt;
&lt;td&gt;1–2&lt;/td&gt;
&lt;td&gt;Balanced for small apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.large&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2x baseline&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Common for dev/test workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.xlarge&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4x baseline&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Heavier compute or memory needs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.2xlarge&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8x baseline&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Medium to large production loads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.4xlarge&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;16x baseline&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;High-capacity apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.8xlarge&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;32x baseline&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;Data processing, analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.12xlarge&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;48x baseline&lt;/td&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;td&gt;High-scale enterprise workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.24xlarge&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;96x baseline&lt;/td&gt;
&lt;td&gt;96&lt;/td&gt;
&lt;td&gt;Very high-performance computing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.metal&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Bare metal (no hypervisor)&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;td&gt;Full access to physical server&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Example:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;m5.large&lt;/code&gt; = General-purpose instance, 5th generation, with 2 vCPUs and 8 GB memory.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;c6g.4xlarge&lt;/code&gt; = Compute optimized, 6th gen, Graviton processor, with 16 vCPUs and 32 GB memory.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Choosing the right size allows you to scale &lt;strong&gt;vertically&lt;/strong&gt; by increasing resources within a single instance, or &lt;strong&gt;horizontally&lt;/strong&gt; by adding more instances of a smaller size depending on your architecture and cost goals.&lt;/p&gt;
&lt;h2&gt;Pulling It All Together&lt;/h2&gt;
&lt;p&gt;Now that you understand each component: &lt;strong&gt;family&lt;/strong&gt;, &lt;strong&gt;generation&lt;/strong&gt;, &lt;strong&gt;suffix&lt;/strong&gt;, and &lt;strong&gt;size&lt;/strong&gt;, you can decode any EC2 instance type and understand exactly what it offers.&lt;/p&gt;
&lt;p&gt;Let’s break down a few examples to reinforce what you’ve learned:&lt;/p&gt;
&lt;h3&gt;🔹 Example 1: &lt;code&gt;c6g.large&lt;/code&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;c&lt;/code&gt; → Compute optimized&lt;/li&gt;
&lt;li&gt;&lt;code&gt;6&lt;/code&gt; → 6th generation&lt;/li&gt;
&lt;li&gt;&lt;code&gt;g&lt;/code&gt; → AWS Graviton (ARM-based processor)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;large&lt;/code&gt; → Medium-sized (2 vCPUs, ~4 GB RAM)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use case:&lt;/strong&gt; Great for compute-heavy applications running on ARM, like containerized services or microservices at scale.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;🔹 Example 2: &lt;code&gt;r5d.4xlarge&lt;/code&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;r&lt;/code&gt; → Memory optimized&lt;/li&gt;
&lt;li&gt;&lt;code&gt;5&lt;/code&gt; → 5th generation&lt;/li&gt;
&lt;li&gt;&lt;code&gt;d&lt;/code&gt; → Includes local NVMe SSD instance store&lt;/li&gt;
&lt;li&gt;&lt;code&gt;4xlarge&lt;/code&gt; → 16 vCPUs and 128 GB RAM&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use case:&lt;/strong&gt; Ideal for high-throughput, in-memory databases or data processing that benefits from fast local storage.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;🔹 Example 3: &lt;code&gt;m7a.xlarge&lt;/code&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;m&lt;/code&gt; → General purpose&lt;/li&gt;
&lt;li&gt;&lt;code&gt;7&lt;/code&gt; → 7th generation&lt;/li&gt;
&lt;li&gt;&lt;code&gt;a&lt;/code&gt; → AMD EPYC processor&lt;/li&gt;
&lt;li&gt;&lt;code&gt;xlarge&lt;/code&gt; → 4 vCPUs, 16 GB RAM&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use case:&lt;/strong&gt; Balanced workloads where cost-effectiveness is important, such as web applications or business logic layers.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Understanding how to read these names makes it easier to compare instance types, choose the best fit for your application, and avoid over-provisioning. You’ll save money, optimize performance, and build with more confidence on AWS.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | What is Data Engineering?</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-01/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-01/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Data engineering sits at the heart of modern data-driven organizations. While data science often grabs headlines with predictive models and AI, it&apos;s the data engineer who builds and maintains the infrastructure that makes all of that possible. In this first post of our series, we’ll explore what data engineering is, why it matters, and how it fits into the broader data ecosystem.&lt;/p&gt;
&lt;h2&gt;The Role of the Data Engineer&lt;/h2&gt;
&lt;p&gt;Think of a data engineer as the architect and builder of the data highways. These professionals design, construct, and maintain systems that move, transform, and store data efficiently. Their job is to ensure that data flows from various sources into data warehouses or lakes where it can be used reliably for analysis, reporting, and machine learning.&lt;/p&gt;
&lt;p&gt;In a practical sense, this means working with pipelines that connect everything from transactional databases and API feeds to large-scale storage systems. Data engineers work closely with data analysts, scientists, and platform teams to ensure the data is clean, consistent, and available when needed.&lt;/p&gt;
&lt;h2&gt;From Raw to Refined: The Journey of Data&lt;/h2&gt;
&lt;p&gt;Raw data is rarely useful as-is. It often arrives incomplete, messy, or inconsistently formatted. Data engineers are responsible for shepherding this raw material through a series of processing stages to prepare it for consumption.&lt;/p&gt;
&lt;p&gt;This involves tasks like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data ingestion (bringing data in from various sources)&lt;/li&gt;
&lt;li&gt;Data transformation (cleaning, enriching, and reshaping the data)&lt;/li&gt;
&lt;li&gt;Data storage (choosing optimal formats and storage solutions)&lt;/li&gt;
&lt;li&gt;Data delivery (ensuring end users can access data quickly and easily)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At each stage, considerations around scalability, performance, security, and governance come into play.&lt;/p&gt;
&lt;h2&gt;Data Engineering vs Data Science&lt;/h2&gt;
&lt;p&gt;It&apos;s common to see some confusion between the roles of data engineers and data scientists. While their work is often complementary, their responsibilities are distinct.&lt;/p&gt;
&lt;p&gt;A data scientist focuses on analyzing data and building predictive models. Their tools often include Python, R, and statistical frameworks. On the other hand, data engineers build the systems that make the data usable in the first place. They are often more focused on infrastructure, system design, and optimization.&lt;/p&gt;
&lt;p&gt;In short: the data scientist asks questions; the data engineer ensures the data is ready to answer them.&lt;/p&gt;
&lt;h2&gt;A Brief History of the Data Stack&lt;/h2&gt;
&lt;p&gt;The evolution of data engineering can be seen in how the data stack has changed over time.&lt;/p&gt;
&lt;p&gt;In traditional environments, organizations relied heavily on ETL tools to move data from relational databases into on-premise warehouses. These systems were tightly controlled but not particularly flexible or scalable.&lt;/p&gt;
&lt;p&gt;With the rise of big data, open-source tools like Hadoop and Spark introduced new ways to process data at scale. More recently, cloud-native services and modern orchestration frameworks have enabled even more agility and scalability in data workflows.&lt;/p&gt;
&lt;p&gt;This evolution has led to concepts like the &lt;strong&gt;modern data stack&lt;/strong&gt; and &lt;strong&gt;data lakehouse&lt;/strong&gt; - topics we’ll cover later in this series.&lt;/p&gt;
&lt;h2&gt;Why It Matters&lt;/h2&gt;
&lt;p&gt;Every modern organization depends on data. But without a solid foundation, data becomes a liability rather than an asset. Poorly managed data can lead to flawed insights, compliance issues, and lost opportunities.&lt;/p&gt;
&lt;p&gt;Good data engineering practices ensure that data is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Accurate and timely&lt;/li&gt;
&lt;li&gt;Secure and compliant&lt;/li&gt;
&lt;li&gt;Scalable and performant&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In a world where data volumes and velocity are only increasing, the importance of data engineering will only continue to grow.&lt;/p&gt;
&lt;h2&gt;What’s Next&lt;/h2&gt;
&lt;p&gt;Now that we’ve outlined the role and importance of data engineering, the next step is to explore how data gets into a system in the first place. In the next post, we’ll dig into data sources and the ingestion process - how data flows from the outside world into your ecosystem.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Understanding Data Sources and Ingestion</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-02/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-02/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Before we can analyze, model, or visualize data, we first need to get it into our systems. This step: often taken for granted, is known as data ingestion. It’s the bridge between the outside world and the internal data infrastructure, and it plays a critical role in how data is shaped from day one.&lt;/p&gt;
&lt;p&gt;In this post, we’ll break down the types of data sources you’ll encounter, the ingestion strategies available, and what trade-offs to consider when designing ingestion workflows.&lt;/p&gt;
&lt;h2&gt;What Are Data Sources?&lt;/h2&gt;
&lt;p&gt;At its core, a data source is any origin point from which data can be extracted. These sources vary widely in structure, velocity, and complexity.&lt;/p&gt;
&lt;p&gt;Relational databases like MySQL or PostgreSQL are common sources in transactional systems. They tend to produce highly structured, row-based data and are often central to business operations such as order processing or customer management.&lt;/p&gt;
&lt;p&gt;APIs are another rich source of data, especially in modern SaaS environments. From financial data to social media feeds, APIs expose endpoints where structured (often JSON-formatted) data can be requested in real-time or on a schedule.&lt;/p&gt;
&lt;p&gt;Then there are flat files: CSV, JSON, XML, often used in data exports, logs, and external data sharing. While simple, they can carry critical context or fill gaps that structured sources miss.&lt;/p&gt;
&lt;p&gt;Sensor data, clickstreams, mobile apps, third-party tools, and message queues all add to the landscape, each bringing its own cadence and complexity.&lt;/p&gt;
&lt;h2&gt;Ingestion Strategies: Batch vs Streaming&lt;/h2&gt;
&lt;p&gt;Once you identify your sources, the next question becomes: &lt;strong&gt;how&lt;/strong&gt; will you ingest the data?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Batch ingestion&lt;/strong&gt; involves collecting data at intervals and processing it in chunks. This could be once a day, every hour, or even every minute. It&apos;s suitable for systems that don&apos;t require real-time updates and where data can afford to be a little stale. For example, nightly financial reports or end-of-day sales data.&lt;/p&gt;
&lt;p&gt;Batch processes tend to be simpler and easier to maintain. They can rely on traditional extract-transform-load (ETL) workflows and are often orchestrated using tools like Apache Airflow or simple cron jobs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Streaming ingestion&lt;/strong&gt;, on the other hand, handles data in motion. As new records are created: say, a customer clicks a link or a sensor detects a temperature change, they’re ingested immediately. This method is crucial for use cases that require low-latency or real-time processing, such as fraud detection or live recommendation engines.&lt;/p&gt;
&lt;p&gt;Apache Kafka is a popular tool for enabling streaming pipelines. It allows systems to publish and subscribe to streams of records, ensuring data flows continuously with minimal delay.&lt;/p&gt;
&lt;h2&gt;Structured, Semi-Structured, and Unstructured Data&lt;/h2&gt;
&lt;p&gt;Understanding the shape of your data also influences how you ingest it.&lt;/p&gt;
&lt;p&gt;Structured data is highly organized and fits neatly into tables. Think SQL databases or CSV files. Ingestion here often involves direct connections via JDBC drivers, SQL queries, or file uploads.&lt;/p&gt;
&lt;p&gt;Semi-structured data, like JSON or XML, has an internal structure but doesn’t conform strictly to relational models. Ingesting this data may require parsing logic and schema inference before it&apos;s usable downstream.&lt;/p&gt;
&lt;p&gt;Unstructured data includes images, videos, PDFs, and raw text. These formats typically require specialized tools and more complex handling, often involving metadata extraction or integration with machine learning models for classification or tagging.&lt;/p&gt;
&lt;h2&gt;Considerations in Designing Ingestion Pipelines&lt;/h2&gt;
&lt;p&gt;Data ingestion isn’t just about moving bytes - it’s about doing so reliably, efficiently, and with the future in mind.&lt;/p&gt;
&lt;p&gt;Latency requirements play a major role. Does the business need data as it happens, or is yesterday’s data good enough? That determines your choice between batch and streaming.&lt;/p&gt;
&lt;p&gt;Scalability is another concern. What works for 10,000 records a day might break under 10 million. Tools like Kafka and cloud-native services such as AWS Kinesis or Google Pub/Sub help handle high throughput without compromising performance.&lt;/p&gt;
&lt;p&gt;Error handling is essential. What happens if a source API goes down? What if a file arrives with missing fields? Designing retry logic, alerts, and fallback mechanisms helps ensure ingestion pipelines are robust.&lt;/p&gt;
&lt;p&gt;Finally, schema evolution can’t be overlooked. Data changes over time - columns get added, data types shift. Your ingestion pipeline must be flexible enough to adapt without breaking downstream systems.&lt;/p&gt;
&lt;h2&gt;Looking Ahead&lt;/h2&gt;
&lt;p&gt;Getting data into the system is just the beginning. Once it’s ingested, it often needs to be transformed to fit the analytical or business context.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll explore the concepts of ETL and ELT: two core paradigms for moving and transforming data, and look at how they differ in practice and purpose.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | ETL vs ELT – Understanding Data Pipelines</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-03/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-03/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once data has been ingested into your system, the next step is to prepare it for actual use. This typically involves cleaning, transforming, and storing the data in a way that supports analysis, reporting, or further processing. This is where data pipelines come in, and at the center of pipeline design are two common strategies: ETL and ELT.&lt;/p&gt;
&lt;p&gt;Although they may look similar at first glance, ETL and ELT represent fundamentally different approaches to handling data transformations, and each has its strengths and trade-offs depending on the context in which it’s used.&lt;/p&gt;
&lt;h2&gt;What is ETL?&lt;/h2&gt;
&lt;p&gt;ETL stands for Extract, Transform, Load. It’s the traditional method used in many enterprise environments for years. The process starts by &lt;strong&gt;extracting&lt;/strong&gt; data from source systems such as databases, APIs, or flat files. This raw data is then &lt;strong&gt;transformed&lt;/strong&gt;: typically on a separate processing server or ETL engine, before it is finally &lt;strong&gt;loaded&lt;/strong&gt; into a data warehouse or other destination system.&lt;/p&gt;
&lt;p&gt;For example, imagine a retail company collecting daily sales data from multiple stores. In an ETL workflow, the system might extract those records at the end of the day, standardize formats, filter out corrupted rows, aggregate sales by region, and then load the clean, transformed dataset into a reporting warehouse like Snowflake or Redshift.&lt;/p&gt;
&lt;p&gt;One of the key advantages of ETL is that it allows you to load only clean, verified data into your warehouse. That often means smaller storage footprints and potentially better performance on downstream queries.&lt;/p&gt;
&lt;p&gt;However, this approach also has limitations. Because the transformation happens before loading, you must decide upfront how the data should be shaped. If business rules change or additional use cases emerge, you may need to go back and reprocess the data.&lt;/p&gt;
&lt;h2&gt;What is ELT?&lt;/h2&gt;
&lt;p&gt;ELT reverses the order of the last two steps: Extract, Load, Transform. In this model, raw data is extracted from the source and immediately &lt;strong&gt;loaded&lt;/strong&gt; into the target system - usually a cloud data warehouse that can scale horizontally. Once the data is in place, transformations are performed &lt;strong&gt;within&lt;/strong&gt; the warehouse using SQL or warehouse-native tools.&lt;/p&gt;
&lt;p&gt;This approach takes advantage of the high compute power and scalability of modern cloud platforms. Instead of bottlenecking on a dedicated ETL server, the warehouse can handle complex joins, aggregations, and transformations at scale.&lt;/p&gt;
&lt;p&gt;Let’s go back to the retail example. With ELT, all sales data is loaded as-is into the warehouse. Analysts or data engineers can then write transformation scripts to reshape the data for various use cases: trend analysis, regional comparisons, or fraud detection, all without having to re-ingest or reload the source data.&lt;/p&gt;
&lt;p&gt;ELT offers more flexibility for evolving requirements, supports broader self-service analytics, and enables faster time-to-insight. The trade-off is that it requires strong governance and monitoring. Because raw data is stored in the warehouse, the risk of exposing inconsistent or unclean data is higher if transformation logic isn’t managed carefully.&lt;/p&gt;
&lt;h2&gt;Choosing Between ETL and ELT&lt;/h2&gt;
&lt;p&gt;The decision to use ETL or ELT often depends on your stack, performance needs, and organizational practices.&lt;/p&gt;
&lt;p&gt;ETL still makes sense in environments with strict data governance, limited warehouse compute resources, or scenarios where only clean data should be retained. It’s also common in legacy systems and on-premise architectures.&lt;/p&gt;
&lt;p&gt;ELT shines in modern cloud-native environments where scalability and agility are top priorities. It’s often used with platforms like Snowflake, BigQuery, or Redshift, which are built to handle large volumes of raw data and complex SQL-based transformations efficiently.&lt;/p&gt;
&lt;p&gt;In practice, many organizations use a hybrid approach. Critical data may go through an ETL flow, while experimental or rapidly evolving datasets follow an ELT pattern.&lt;/p&gt;
&lt;h2&gt;The Bigger Picture&lt;/h2&gt;
&lt;p&gt;ETL and ELT are just different roads to the same destination: getting data ready for use. As the modern data stack evolves, so do the tools and best practices for managing these flows. Whether you choose one approach or blend both, what matters most is building pipelines that are reliable, maintainable, and aligned with your organization’s goals.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll focus on batch processing: the traditional foundation of many ETL workflows, and discuss how data engineers design, schedule, and optimize these processes for scale.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Batch Processing Fundamentals</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-04/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-04/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For many data engineering tasks, real-time insights aren’t necessary. In fact, a large portion of the data processed across organizations happens in scheduled intervals - daily sales reports, weekly data refreshes, monthly billing cycles. This is where batch processing comes in, and despite the growing popularity of streaming, batch remains the backbone of many data-driven workflows.&lt;/p&gt;
&lt;p&gt;In this post, we’ll explore what batch processing is, how it works under the hood, and why it’s still a critical technique in the data engineer’s toolbox.&lt;/p&gt;
&lt;h2&gt;What is Batch Processing?&lt;/h2&gt;
&lt;p&gt;Batch processing is the execution of data workflows on a predefined schedule or in response to specific triggers. Instead of processing data as it arrives, the system collects a set of data over a period of time, then processes that set as a single unit.&lt;/p&gt;
&lt;p&gt;This approach is particularly useful when data arrives in large quantities but doesn’t need to be acted on immediately. For example, processing daily transactions from a point-of-sale system or generating overnight reports for executive dashboards.&lt;/p&gt;
&lt;p&gt;Batch jobs are often triggered at set times: say, every night at 2 a.m., and are designed to run until completion, often without user interaction. They can run for seconds, minutes, or even hours depending on the volume of data and complexity of the transformations.&lt;/p&gt;
&lt;h2&gt;Under the Hood: How Batch Jobs Work&lt;/h2&gt;
&lt;p&gt;The anatomy of a batch job usually includes several stages. First, the job identifies the data it needs to process. This might involve querying a database for all records created in the last 24 hours or scanning a specific folder in object storage for new files.&lt;/p&gt;
&lt;p&gt;Next comes the transformation phase. This is where data is cleaned, filtered, joined with other datasets, and reshaped to fit its target structure. This phase can include tasks like date formatting, currency conversion, null value imputation, or the calculation of derived fields.&lt;/p&gt;
&lt;p&gt;Finally, the job writes the transformed data to its destination - often a data warehouse, data lake, or downstream reporting system.&lt;/p&gt;
&lt;p&gt;To manage all of this, engineers rely on workflow orchestration tools. These tools provide scheduling, error handling, and logging capabilities to ensure that jobs run in the right order and can recover gracefully from failure.&lt;/p&gt;
&lt;h2&gt;Tools and Technologies&lt;/h2&gt;
&lt;p&gt;Several tools have become staples in batch-oriented workflows. Apache Airflow is one of the most widely used. It allows engineers to define complex workflows as Directed Acyclic Graphs (DAGs), where each node represents a task and dependencies are explicitly declared.&lt;/p&gt;
&lt;p&gt;Other tools like Luigi and Oozie offer similar functionality, though they are less commonly used in newer stacks. Cloud-native platforms such as AWS Glue and Google Cloud Composer provide managed orchestration services that integrate tightly with the respective cloud ecosystems.&lt;/p&gt;
&lt;p&gt;In addition to orchestration, batch jobs often depend on distributed processing engines like Apache Spark. Spark allows massive datasets to be processed in parallel across a cluster of machines, reducing processing times dramatically compared to traditional single-node tools.&lt;/p&gt;
&lt;h2&gt;Strengths and Limitations&lt;/h2&gt;
&lt;p&gt;One of the biggest advantages of batch processing is its simplicity. Since data is processed in chunks, you can apply robust validation and error-handling routines before moving data downstream. It&apos;s also easier to track and audit, which is especially important for regulated industries.&lt;/p&gt;
&lt;p&gt;Batch jobs are also cost-efficient when working with large volumes of data that don’t require immediate availability. Processing once per day means you can spin up compute resources only when needed, rather than keeping systems running continuously.&lt;/p&gt;
&lt;p&gt;However, the main limitation is latency. If something happens in your business: say, a spike in fraudulent transactions, you won’t know about it until after the next batch job runs. For use cases that require faster insights or real-time responsiveness, batch processing isn’t sufficient.&lt;/p&gt;
&lt;p&gt;There’s also the issue of windowing and completeness. Since batch jobs process data in slices, late-arriving records can fall outside the intended window unless carefully managed. This adds complexity to pipeline design and requires thoughtful handling of time-based logic.&lt;/p&gt;
&lt;h2&gt;Where Batch Still Shines&lt;/h2&gt;
&lt;p&gt;Despite its limitations, batch processing remains ideal for a wide range of use cases. Financial reconciliations, data archival, slow-changing dimensional data updates, and long-running analytics workloads are just a few examples where batch continues to dominate.&lt;/p&gt;
&lt;p&gt;As a data engineer, understanding how to design efficient and reliable batch workflows is an essential skill, especially in environments where consistency and auditability are critical.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll explore the counterpart to batch: streaming data processing. We’ll look at what it means to process data in real time, how it differs from batch, and what patterns and tools make it work.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Streaming Data Fundamentals</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-05/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-05/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In contrast to batch processing, where data is collected and processed in chunks, streaming data processing deals with data in motion. Instead of waiting for data to accumulate before running transformations, streaming pipelines ingest and process each piece of data as it arrives. This model enables organizations to respond to events in real time, a capability that’s becoming increasingly essential in domains like finance, security, and customer experience.&lt;/p&gt;
&lt;p&gt;In this post, we’ll unpack the core ideas behind streaming, how it works in practice, and the challenges it presents compared to traditional batch systems.&lt;/p&gt;
&lt;h2&gt;What is Streaming Data?&lt;/h2&gt;
&lt;p&gt;Streaming data refers to data that is continuously generated by various sources: website clicks, IoT sensors, user interactions, system logs, and transmitted in real time or near-real time. This data typically arrives in small payloads, often as individual events, and needs to be processed with minimal delay.&lt;/p&gt;
&lt;p&gt;The goal of a streaming pipeline is to capture this data as it’s generated, perform necessary transformations, and deliver it to its destination with as little latency as possible.&lt;/p&gt;
&lt;p&gt;A simple example would be a ride-sharing app that tracks vehicle locations in real time. As each car moves, GPS data is streamed to a backend system that updates the user interface and helps dispatch rides based on current conditions.&lt;/p&gt;
&lt;h2&gt;How Streaming Systems Work&lt;/h2&gt;
&lt;p&gt;Unlike batch jobs that execute on a schedule, streaming systems run continuously. They consume data from a source, process it incrementally, and push it to a sink: all without waiting for a dataset to be complete.&lt;/p&gt;
&lt;p&gt;At the heart of a streaming system is a message broker or event queue, which acts as a buffer between data producers and consumers. Apache Kafka is a popular choice here. It allows producers to publish events to topics, and consumers to read from those topics independently, often with strong guarantees around ordering and durability.&lt;/p&gt;
&lt;p&gt;Once events are ingested, a processing engine takes over. Tools like Apache Flink, Spark Structured Streaming, and Apache Beam allow developers to apply transformations on a per-record basis or over time-based windows. This is where operations like filtering, aggregating, joining, and enriching occur.&lt;/p&gt;
&lt;p&gt;These transformations must be designed to handle data that may arrive late, out of order, or in bursts. As such, streaming systems often implement complex logic to manage time: distinguishing between event time (when the event occurred) and processing time (when it was received), to ensure results are accurate.&lt;/p&gt;
&lt;h2&gt;Use Cases and Business Impact&lt;/h2&gt;
&lt;p&gt;The appeal of streaming pipelines lies in their ability to power real-time applications. Fraud detection systems can flag suspicious transactions as they happen. E-commerce platforms can recommend products based on live browsing behavior. Logistics companies can monitor fleet activity and adjust routes on the fly.&lt;/p&gt;
&lt;p&gt;In operational analytics, dashboards fed by streaming data provide up-to-the-minute visibility, allowing teams to make informed decisions in response to changing conditions.&lt;/p&gt;
&lt;p&gt;Streaming is also a foundational component of event-driven architectures. When services communicate via events, streaming systems act as the glue that ties the application together, enabling asynchronous, decoupled interactions.&lt;/p&gt;
&lt;h2&gt;Challenges in Streaming Systems&lt;/h2&gt;
&lt;p&gt;Despite its power, streaming introduces complexity that shouldn’t be underestimated. Handling late or out-of-order data is a major concern. If an event shows up ten minutes after it was supposed to be processed, the system must be smart enough to either incorporate it correctly or account for the gap.&lt;/p&gt;
&lt;p&gt;State management is another critical factor. When a pipeline needs to remember information across multiple events: like keeping a running total or maintaining a session, it must manage that state reliably, often across distributed systems.&lt;/p&gt;
&lt;p&gt;There’s also the issue of fault tolerance. Streaming systems must be able to recover from crashes or network issues without duplicating results or losing data. This requires sophisticated checkpointing, replay, and exactly-once processing semantics, which tools like Flink and Beam are designed to provide.&lt;/p&gt;
&lt;p&gt;Finally, testing and debugging streaming pipelines can be more difficult than batch jobs. Because they run continuously and deal with time-sensitive data, reproducing issues often requires specialized tooling or replay mechanisms.&lt;/p&gt;
&lt;h2&gt;When to Choose Streaming&lt;/h2&gt;
&lt;p&gt;Streaming makes sense when low-latency data processing is essential to the business. This could mean operational decision-making, customer experience personalization, or complex event processing in a microservices architecture.&lt;/p&gt;
&lt;p&gt;It’s not always the right tool for the job, though. For workloads that don’t require immediate insights: or where simplicity and reliability matter more, batch processing remains the better choice.&lt;/p&gt;
&lt;p&gt;As data engineers, the key is to understand the trade-offs and choose the right pattern for each use case.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll shift gears and look at how data is modeled for analytics. Understanding the differences between OLTP and OLAP systems, as well as the pros and cons of different schema designs, is critical to building pipelines that serve real business needs.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Data Modeling Basics</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-06/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-06/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Behind every useful dashboard or analytics report lies a well-structured data model. Data modeling is the practice of shaping data into organized structures that are easy to query, analyze, and maintain. While it may sound abstract, modeling directly impacts how quickly and accurately data consumers can extract value from the information stored in your systems.&lt;/p&gt;
&lt;p&gt;In this post, we’ll look at the foundations of data modeling, the difference between OLTP and OLAP systems, and common schema designs that data engineers use to build efficient and scalable data platforms.&lt;/p&gt;
&lt;h2&gt;Why Data Modeling Matters&lt;/h2&gt;
&lt;p&gt;When data arrives from source systems, it’s often raw and optimized for transactions, not analysis. A transactional database might record every sale or click in granular detail, but that structure doesn’t translate well into aggregations like “monthly revenue by product category.”&lt;/p&gt;
&lt;p&gt;A data model reshapes that data to make it usable. Good models reduce complexity, improve performance, and minimize errors. Poor models, on the other hand, lead to slow queries, redundant data, and confusion about what numbers really mean.&lt;/p&gt;
&lt;p&gt;Modeling is both a technical and a collaborative process. It requires not just understanding how data is structured, but also how the business thinks about that data - what questions need answering, how metrics are defined, and what trade-offs are acceptable.&lt;/p&gt;
&lt;h2&gt;OLTP vs OLAP: Two Worlds, Two Purposes&lt;/h2&gt;
&lt;p&gt;Before diving into specific modeling techniques, it’s important to distinguish between the two main types of data systems: OLTP and OLAP.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;OLTP (Online Transaction Processing)&lt;/strong&gt; systems are built for real-time operations. Think of point-of-sale systems, user authentication services, or banking apps. These systems are optimized for high-throughput reads and writes, handling thousands of small transactions per second. Their schemas are typically highly normalized to avoid data duplication and to keep updates fast and consistent.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;OLAP (Online Analytical Processing)&lt;/strong&gt; systems, on the other hand, are designed for analysis. These platforms support complex queries over large volumes of historical data. Performance here is about aggregating, filtering, and summarizing - not handling rapid transactions. Because of this, OLAP models often trade strict normalization for faster access to pre-joined or denormalized data.&lt;/p&gt;
&lt;p&gt;Understanding whether your system is OLTP or OLAP helps determine how you model your data. The techniques and trade-offs are different depending on the system’s purpose.&lt;/p&gt;
&lt;h2&gt;Normalization and Denormalization&lt;/h2&gt;
&lt;p&gt;In OLTP systems, normalization is the standard. This means structuring data so that each fact is stored in exactly one place. For example, instead of storing a customer’s name with every order record, you keep customer details in a separate table and reference them via a key.&lt;/p&gt;
&lt;p&gt;This approach minimizes redundancy, reduces storage, and simplifies updates. Change the customer’s name in one place, and every order reflects that change immediately.&lt;/p&gt;
&lt;p&gt;In analytical systems, this level of indirection becomes a performance bottleneck. Complex queries must join many tables together, which can slow things down significantly.&lt;/p&gt;
&lt;p&gt;That’s where &lt;strong&gt;denormalization&lt;/strong&gt; comes in. In OLAP models, it’s common to store data in a flattened format, with descriptive attributes repeated across rows. While this increases storage requirements, it significantly speeds up query performance and simplifies logic for analysts and BI tools.&lt;/p&gt;
&lt;h2&gt;Star and Snowflake Schemas&lt;/h2&gt;
&lt;p&gt;Two common modeling patterns in OLAP systems are the &lt;strong&gt;star schema&lt;/strong&gt; and the &lt;strong&gt;snowflake schema&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;star schema&lt;/strong&gt; organizes data around a central fact table. This table holds measurable events: like sales transactions, with keys that reference surrounding dimension tables, which contain descriptive attributes such as product names, customer demographics, or store locations.&lt;/p&gt;
&lt;p&gt;In a star schema, the dimension tables are typically denormalized. This makes queries straightforward and fast: one central join connects the fact table to all the attributes needed for analysis.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;snowflake schema&lt;/strong&gt; takes this idea further by normalizing the dimension tables. Instead of a single product dimension table, for example, you might have separate tables for product, category, and supplier. This saves space and can improve maintainability, but at the cost of more complex joins.&lt;/p&gt;
&lt;p&gt;The choice between star and snowflake schemas depends on your performance needs, data volume, and how often attributes change.&lt;/p&gt;
&lt;h2&gt;Modeling for Flexibility and Growth&lt;/h2&gt;
&lt;p&gt;Good data models are designed with change in mind. New columns will be added, relationships will evolve, and new metrics will be needed. A rigid model can become a bottleneck, while a flexible one supports ongoing development.&lt;/p&gt;
&lt;p&gt;One best practice is to favor additive metrics when possible. These are measures you can safely sum across time or groups - like revenue or quantity sold. Additive metrics work better with aggregations and are easier to model consistently.&lt;/p&gt;
&lt;p&gt;It’s also important to consider slowly changing dimensions. For example, if a customer’s email address or a product’s price changes, do you want to reflect the latest value, or keep historical versions? Modeling for this kind of change requires thought about versioning and historical accuracy.&lt;/p&gt;
&lt;h2&gt;The Road Ahead&lt;/h2&gt;
&lt;p&gt;Data modeling sits at the intersection of technical design and business logic. It’s not just about tables and keys - it’s about making data intuitive and useful for the people who depend on it.&lt;/p&gt;
&lt;p&gt;As data engineers, our role is to create models that strike a balance between performance, maintainability, and expressiveness. Doing this well requires not just technical skill, but ongoing communication with analysts, stakeholders, and subject matter experts.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll take a closer look at data warehousing - how these models are stored, queried, and optimized in systems built for analytics at scale.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Data Warehousing Fundamentals</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-07/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-07/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Data warehouses serve as the analytical backbone for many organizations. They are purpose-built systems that store structured data optimized for fast querying and aggregation. While data lakes handle raw, unstructured data at scale, data warehouses focus on delivering clean, organized datasets to analysts, BI tools, and decision-makers.&lt;/p&gt;
&lt;p&gt;In this post, we&apos;ll break down what makes a data warehouse different from other storage systems, how it&apos;s architected, and what practices ensure it performs efficiently as your data and business grow.&lt;/p&gt;
&lt;h2&gt;The Role of a Data Warehouse&lt;/h2&gt;
&lt;p&gt;At a high level, a data warehouse collects data from multiple operational systems and stores it in a way that makes analysis easy and consistent. Instead of digging through individual source systems: like sales platforms, CRM tools, or web analytics, users can query a centralized warehouse that’s been curated and modeled for insight.&lt;/p&gt;
&lt;p&gt;This consolidation allows organizations to apply consistent definitions for metrics, reduce the risk of conflicting data interpretations, and dramatically improve performance for analytical workloads.&lt;/p&gt;
&lt;p&gt;Where a transactional database is designed to handle lots of small, rapid reads and writes, a data warehouse is designed to scan large volumes of data efficiently. These systems optimize for queries like “What were our top five products last quarter?” or “How did regional sales trend year-over-year?”&lt;/p&gt;
&lt;h2&gt;Architecture and Components&lt;/h2&gt;
&lt;p&gt;A traditional data warehouse is structured with a clear separation between compute and storage. In legacy on-premise systems like Teradata or Oracle, both functions were tightly coupled. In modern cloud-native systems like Snowflake or BigQuery, storage and compute are decoupled, which allows more flexible scaling.&lt;/p&gt;
&lt;p&gt;The core of a warehouse is the schema: the logical structure defining how data is organized into tables, relationships, and hierarchies. As discussed in the previous post, these tables often follow star or snowflake patterns, with fact tables surrounded by dimension tables that provide context.&lt;/p&gt;
&lt;p&gt;One of the key components of a warehouse is its query engine. This engine is built to efficiently execute SQL queries, taking advantage of indexing, partitioning, and columnar storage formats to return results quickly even when scanning billions of rows.&lt;/p&gt;
&lt;p&gt;Data warehouses also maintain metadata: information about data types, table relationships, and data lineage, that helps users navigate and trust the system. Many modern platforms also offer built-in tools for access control, versioning, and data classification to support governance.&lt;/p&gt;
&lt;h2&gt;Performance Optimization: Partitioning and Clustering&lt;/h2&gt;
&lt;p&gt;As warehouses scale, query performance becomes a key concern. It’s not enough to simply store the data - you also need to retrieve it quickly and cost-effectively.&lt;/p&gt;
&lt;p&gt;One common optimization is &lt;strong&gt;partitioning&lt;/strong&gt;, which breaks up large tables into smaller, manageable chunks based on a field like date, region, or product category. When a query specifies a filter on that field, the engine can skip over partitions that aren’t relevant, significantly reducing scan times.&lt;/p&gt;
&lt;p&gt;Another technique is &lt;strong&gt;clustering&lt;/strong&gt;, which organizes the physical layout of data based on a set of fields that are commonly filtered or joined on. For example, clustering sales records by customer ID can improve performance for queries that retrieve purchase history.&lt;/p&gt;
&lt;p&gt;Columnar storage is also key to performance. Unlike row-based storage, which keeps all fields of a record together, columnar formats like those used in BigQuery or Redshift store each column separately. This allows the engine to scan only the columns needed for a query, reducing I/O and speeding up execution.&lt;/p&gt;
&lt;h2&gt;Data Loading and Refresh Patterns&lt;/h2&gt;
&lt;p&gt;Getting data into the warehouse is typically done through ETL or ELT processes. These pipelines extract data from source systems, apply transformations, and load the result into warehouse tables.&lt;/p&gt;
&lt;p&gt;Loading can happen in batches: say, every hour or once a day, or in micro-batches that simulate near-real-time ingestion. The right frequency depends on your business needs and the capabilities of your orchestration tools.&lt;/p&gt;
&lt;p&gt;Incremental loading is often preferred over full reloads. By only processing new or changed records, pipelines reduce load times and warehouse compute costs. This usually requires tracking change data through mechanisms like timestamps or change data capture (CDC).&lt;/p&gt;
&lt;h2&gt;Warehouse Technologies&lt;/h2&gt;
&lt;p&gt;Several platforms dominate the modern data warehousing space, each with its strengths.&lt;/p&gt;
&lt;p&gt;Snowflake offers a fully managed, multi-cluster architecture with automatic scaling and support for semi-structured data. It separates compute from storage and supports concurrent workloads with minimal tuning.&lt;/p&gt;
&lt;p&gt;Google BigQuery is a serverless, query-on-demand platform that excels at ad hoc analytics and scales seamlessly with user demand. It’s ideal for teams that want fast performance without managing infrastructure.&lt;/p&gt;
&lt;p&gt;Amazon Redshift provides deep integration with the AWS ecosystem and allows more control over configuration, which can be valuable for teams with specific performance tuning needs.&lt;/p&gt;
&lt;p&gt;Each of these platforms supports ANSI SQL, integrates with major BI tools, and offers features for security, monitoring, and data governance.&lt;/p&gt;
&lt;h2&gt;Wrapping Up&lt;/h2&gt;
&lt;p&gt;A data warehouse isn’t just a place to store data - it’s the system of record for analytics. Its structure, performance, and accessibility determine how quickly stakeholders can make informed decisions.&lt;/p&gt;
&lt;p&gt;Designing and maintaining an effective warehouse requires a thoughtful approach to modeling, data loading, and performance tuning. As your organization grows, so do the expectations placed on your warehouse to handle increasing complexity, scale, and demand for real-time insight.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll explore how data lakes differ from warehouses, and how they offer a flexible, scalable foundation for managing large volumes of diverse data types.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Data Lakes Explained</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-08/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-08/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As data volumes grow and the types of data organizations work with become more varied, traditional data warehouses start to show their limits. Structured data fits neatly into tables, but what about videos, logs, images, or JSON documents with unpredictable formats? This is where the concept of a data lake comes into play.&lt;/p&gt;
&lt;p&gt;In this post, we’ll explore what a data lake is, how it compares to a data warehouse, and why it’s become a cornerstone of modern data architecture.&lt;/p&gt;
&lt;h2&gt;What is a Data Lake?&lt;/h2&gt;
&lt;p&gt;A data lake is a centralized repository designed to store data in its raw form. Whether the data is structured like CSV files, semi-structured like JSON, or unstructured like text or images, the lake accepts it all. It acts as a catch-all layer for every piece of data an organization might want to use for analysis, training models, or historical archiving.&lt;/p&gt;
&lt;p&gt;Unlike a data warehouse, which expects a predefined schema and consistent structure, a data lake embraces flexibility. The idea is to collect the data first and figure out how to use it later: a principle often referred to as schema-on-read.&lt;/p&gt;
&lt;p&gt;This approach enables data engineers and scientists to access and experiment with data that hasn’t yet been modeled or cleaned. It fosters innovation by removing upfront constraints about how data should look.&lt;/p&gt;
&lt;h2&gt;Key Characteristics&lt;/h2&gt;
&lt;p&gt;At its core, a data lake is built on inexpensive, scalable storage - typically object storage like Amazon S3, Azure Data Lake Storage, or Google Cloud Storage. These systems offer the capacity to store petabytes of data without the overhead of traditional database systems.&lt;/p&gt;
&lt;p&gt;Because lakes deal with raw data, they don’t enforce strict schemas when data is written. Instead, structure is applied at query time. This allows different teams to interpret the same data in different ways, depending on the analysis they want to perform.&lt;/p&gt;
&lt;p&gt;This flexibility is powerful, but it comes with a cost: governance becomes more challenging. Without strong metadata management and data cataloging, lakes can quickly turn into what’s often called a “data swamp”: a cluttered repository that’s hard to navigate or trust.&lt;/p&gt;
&lt;h2&gt;Data Lakes vs Data Warehouses&lt;/h2&gt;
&lt;p&gt;The primary difference between data lakes and data warehouses lies in structure and purpose.&lt;/p&gt;
&lt;p&gt;Data warehouses are optimized for structured data, curated models, and consistent performance. They serve business users who need reliable access to cleaned, aggregated data for dashboards and reports.&lt;/p&gt;
&lt;p&gt;Data lakes are optimized for scale and flexibility. They support raw data, including logs, sensor output, and third-party feeds, making them ideal for machine learning and advanced analytics. While a warehouse is all about predefined questions and structured answers, a lake is about exploration and experimentation.&lt;/p&gt;
&lt;p&gt;In practice, many organizations use both. The lake acts as the foundation, storing everything, while the warehouse sits on top as a refined layer for operational analytics. This layered architecture sets the stage for more advanced approaches, such as the data lakehouse, which we&apos;ll explore later in this series.&lt;/p&gt;
&lt;h2&gt;Building and Managing a Data Lake&lt;/h2&gt;
&lt;p&gt;Creating a data lake involves more than dumping files into storage. A well-functioning lake includes clear organization, access controls, and metadata layers that describe what each dataset is, where it came from, and how it’s used.&lt;/p&gt;
&lt;p&gt;Data is often organized into zones. A raw zone stores unprocessed source data. A staging or clean zone contains transformed and validated datasets. A curated zone includes data that’s ready for consumption by analysts or applications.&lt;/p&gt;
&lt;p&gt;Maintaining this structure helps manage lifecycle policies, access permissions, and lineage. Cataloging tools like AWS Glue, Apache Hive Metastore, or more modern solutions like Amundsen or DataHub help track what’s in the lake and make it discoverable.&lt;/p&gt;
&lt;p&gt;Processing engines like Apache Spark, Presto, or Dremio allow users to query data directly in the lake, using SQL or custom logic. These tools interpret files stored in formats like Parquet, ORC, or Avro, applying structure dynamically based on metadata or inferred schema.&lt;/p&gt;
&lt;h2&gt;When to Use a Data Lake&lt;/h2&gt;
&lt;p&gt;A data lake makes the most sense when you’re dealing with large volumes of diverse data types or when you&apos;re unsure how the data will be used. It’s particularly valuable in environments focused on research, machine learning, or combining traditional business data with less conventional sources like social media or IoT signals.&lt;/p&gt;
&lt;p&gt;However, if you need consistent, curated data for business reporting, a warehouse may be the better choice. Data lakes and warehouses serve different needs, and understanding how they complement each other is key to building a balanced architecture.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll look at storage formats and compression - essential building blocks for making data lakes and warehouses efficient, scalable, and cost-effective.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Storage Formats and Compression</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-09/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-09/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When working with large-scale data systems, it&apos;s not just what data you store that matters - it&apos;s how you store it. The choice of storage format and compression strategy can make a significant difference in performance, cost, and usability. These decisions affect how quickly you can query data, how much storage space you need, and even how compatible your data is with various processing tools.&lt;/p&gt;
&lt;p&gt;In this post, we’ll explore the most common data storage formats, the role of compression, and how these choices impact modern data engineering workflows.&lt;/p&gt;
&lt;h2&gt;Why Storage Format Matters&lt;/h2&gt;
&lt;p&gt;Raw data often arrives in simple formats like CSV or JSON, and for small volumes, these formats work just fine. But as data grows into gigabytes or terabytes, inefficiencies start to show.&lt;/p&gt;
&lt;p&gt;Text-based formats like CSV are easy to read and parse, but they lack schema enforcement, are verbose, and are slow to process in distributed systems. JSON adds some flexibility by allowing nested structures, but it can still be quite large and inefficient when stored at scale.&lt;/p&gt;
&lt;p&gt;Columnar formats, by contrast, are designed for analytics. Instead of storing data row by row, they store values column by column. This layout enables faster queries and better compression - especially for workloads that scan only a few columns at a time.&lt;/p&gt;
&lt;p&gt;Imagine a table with hundreds of columns, but your query only needs five. With a row-based format, the system must read everything. With a columnar format, it reads just what’s needed. This is a game-changer for performance and cost in systems like data lakes and warehouses.&lt;/p&gt;
&lt;h2&gt;Common Formats in Practice&lt;/h2&gt;
&lt;p&gt;Several formats are widely used in data engineering, each with trade-offs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;CSV&lt;/strong&gt; remains popular due to its simplicity and universal support. But it lacks strong typing and is prone to edge-case issues, such as inconsistent delimiters or quoting problems. It&apos;s best used for small datasets or temporary interoperability.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;JSON&lt;/strong&gt; and &lt;strong&gt;XML&lt;/strong&gt; are useful for semi-structured data. JSON, in particular, is common in APIs and logs. However, it’s not space-efficient and can be slow to parse at scale.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Parquet&lt;/strong&gt; is a columnar format developed by Apache. It&apos;s optimized for big data workloads and supports advanced features like nested schemas and predicate pushdown. Parquet is well-supported across tools like Spark, Hive, Dremio, and data warehouses like BigQuery and Snowflake.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Avro&lt;/strong&gt; is a row-based format with support for schema evolution. It’s often used in streaming applications and data serialization. While it’s not as query-efficient as Parquet, it excels in write-heavy and messaging scenarios.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ORC&lt;/strong&gt; (Optimized Row Columnar) is similar to Parquet but originally developed for the Hadoop ecosystem. It offers strong compression and performance benefits for read-heavy workloads.&lt;/p&gt;
&lt;p&gt;Choosing between these often comes down to the nature of the workload. If you&apos;re doing analytics over large datasets, columnar formats like Parquet or ORC are usually the right call. If you&apos;re capturing events or streaming messages, Avro might be a better fit.&lt;/p&gt;
&lt;h2&gt;The Role of Compression&lt;/h2&gt;
&lt;p&gt;Compression reduces file sizes by encoding repeated or predictable patterns more efficiently. In distributed systems, this saves both storage space and network bandwidth, speeding up data movement and reducing cost.&lt;/p&gt;
&lt;p&gt;Compression can be applied at the file level or at the column level (in columnar formats). Modern formats like Parquet support multiple compression codecs, including Snappy, Gzip, Brotli, and Zstd.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Snappy&lt;/strong&gt; offers fast compression and decompression, making it a good default choice when speed matters more than maximum size reduction. &lt;strong&gt;Gzip&lt;/strong&gt; provides better compression ratios but is slower. &lt;strong&gt;Zstd&lt;/strong&gt; and &lt;strong&gt;Brotli&lt;/strong&gt; strike a balance, offering both speed and compression efficiency.&lt;/p&gt;
&lt;p&gt;When choosing a compression strategy, consider the use case. For interactive querying, speed matters, so faster codecs like Snappy are preferred. For archival data or large transfers, stronger compression may save more money in the long run.&lt;/p&gt;
&lt;h2&gt;Compatibility and Ecosystem Support&lt;/h2&gt;
&lt;p&gt;Storage format decisions also impact which tools you can use. Most modern data tools support Parquet and Avro natively, but compatibility can vary depending on the processing engine.&lt;/p&gt;
&lt;p&gt;For example, if you&apos;re building a data lake on S3 and using Apache Spark for processing, Parquet is almost always a safe choice. It integrates well with tools like Hive Metastore, Presto, Trino, and Dremio.&lt;/p&gt;
&lt;p&gt;If you’re using Kafka or other message queues, Avro is a common format due to its compactness and schema registry support.&lt;/p&gt;
&lt;p&gt;It’s also worth considering schema evolution - how well a format handles changes in the data structure over time. Avro and Parquet both support schema evolution, which allows you to add or remove fields without breaking downstream systems. This is crucial in agile environments where data changes frequently.&lt;/p&gt;
&lt;h2&gt;Putting It All Together&lt;/h2&gt;
&lt;p&gt;The best storage strategy balances performance, flexibility, and compatibility. There’s no one-size-fits-all answer, but understanding the characteristics of each format: and how compression affects storage and query speed, allows you to make informed choices.&lt;/p&gt;
&lt;p&gt;As data engineers, our job is to pick the right tools for the job, not just default to what’s familiar. Thoughtful decisions at the storage layer can ripple across the entire data stack, affecting cost, speed, and scalability.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll turn our attention to data quality and validation - because no matter how well your data is stored, it’s only as good as it is accurate, complete, and trustworthy.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Data Quality and Validation</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-10/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-10/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In any data system, quality is not optional - it’s foundational. No matter how scalable your architecture is, or how fast your queries run, if the underlying data is inaccurate, incomplete, or inconsistent, the results will be misleading. And bad data leads to bad decisions.&lt;/p&gt;
&lt;p&gt;This post focuses on data quality and validation. We&apos;ll look at what makes data &amp;quot;good,&amp;quot; why quality issues emerge, and how engineers can build checks and balances into pipelines to ensure the reliability of their datasets.&lt;/p&gt;
&lt;h2&gt;Defining Data Quality&lt;/h2&gt;
&lt;p&gt;At its core, data quality is about trust. Can the data be used confidently for reporting, analytics, or decision-making? While quality is a broad concept, it typically includes several dimensions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Accuracy&lt;/strong&gt;: Does the data reflect reality? For example, does a customer record show the correct name and email?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Completeness&lt;/strong&gt;: Are all required fields populated? Missing data can render entire records useless.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency&lt;/strong&gt;: Is the data uniform across systems? If two systems say different things about the same event, which one is right?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Timeliness&lt;/strong&gt;: Is the data fresh enough for its intended purpose? A report showing yesterday’s numbers might be fine - or it might be too late.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Uniqueness&lt;/strong&gt;: Are there duplicate records that shouldn’t exist?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These attributes form the foundation of what we think of as “high-quality” data. But quality isn&apos;t static - it needs to be monitored continuously.&lt;/p&gt;
&lt;h2&gt;Where Data Quality Breaks Down&lt;/h2&gt;
&lt;p&gt;Quality issues usually arise at system boundaries. When data moves from one source to another: say, from a transactional database to a warehouse or from an API to a data lake, transformations, encoding issues, and format mismatches can cause subtle errors.&lt;/p&gt;
&lt;p&gt;Sometimes data is flawed at the source. A user enters a malformed email address, or a sensor transmits faulty readings due to hardware glitches. Other times, issues emerge downstream, such as when a pipeline fails silently or when schema changes aren’t communicated across teams.&lt;/p&gt;
&lt;p&gt;Even well-designed systems can encounter quality problems if the underlying business logic evolves. For example, a rule that defines how revenue is calculated may change, invalidating previous calculations if pipelines aren’t updated accordingly.&lt;/p&gt;
&lt;h2&gt;The Role of Validation&lt;/h2&gt;
&lt;p&gt;To combat these issues, validation is key. Validation is the act of checking data against expected rules and assumptions - often before it gets loaded into a final destination.&lt;/p&gt;
&lt;p&gt;This can happen at multiple stages of a pipeline. During ingestion, validation might confirm that all required fields are present and formatted correctly. During transformation, it might enforce business rules, such as ensuring that order totals are positive or that timestamps are within reasonable ranges.&lt;/p&gt;
&lt;p&gt;Validation can be passive: logging anomalies for review, or active, stopping a pipeline if thresholds are exceeded. Both approaches have their place. In some cases, it&apos;s better to allow partial data to flow through and alert the team. In others, it’s critical to block the update to prevent contamination of production datasets.&lt;/p&gt;
&lt;h2&gt;Tools for Data Quality&lt;/h2&gt;
&lt;p&gt;Several tools and frameworks have emerged to help engineers define, monitor, and enforce data quality checks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Great Expectations&lt;/strong&gt; is one of the most well-known. It allows you to define “expectations” about your data - essentially, assertions about what should be true. These expectations can be validated at runtime, and the results can be logged, visualized, or used to trigger alerts.&lt;/p&gt;
&lt;p&gt;Another option is &lt;strong&gt;Amazon Deequ&lt;/strong&gt;, a library built on top of Apache Spark that performs similar validations at scale. It’s particularly useful in large distributed environments where running manual checks would be too costly.&lt;/p&gt;
&lt;p&gt;Some orchestration platforms, like Airflow and Dagster, support custom sensors or hooks that let you embed validation logic directly into the DAG. This tight integration makes it easier to halt jobs or notify teams when something goes wrong.&lt;/p&gt;
&lt;p&gt;Beyond tools, quality also depends on process. Data contracts, code reviews, and automated testing all contribute to building a culture where quality is prioritized from the start, not added as an afterthought.&lt;/p&gt;
&lt;h2&gt;Designing for Trust&lt;/h2&gt;
&lt;p&gt;A key principle in data engineering is that quality doesn&apos;t just happen - it must be designed. That means proactively defining what “correct” looks like, instrumenting checks, and making sure failures are surfaced early.&lt;/p&gt;
&lt;p&gt;Dashboards and data catalogs can help surface issues. But even more important is visibility: stakeholders need to know when data is delayed, incomplete, or incorrect. Setting up alerts based on data quality metrics helps teams respond quickly before problems reach downstream consumers.&lt;/p&gt;
&lt;p&gt;The cost of low-quality data isn&apos;t just technical - it&apos;s strategic. If users lose faith in the data, they stop relying on it. And once trust is gone, it’s incredibly hard to rebuild.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll examine how metadata, lineage, and governance play a role in maintaining data integrity across complex systems. Knowing where your data came from and how it was transformed is just as important as validating its contents.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Metadata, Lineage, and Governance</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-11/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-11/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As data systems grow more complex, understanding where your data came from, how it has changed, and who is responsible for it becomes just as critical as the data itself. It’s not enough to know that a dataset exists - you need to know how it was created, whether it’s trustworthy, and how it fits into the broader system.&lt;/p&gt;
&lt;p&gt;In this post, we’ll break down three interconnected concepts: metadata, data lineage, and governance, and explore why they’re essential to building transparent, scalable, and compliant data infrastructure.&lt;/p&gt;
&lt;h2&gt;What Is Metadata?&lt;/h2&gt;
&lt;p&gt;Metadata is data about data. It describes the contents, structure, and context of a dataset, giving you the information needed to understand how to work with it.&lt;/p&gt;
&lt;p&gt;At the most basic level, metadata includes things like column names, data types, and row counts. But it can go much deeper. Metadata can describe data freshness (when it was last updated), sensitivity (whether it contains personally identifiable information), and ownership (who created or maintains the dataset).&lt;/p&gt;
&lt;p&gt;Well-managed metadata serves as a map to your data ecosystem. It helps engineers understand dependencies, enables analysts to find the right datasets, and assists compliance teams in locating sensitive information.&lt;/p&gt;
&lt;p&gt;Without metadata, even high-quality data becomes hard to use. Teams end up duplicating effort, making incorrect assumptions, or spending more time asking questions than building insights.&lt;/p&gt;
&lt;h2&gt;Understanding Data Lineage&lt;/h2&gt;
&lt;p&gt;Data lineage is the history of how data moves and changes through your systems. It traces the path from the original source: say, a transactional database or API, all the way to its final destination in a dashboard, report, or machine learning model.&lt;/p&gt;
&lt;p&gt;Lineage tells you not just where the data is now, but how it got there. Which tables did it pass through? What transformations were applied? Was any filtering, aggregation, or enrichment performed?&lt;/p&gt;
&lt;p&gt;This visibility is crucial for several reasons. First, it helps with debugging. When a report shows an unexpected number, lineage lets you trace the logic backwards to find the source of the issue. Second, it supports impact analysis. If a schema changes in a source table, you can immediately see which downstream systems are affected.&lt;/p&gt;
&lt;p&gt;In regulated industries, lineage is also a compliance requirement. Auditors often want to see a clear trail from raw data to final output to ensure accuracy, transparency, and accountability.&lt;/p&gt;
&lt;h2&gt;The Role of Data Governance&lt;/h2&gt;
&lt;p&gt;Data governance is the set of policies, processes, and roles that ensure data is managed responsibly across an organization. It covers who has access to what data, how it should be handled, and how changes are documented and approved.&lt;/p&gt;
&lt;p&gt;Governance is often misunderstood as being purely about control, but it’s really about enabling trust at scale. In small teams, people can rely on informal communication to manage data. In large organizations, clear governance is the only way to prevent chaos.&lt;/p&gt;
&lt;p&gt;Good governance defines roles and responsibilities. Who is the data owner? Who approves changes? Who can grant access? It also sets standards for naming, documentation, and data classification so that teams can work together without constant re-alignment.&lt;/p&gt;
&lt;p&gt;This becomes even more important in environments with sensitive data. Personally identifiable information (PII), financial records, and health data all come with legal and ethical obligations. Governance ensures these datasets are properly secured, audited, and retained only as long as necessary.&lt;/p&gt;
&lt;h2&gt;Tools and Practices&lt;/h2&gt;
&lt;p&gt;To manage metadata, lineage, and governance effectively, many organizations turn to dedicated platforms. Tools like Amundsen, DataHub, and Apache Atlas offer data cataloging and discovery features that make metadata more accessible and actionable.&lt;/p&gt;
&lt;p&gt;These platforms often integrate with processing engines and orchestration tools to automatically collect lineage. For example, if a pipeline built in Airflow or dbt modifies a dataset, the lineage graph is updated to reflect that change.&lt;/p&gt;
&lt;p&gt;But tools alone aren’t enough. Teams need practices that reinforce good habits - such as documenting changes, defining clear data ownership, and reviewing access permissions regularly.&lt;/p&gt;
&lt;p&gt;Automation can help, especially in dynamic environments where datasets are frequently added or updated. But governance must also be embedded into the culture. Engineers, analysts, and stakeholders all play a part in maintaining data integrity and clarity.&lt;/p&gt;
&lt;h2&gt;Bringing It All Together&lt;/h2&gt;
&lt;p&gt;Metadata, lineage, and governance are not isolated concerns. Together, they create a foundation for transparency and trust. They help organizations understand what data they have, how it’s being used, and whether it can be relied upon.&lt;/p&gt;
&lt;p&gt;Without this foundation, even the best-engineered pipelines can become liabilities. But with it, data becomes a strategic asset - one that teams can build on confidently, securely, and efficiently.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll explore how workflow orchestration ties these pieces together, enabling you to manage complex data pipelines reliably across diverse tools and systems.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Scheduling and Workflow Orchestration</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-12/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-12/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As data pipelines grow in complexity, managing them manually becomes unsustainable. Whether you&apos;re running daily ETL jobs, refreshing dashboards, or processing streaming data in micro-batches, you need a way to coordinate and monitor these tasks reliably. That’s where workflow orchestration comes in.&lt;/p&gt;
&lt;p&gt;In this post, we&apos;ll explore what orchestration means in the context of data engineering, how it differs from simple job scheduling, and what tools and design patterns help keep data workflows organized, observable, and resilient.&lt;/p&gt;
&lt;h2&gt;From Scheduling to Orchestration&lt;/h2&gt;
&lt;p&gt;At the simplest level, scheduling is about running tasks at a certain time. A cron job that triggers a Python script every morning is a form of scheduling. For small pipelines with few dependencies, this can be enough.&lt;/p&gt;
&lt;p&gt;But modern data systems rarely involve just one job. Instead, they include chains of tasks - data extractions, file transformations, validation checks, and loads into various targets. These tasks have dependencies, need error handling, and often require conditional logic. This is where orchestration becomes essential.&lt;/p&gt;
&lt;p&gt;Workflow orchestration is the discipline of managing task execution across a defined sequence, ensuring that tasks run in the correct order, on time, and with awareness of success or failure. It&apos;s not just about launching scripts - it&apos;s about understanding how those scripts relate to one another, how they behave under different conditions, and how to recover when something goes wrong.&lt;/p&gt;
&lt;h2&gt;Directed Acyclic Graphs (DAGs)&lt;/h2&gt;
&lt;p&gt;Most orchestration systems use the concept of a Directed Acyclic Graph (DAG) to represent workflows. In a DAG, each node represents a task, and edges represent dependencies. The &amp;quot;acyclic&amp;quot; part means there are no loops: each task runs once, and the flow moves in one direction.&lt;/p&gt;
&lt;p&gt;This structure allows you to define complex workflows declaratively. For example, you might define a pipeline where data is first extracted from an API, then validated, transformed, and finally loaded into a data warehouse. If any step fails, the system can stop the pipeline, alert the team, or retry the task based on configuration.&lt;/p&gt;
&lt;p&gt;DAGs also make it easier to track the status of each component. You can visualize which tasks succeeded, which are still running, and where failures occurred. This visibility is crucial for maintaining trust in your data pipelines.&lt;/p&gt;
&lt;h2&gt;Common Orchestration Tools&lt;/h2&gt;
&lt;p&gt;Several orchestration frameworks have become standard in the data engineering ecosystem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Apache Airflow&lt;/strong&gt; is one of the most widely adopted tools. It allows users to define DAGs using Python code, which makes it highly flexible and programmable. Airflow includes scheduling, retries, logging, and a web UI for monitoring workflows.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prefect&lt;/strong&gt; takes a modern approach by separating the orchestration layer from execution, which makes it more cloud-native and resilient to task failures. Prefect’s focus on observability and developer experience has made it popular for teams managing dynamic workloads.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dagster&lt;/strong&gt; emphasizes data assets and type safety. It treats data pipelines as modular, testable units and integrates tightly with modern tooling, including dbt and cloud environments.&lt;/p&gt;
&lt;p&gt;Each of these tools supports task dependencies, conditional logic, parallelism, and failure recovery. Choosing the right one often comes down to team preference, operational needs, and ecosystem compatibility.&lt;/p&gt;
&lt;h2&gt;Best Practices in Workflow Design&lt;/h2&gt;
&lt;p&gt;Designing orchestration workflows requires more than chaining tasks together. Robust pipelines include thoughtful handling of edge cases and clear observability. That means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using retries and timeouts to deal with flaky services or transient failures.&lt;/li&gt;
&lt;li&gt;Logging meaningful output so that issues can be diagnosed quickly.&lt;/li&gt;
&lt;li&gt;Isolating tasks so that a failure in one part doesn’t compromise unrelated workflows.&lt;/li&gt;
&lt;li&gt;Tagging or labeling jobs by function or owner to improve maintainability.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It also means thinking about idempotency. Tasks should be safe to rerun if needed. For example, a data load job that inserts duplicate rows each time it runs will cause problems if retried. Designing tasks to either overwrite cleanly or check for prior completion helps prevent these issues.&lt;/p&gt;
&lt;p&gt;Another key practice is modularity. Instead of building large monolithic DAGs, break workflows into reusable components. This makes it easier to test, maintain, and scale your pipelines as your data ecosystem evolves.&lt;/p&gt;
&lt;h2&gt;Observability and Alerting&lt;/h2&gt;
&lt;p&gt;A well-orchestrated pipeline doesn’t just run - it tells you how it’s running. Observability is about surfacing the right information at the right time so that engineers can respond to issues quickly.&lt;/p&gt;
&lt;p&gt;Good orchestration tools provide dashboards, logs, and metrics. But equally important are alerts that notify the right people when something goes wrong. Alerts should be actionable and avoid noise. A system that sends alerts on every minor warning will eventually be ignored.&lt;/p&gt;
&lt;p&gt;Integrating with monitoring platforms like Prometheus, Grafana, or external alerting tools like PagerDuty or Slack helps ensure that teams can respond to problems before they affect end users.&lt;/p&gt;
&lt;h2&gt;Orchestration as the Backbone&lt;/h2&gt;
&lt;p&gt;Workflow orchestration isn’t just a technical layer - it’s the backbone of operational data systems. It connects ingestion, transformation, validation, and delivery in a reliable and auditable way. When done well, it turns complex processes into predictable, repeatable workflows that teams can build on confidently.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll explore how to build scalable pipelines, including how to think about performance, parallelism, and distribution when dealing with large or fast-growing datasets.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Building Scalable Pipelines</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-13/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-13/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As data volumes increase and workflows grow more interconnected, the ability to build scalable data pipelines becomes essential. It&apos;s not enough for a pipeline to work - it needs to keep working as data grows from gigabytes to terabytes, as new sources are added, and as more users rely on the output for decision-making.&lt;/p&gt;
&lt;p&gt;In this post, we’ll explore what makes a pipeline scalable, the principles behind designing for growth, and the tools and patterns that data engineers use to manage complexity at scale.&lt;/p&gt;
&lt;h2&gt;What Do We Mean by Scalability?&lt;/h2&gt;
&lt;p&gt;Scalability is about more than just performance. It&apos;s the ability of a system to maintain its functionality and responsiveness as load increases. In the context of data pipelines, this means handling larger datasets, higher data velocity, and more frequent processing without constant reengineering.&lt;/p&gt;
&lt;p&gt;A scalable pipeline gracefully adapts to changes in data size, structure, and frequency. It’s designed in a modular way, so that bottlenecks can be addressed without rewriting the entire system. And it’s observable and maintainable, so issues can be diagnosed before they affect users.&lt;/p&gt;
&lt;p&gt;Scalability also involves cost efficiency. Throwing more resources at a slow pipeline might fix the symptoms, but a well-designed system scales intelligently, minimizing unnecessary computation and data movement.&lt;/p&gt;
&lt;h2&gt;Parallelism and Distribution&lt;/h2&gt;
&lt;p&gt;One of the core principles behind scalability is parallelism: the ability to split work into independent chunks that can be processed simultaneously.&lt;/p&gt;
&lt;p&gt;In batch workflows, this might mean partitioning data by date or region and processing each partition in parallel. In streaming systems, it means dividing incoming data into partitions or shards that are consumed by multiple workers.&lt;/p&gt;
&lt;p&gt;Distributed computing frameworks like Apache Spark, Flink, and Dask are designed with this in mind. They break down data into smaller units, distribute them across a cluster of machines, and execute tasks in parallel, tracking dependencies and ensuring consistency across the system.&lt;/p&gt;
&lt;p&gt;But parallelism introduces its own challenges. Data skew: when one partition is significantly larger than others, can lead to uneven workloads and poor performance. Effective partitioning strategies and thoughtful job configuration are key to maintaining balance.&lt;/p&gt;
&lt;h2&gt;Minimizing Data Movement&lt;/h2&gt;
&lt;p&gt;Another aspect of scalability is reducing how often and how far data moves. Every transfer across a network or system boundary adds latency, cost, and potential failure points.&lt;/p&gt;
&lt;p&gt;Where possible, pipelines should process data close to where it&apos;s stored. For example, using a query engine like Dremio or Presto to query data directly from object storage avoids the overhead of loading it into a warehouse first.&lt;/p&gt;
&lt;p&gt;Materializing only what’s needed, caching intermediate results, and pushing filters down into source systems are all ways to reduce unnecessary computation and movement.&lt;/p&gt;
&lt;p&gt;Streaming pipelines, in particular, benefit from minimizing state size and using windowed processing, so that each event is handled quickly and discarded once processed.&lt;/p&gt;
&lt;h2&gt;Managing Resources&lt;/h2&gt;
&lt;p&gt;Scalable pipelines require careful resource management. Compute, memory, and I/O all need to be provisioned in a way that meets demand without excessive overhead.&lt;/p&gt;
&lt;p&gt;Autoscaling, used in many cloud-native environments, allows processing clusters to grow and shrink based on workload. This is especially valuable for unpredictable or bursty workloads, where fixed infrastructure would either overrun or sit idle.&lt;/p&gt;
&lt;p&gt;Monitoring and alerting tools provide visibility into where resources are being used inefficiently. Long-running jobs, slow joins, or excessive data shuffles can all indicate areas where performance tuning is needed.&lt;/p&gt;
&lt;p&gt;Tuning batch sizes, controlling concurrency, and using backpressure mechanisms in streaming systems help maintain throughput without overloading infrastructure.&lt;/p&gt;
&lt;h2&gt;Designing for Change&lt;/h2&gt;
&lt;p&gt;Scalability isn’t just about today’s workload - it’s about tomorrow’s. Data pipelines should be designed to evolve.&lt;/p&gt;
&lt;p&gt;This means avoiding hard-coded assumptions about schema, partitions, or file sizes. It means using configuration over code where possible, and abstracting logic into reusable modules that can be adapted as requirements shift.&lt;/p&gt;
&lt;p&gt;Schema evolution support, metadata management, and data contracts between producers and consumers help ensure that changes can be made safely, without breaking downstream systems.&lt;/p&gt;
&lt;p&gt;Testing plays a big role here as well. Unit tests for transformations, integration tests for pipeline steps, and data quality checks all contribute to a system that can grow without becoming brittle.&lt;/p&gt;
&lt;h2&gt;Bringing It All Together&lt;/h2&gt;
&lt;p&gt;Scalable pipelines don’t happen by accident. They’re the result of intentional design choices that account for volume, velocity, and variability.&lt;/p&gt;
&lt;p&gt;By embracing parallelism, minimizing data movement, managing resources effectively, and planning for change, data engineers can build pipelines that not only meet today’s demands but are ready for tomorrow’s challenges.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll look at how DevOps principles apply to data engineering - covering CI/CD, infrastructure as code, and the tools that support reliable and automated data deployments.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | DevOps for Data Engineering</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-14/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-14/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As data systems grow more complex and interconnected, the principles of DevOps: long applied to software engineering, have become increasingly relevant to data engineering. Continuous integration, infrastructure as code, testing, and automation aren’t just for deploying apps anymore. They’re essential for delivering reliable, maintainable, and scalable data pipelines.&lt;/p&gt;
&lt;p&gt;In this post, we’ll explore how DevOps practices translate into the world of data engineering, why they matter, and what tools and techniques help bring them to life in modern data teams.&lt;/p&gt;
&lt;h2&gt;Bridging the Gap Between Code and Data&lt;/h2&gt;
&lt;p&gt;At the heart of DevOps is the idea that development and operations should be integrated. In traditional software development, this means automating the steps from writing code to running it in production. For data engineering, the challenge is similar - but the output isn&apos;t always a user-facing app. Instead, it&apos;s pipelines, transformations, and datasets that power reports, dashboards, and machine learning models.&lt;/p&gt;
&lt;p&gt;The core question becomes: how do we ensure that changes to data workflows are tested, deployed, and monitored with the same rigor as application code?&lt;/p&gt;
&lt;p&gt;The answer lies in adopting DevOps-inspired practices like version control, automated testing, continuous deployment, and infrastructure automation: all tailored to the specifics of data systems.&lt;/p&gt;
&lt;h2&gt;Version Control for Pipelines and Configurations&lt;/h2&gt;
&lt;p&gt;Just like in software engineering, all code that defines your data infrastructure: SQL queries, transformation logic, orchestration DAGs, and even schema definitions, should live in version-controlled repositories.&lt;/p&gt;
&lt;p&gt;This makes it easier to collaborate, review changes, and roll back when something breaks. Tools like Git, combined with platforms like GitHub or GitLab, provide the foundation. Branching strategies and pull requests help teams manage change in a structured, auditable way.&lt;/p&gt;
&lt;p&gt;Even configurations: such as data source definitions or schedule timings, can and should be versioned, ideally alongside the pipeline logic they support.&lt;/p&gt;
&lt;h2&gt;Continuous Integration and Testing&lt;/h2&gt;
&lt;p&gt;Data pipelines are code, and they should be tested like code. This includes unit tests for transformation logic, integration tests for full pipeline runs, and data quality checks that assert assumptions about the shape and content of your data.&lt;/p&gt;
&lt;p&gt;CI pipelines, powered by tools like GitHub Actions, GitLab CI, or Jenkins, can run these tests automatically on each commit or pull request. They ensure that changes don’t break existing functionality or introduce regressions.&lt;/p&gt;
&lt;p&gt;Testing data workflows is more nuanced than testing application logic. It often involves staging environments with synthetic or sample data, mocking external dependencies, and verifying outputs across time windows. But the goal is the same: catch problems early, not after they hit production.&lt;/p&gt;
&lt;h2&gt;Infrastructure as Code&lt;/h2&gt;
&lt;p&gt;Managing infrastructure manually: whether it’s a Spark cluster, an Airflow deployment, or a cloud storage bucket, doesn’t scale. Infrastructure as code (IaC) provides a way to define your environment in declarative files that can be versioned, reviewed, and deployed automatically.&lt;/p&gt;
&lt;p&gt;Tools like Terraform, Pulumi, and CloudFormation allow data teams to define compute resources, networking, permissions, and even pipeline configurations as code. Combined with CI/CD, IaC enables repeatable deployments, easier disaster recovery, and consistent environments across dev, staging, and production.&lt;/p&gt;
&lt;p&gt;IaC also helps in tracking infrastructure changes over time. When something breaks, you can look at the exact commit that introduced the change - not just guess what might have gone wrong.&lt;/p&gt;
&lt;h2&gt;Continuous Deployment for Pipelines&lt;/h2&gt;
&lt;p&gt;Once code is tested and approved, it needs to be deployed. Continuous deployment automates this step, pushing new pipeline definitions or transformation logic into production systems with minimal manual intervention.&lt;/p&gt;
&lt;p&gt;In practice, this might mean updating DAGs in Airflow, deploying dbt models, or rolling out new configurations to a Kafka stream processor. The process should include validation steps, such as verifying schema compatibility or testing data output in a sandbox environment before it goes live.&lt;/p&gt;
&lt;p&gt;Feature flags and gradual rollouts: techniques borrowed from application development, can also be applied to data. They allow teams to test changes on a subset of data or users before promoting them system-wide.&lt;/p&gt;
&lt;h2&gt;Monitoring and Incident Response&lt;/h2&gt;
&lt;p&gt;Finally, DevOps emphasizes the importance of monitoring and observability. Data pipelines need the same treatment. Logs, metrics, and alerts should provide insight into pipeline health, performance, and failures.&lt;/p&gt;
&lt;p&gt;Tools like Prometheus, Grafana, and cloud-native observability platforms can be integrated with orchestration tools to expose runtime metrics. Custom dashboards can show pipeline durations, success rates, and error counts. Alerts can notify teams when jobs fail or when output data violates expectations.&lt;/p&gt;
&lt;p&gt;Just as importantly, incidents should feed back into improvement. Postmortems, runbooks, and blameless retrospectives help teams learn from failures and evolve their systems.&lt;/p&gt;
&lt;h2&gt;Shifting the Culture&lt;/h2&gt;
&lt;p&gt;Adopting DevOps for data engineering is as much about culture as it is about tools. It means treating data workflows with the same discipline as software systems - building, testing, deploying, and monitoring them in automated, repeatable ways.&lt;/p&gt;
&lt;p&gt;This cultural shift leads to faster iterations, fewer outages, and more confidence in the data products that teams rely on. It also reduces the operational load on engineers, freeing them to focus on value creation instead of firefighting.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll step back and look at the cloud ecosystem that underpins much of this work. Understanding the role of managed services and cloud-native tools is key to building a modern, agile data platform.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Cloud Data Platforms and the Modern Stack</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-15/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-15/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The cloud has transformed how organizations approach data engineering. What once required physical servers, manual provisioning, and heavyweight infrastructure can now be spun up in minutes with managed, scalable services. But with this convenience comes complexity - deciding how to compose the right mix of tools and platforms for your data workflows.&lt;/p&gt;
&lt;p&gt;In this post, we’ll explore what defines the modern data stack, how cloud platforms like AWS, GCP, and Azure fit into the picture, and what principles guide the design of flexible, cloud-native data architectures.&lt;/p&gt;
&lt;h2&gt;Moving Beyond On-Premise&lt;/h2&gt;
&lt;p&gt;In traditional, on-premise data systems, teams had to manage everything themselves - hardware, networking, databases, storage, and backups. Scaling required buying more servers. Upgrades were slow, and experimentation was costly.&lt;/p&gt;
&lt;p&gt;Cloud platforms shifted this model. Infrastructure became elastic. Managed services replaced self-hosted databases and batch processing engines. What used to take weeks could now be done in hours. This shift enabled data engineers to focus more on business logic and less on infrastructure maintenance.&lt;/p&gt;
&lt;p&gt;But while the cloud solved many problems, it also introduced new decisions. With so many tools available, how do you choose the right combination? That’s where the concept of the modern data stack comes in.&lt;/p&gt;
&lt;h2&gt;What Is the Modern Data Stack?&lt;/h2&gt;
&lt;p&gt;The modern data stack refers to a collection of tools: often cloud-native, that work together to support the full data lifecycle: ingestion, transformation, storage, orchestration, and analysis.&lt;/p&gt;
&lt;p&gt;Typically, this stack includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A cloud data warehouse like Snowflake, BigQuery, or Redshift&lt;/li&gt;
&lt;li&gt;An ingestion tool such as Fivetran, Airbyte, or custom streaming connectors&lt;/li&gt;
&lt;li&gt;A transformation framework like dbt&lt;/li&gt;
&lt;li&gt;An orchestration platform like Airflow or Prefect&lt;/li&gt;
&lt;li&gt;BI tools such as Looker, Mode, or Tableau&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These tools are designed to be modular and API-driven. You can swap components as your needs evolve, without having to rebuild the entire system. They also tend to embrace SQL, making them accessible to a broader range of users, including analysts and analytics engineers.&lt;/p&gt;
&lt;p&gt;This composability is powerful, but it requires thoughtful integration. Data engineers must understand how data flows across services, how metadata is preserved, and where bottlenecks can emerge.&lt;/p&gt;
&lt;h2&gt;Managed Services in the Cloud&lt;/h2&gt;
&lt;p&gt;Each major cloud provider offers a suite of services tailored to data engineering.&lt;/p&gt;
&lt;p&gt;On &lt;strong&gt;AWS&lt;/strong&gt;, services like S3 (storage), Glue (ETL), Redshift (warehousing), and Kinesis (streaming) form the core building blocks. AWS is known for its breadth and flexibility, making it a strong choice for teams that want control and are comfortable managing complexity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Google Cloud Platform (GCP)&lt;/strong&gt; centers around BigQuery, a serverless, high-performance data warehouse. Paired with Dataflow (streaming and batch processing), Pub/Sub (messaging), and Looker (BI), GCP offers a tight integration between services with a focus on simplicity and scale.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Microsoft Azure&lt;/strong&gt; provides tools like Synapse Analytics, Data Factory, and Event Hubs. It often appeals to enterprise environments already invested in Microsoft’s ecosystem, offering deep integration with Active Directory, Power BI, and other services.&lt;/p&gt;
&lt;p&gt;Each platform brings its own pricing models, performance characteristics, and operational trade-offs. Choosing one often comes down to organizational context - existing infrastructure, skillsets, and vendor relationships.&lt;/p&gt;
&lt;h2&gt;Designing for Agility&lt;/h2&gt;
&lt;p&gt;A key advantage of the cloud is its ability to support experimentation. You can test new tools, build proof-of-concepts, and iterate quickly without long procurement cycles or sunk infrastructure costs.&lt;/p&gt;
&lt;p&gt;This agility enables teams to build for today while planning for tomorrow. For example, a team might start with batch ingestion and transformation using dbt and Airflow. As data needs grow, they can add streaming layers with Kafka and Spark, or move toward a lakehouse architecture using Iceberg and Dremio.&lt;/p&gt;
&lt;p&gt;To design for agility, it’s important to decouple systems where possible. Avoid hard-wiring logic across tools. Use metadata and configuration layers to manage pipeline logic. Embrace standards like Parquet or Arrow to ensure interoperability between tools.&lt;/p&gt;
&lt;p&gt;Observability and governance also become more important in a distributed cloud environment. Knowing where your data is, how it’s being used, and who has access requires integrated monitoring, logging, and metadata management.&lt;/p&gt;
&lt;h2&gt;The Cloud is Not Just a Hosting Model&lt;/h2&gt;
&lt;p&gt;Adopting cloud data platforms is not just about moving infrastructure off-premise - it’s about rethinking how teams operate. Cloud-native architectures prioritize scalability, flexibility, and automation.&lt;/p&gt;
&lt;p&gt;They allow you to treat data as a product, with well-defined interfaces, quality guarantees, and ownership. They enable collaboration across roles: engineers, analysts, and scientists, by providing shared platforms and standardized workflows.&lt;/p&gt;
&lt;p&gt;Ultimately, the modern data stack is not a fixed set of tools, but a mindset. It&apos;s about building systems that are composable, observable, and adaptable. It’s about enabling fast iteration without sacrificing reliability.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll shift into the final phase of this series and explore the evolution toward data lakehouse architectures - what they are, why they matter, and how they unify the best of both lakes and warehouses.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Data Lakehouse Architecture Explained</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-16/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-16/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Data lakes and data warehouses each brought strengths and limitations to the way organizations manage analytics. Lakes offered flexibility and scale, but lacked consistency and performance. Warehouses delivered speed and structure, but often at the cost of rigidity and duplication. The data lakehouse aims to unify the best of both worlds.&lt;/p&gt;
&lt;p&gt;In this post, we’ll explore what a data lakehouse is, how it differs from its predecessors, and why it represents a fundamental shift in modern data architecture.&lt;/p&gt;
&lt;h2&gt;The Problem with Separate Systems&lt;/h2&gt;
&lt;p&gt;Historically, data teams maintained two separate systems: a data lake for raw, large-scale data and a warehouse for clean, curated analytics. This split introduced a number of challenges.&lt;/p&gt;
&lt;p&gt;Data had to be copied and transformed between systems. Pipelines became complex and brittle, often requiring multiple processing steps to move data from lake storage into a format usable by the warehouse. Governance and metadata management were fragmented. And teams ended up managing duplicate logic in two places, increasing both cost and risk.&lt;/p&gt;
&lt;p&gt;This led to a common problem: organizations had access to a lot of data, but not in a way that was fully consistent, trustworthy, or timely.&lt;/p&gt;
&lt;h2&gt;What is a Lakehouse?&lt;/h2&gt;
&lt;p&gt;A lakehouse is a single data architecture that combines the scalability and cost-efficiency of a data lake with the data management features of a warehouse. Instead of maintaining separate systems for raw and curated data, a lakehouse enables you to store all data in one place: typically an object store like S3 or ADLS, while layering in transactional guarantees, schema enforcement, and performance optimizations.&lt;/p&gt;
&lt;p&gt;The core idea is to treat the lake as the foundation, and then build capabilities on top that make it feel like a warehouse: support for SQL queries, fine-grained access controls, data versioning, and support for BI tools.&lt;/p&gt;
&lt;p&gt;With a lakehouse, you can ingest raw data, apply transformations, and serve both data scientists and business analysts from the same platform - without having to move or duplicate data between systems.&lt;/p&gt;
&lt;h2&gt;Key Capabilities&lt;/h2&gt;
&lt;p&gt;A few innovations make the lakehouse model possible:&lt;/p&gt;
&lt;p&gt;First, &lt;strong&gt;table formats&lt;/strong&gt; like Apache Iceberg and Delta Lake introduce ACID transactions to files stored in data lakes. This means you can safely update, insert, and delete records with consistency, even across distributed systems.&lt;/p&gt;
&lt;p&gt;Second, &lt;strong&gt;query engines&lt;/strong&gt; like Dremio, Trino, and Starburst have matured to the point where they can run fast, complex SQL queries directly against files in the lake - especially when using efficient columnar formats like Parquet.&lt;/p&gt;
&lt;p&gt;Third, metadata and cataloging layers have improved, enabling better schema management, lineage tracking, and discovery across lakehouse tables.&lt;/p&gt;
&lt;p&gt;Together, these advancements bridge the gap between raw storage and structured analytics, making it possible to build a cohesive data platform without compromise.&lt;/p&gt;
&lt;h2&gt;Benefits of the Lakehouse Approach&lt;/h2&gt;
&lt;p&gt;One of the most compelling benefits of a lakehouse is &lt;strong&gt;simplification&lt;/strong&gt;. Instead of building multiple pipelines to synchronize data between systems, teams can work from a single source of truth. This reduces latency, lowers operational complexity, and improves data consistency.&lt;/p&gt;
&lt;p&gt;Lakehouses are also &lt;strong&gt;cost-effective&lt;/strong&gt;. Object storage is cheaper and more scalable than traditional databases. And by avoiding the need to load data into separate warehouses, you eliminate redundant storage and computation.&lt;/p&gt;
&lt;p&gt;From a flexibility standpoint, the lakehouse supports a wide range of use cases: from batch analytics to interactive SQL to machine learning, all from the same underlying data.&lt;/p&gt;
&lt;p&gt;Importantly, the lakehouse model supports &lt;strong&gt;open standards&lt;/strong&gt;. With formats like Iceberg, you’re not locked into a single vendor’s ecosystem. Your data remains portable, and you can build your stack using best-of-breed components.&lt;/p&gt;
&lt;h2&gt;A New Foundation for the Future&lt;/h2&gt;
&lt;p&gt;The data lakehouse is more than a marketing term - it represents a practical response to the needs of modern data teams. As data volumes continue to grow, and as organizations seek faster, more reliable insights, the need for unified, scalable architectures becomes clear.&lt;/p&gt;
&lt;p&gt;By combining the raw power of data lakes with the structure and performance of data warehouses, the lakehouse offers a way to do more with less - less duplication, less movement, and less friction.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll dig deeper into the technologies that make the lakehouse possible, starting with Apache Iceberg, Apache Arrow, and Apache Polaris. These tools form the foundation of many modern analytic platforms and help bring the lakehouse vision to life.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Apache Iceberg, Arrow, and Polaris</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-17/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-17/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As the data lakehouse ecosystem matures, new technologies are emerging to close the gap between raw, scalable storage and the structured, governed world of traditional analytics. Apache Iceberg, Apache Arrow, and Apache Polaris are three such technologies: each playing a distinct role in enabling high-performance, cloud-native data platforms that prioritize openness, flexibility, and consistency.&lt;/p&gt;
&lt;p&gt;In this post, we’ll explore what each of these technologies brings to the table and how they work together to power modern data workflows.&lt;/p&gt;
&lt;h2&gt;Apache Iceberg: The Table Format That Changes Everything&lt;/h2&gt;
&lt;p&gt;Apache Iceberg is more than just a file format - it’s a table format designed to bring SQL-like features to cloud object storage. In traditional data lakes, data is stored in files, but there’s no built-in concept of a table. This makes operations like updates, deletes, or time travel difficult to implement consistently.&lt;/p&gt;
&lt;p&gt;Iceberg solves that by introducing a transactional metadata layer. Tables are made up of snapshots, each pointing to a set of manifest files that describe the underlying data files. Every time data is written or updated, a new snapshot is created, and the metadata is atomically updated.&lt;/p&gt;
&lt;p&gt;This architecture enables reliable schema evolution, partition pruning, and time travel. It also supports concurrent writes across engines, making Iceberg a foundational layer for scalable, multi-engine data platforms.&lt;/p&gt;
&lt;p&gt;Importantly, Iceberg is engine-agnostic. Spark, Flink, Trino, Snowflake, and Dremio all support reading and writing to Iceberg tables, which allows data teams to avoid vendor lock-in and build modular systems.&lt;/p&gt;
&lt;h2&gt;Apache Arrow: A Universal Memory Format&lt;/h2&gt;
&lt;p&gt;If Iceberg handles data at rest, Apache Arrow handles data in motion. Arrow is a columnar in-memory format optimized for analytical processing. It allows systems to share data across process boundaries without serialization overhead, which dramatically reduces latency in data transfers.&lt;/p&gt;
&lt;p&gt;In practice, Arrow powers faster execution of queries, especially in environments where performance is critical. Engines like Dremio and frameworks like pandas or Apache Flight use Arrow to move data between components efficiently.&lt;/p&gt;
&lt;p&gt;Because Arrow defines a common representation for tabular data in memory, it allows tools built in different languages and frameworks to interoperate seamlessly. That’s a big deal in heterogeneous environments where Python, Java, and C++ may all play a role in the same workflow.&lt;/p&gt;
&lt;p&gt;Together, Iceberg and Arrow represent a powerful separation of concerns: Arrow optimizes processing in RAM, while Iceberg provides the transactional storage layer on disk.&lt;/p&gt;
&lt;h2&gt;Apache Polaris: The Missing Catalog Layer&lt;/h2&gt;
&lt;p&gt;As Iceberg adoption grows, managing Iceberg tables across distributed query engines becomes a challenge. That’s where Apache Polaris comes in.&lt;/p&gt;
&lt;p&gt;Polaris is an implementation of the Apache Iceberg REST catalog specification. It provides a centralized service for managing metadata about Iceberg tables and their organizational structure. Instead of having every engine implement its own catalog logic, Polaris provides a shared layer that orchestrates access across tools like Spark, Flink, Trino, and Snowflake.&lt;/p&gt;
&lt;p&gt;At the heart of Polaris is the concept of a &lt;strong&gt;catalog&lt;/strong&gt;: a logical container for Iceberg tables, configured to point to your cloud storage. Polaris supports both internal and external catalogs. Internal catalogs are fully managed within Polaris, while external catalogs sync with systems like Snowflake or Dremio Arctic. This flexibility lets you bring your existing Iceberg assets under centralized governance without locking them in.&lt;/p&gt;
&lt;p&gt;Polaris organizes tables into &lt;strong&gt;namespaces&lt;/strong&gt;, which are essentially folders within a catalog. These namespaces can be nested to reflect organizational or project hierarchies. Within a namespace, you register Iceberg tables, which can then be accessed by multiple engines through a consistent API.&lt;/p&gt;
&lt;p&gt;To connect to Polaris, engines use &lt;strong&gt;service principals&lt;/strong&gt; - authenticated entities with specific privileges. These principals are grouped into &lt;strong&gt;principal roles&lt;/strong&gt;, which receive access rights from &lt;strong&gt;catalog roles&lt;/strong&gt;. This role-based access control (RBAC) system allows for fine-grained security across catalogs, namespaces, and tables.&lt;/p&gt;
&lt;p&gt;What makes Polaris especially powerful is its ability to vend &lt;strong&gt;temporary credentials&lt;/strong&gt; during query execution. When a query runs, Polaris provides secure access to the underlying storage without exposing long-term cloud credentials. This mechanism, known as credential vending, ensures both security and operational flexibility.&lt;/p&gt;
&lt;h2&gt;A Unified Ecosystem&lt;/h2&gt;
&lt;p&gt;Together, Apache Iceberg, Arrow, and Polaris create a cohesive environment where data can be stored, processed, and accessed consistently and securely - regardless of the engine being used.&lt;/p&gt;
&lt;p&gt;Iceberg brings data warehouse-like capabilities to cloud storage. Arrow enables high-performance, memory-efficient processing across languages and systems. Polaris acts as the control plane, coordinating access and governance.&lt;/p&gt;
&lt;p&gt;This architecture aligns with the ideals of the data lakehouse: open standards, decoupled compute and storage, and interoperability across tools. By building on these technologies, organizations can future-proof their data platforms while empowering teams to work with the tools they prefer.&lt;/p&gt;
&lt;p&gt;In the next and final post in this series, we’ll look at Dremio: a platform that ties these components together to deliver interactive, self-service analytics directly on the data lake, without moving data or duplicating logic.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | The Power of Dremio in the Modern Lakehouse</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-18/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-18/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As organizations shift toward data lakehouse architectures, the question isn’t just how to store massive volumes of data - it’s how to optimize it for fast, reliable access without adding complexity or operational overhead. Dremio addresses this challenge head-on by combining performance, governance, and openness into a platform built natively on Apache Iceberg, Apache Arrow, and Apache Polaris.&lt;/p&gt;
&lt;p&gt;In this final post of our series, we’ll explore how Dremio ties together the technologies we&apos;ve discussed: like clustering, reflections, and cataloging, into an integrated solution for modern data engineering. We’ll cover what makes Dremio unique, how its latest innovations like Iceberg Clustering and Autonomous Reflections work, and why these capabilities are a breakthrough for data teams aiming to do more with less.&lt;/p&gt;
&lt;h2&gt;Built for the Modern Stack&lt;/h2&gt;
&lt;p&gt;Dremio isn&apos;t just a SQL engine - it’s a full data platform built for the lakehouse era. It operates directly on data stored in open formats like Parquet and Iceberg, using Apache Arrow for in-memory performance and Apache Polaris for metadata management and governance. The result is a platform that offers sub-second queries, native support for open standards, and a unified experience across ingestion, transformation, exploration, and security.&lt;/p&gt;
&lt;p&gt;Instead of requiring teams to move data into a proprietary warehouse, Dremio enables query federation across lakes, catalogs, and traditional databases. Whether your data lives in S3, GCS, Azure, or multiple warehouses, Dremio can connect, query, and govern it: all without duplication or data movement.&lt;/p&gt;
&lt;p&gt;But what truly sets Dremio apart is its focus on intelligent automation and data layout optimization. Let’s break down how these features work.&lt;/p&gt;
&lt;h2&gt;Iceberg Clustering: Smarter Data Organization&lt;/h2&gt;
&lt;p&gt;As datasets grow, traditional partitioning strategies fall short. Over-partitioning leads to a flood of small files. Under-partitioning causes massive scan overhead. Dremio introduces Iceberg Clustering to address this gap.&lt;/p&gt;
&lt;p&gt;Instead of dividing data into rigid partitions, clustering organizes rows based on column value proximity using Z-ordering, a type of space-filling curve. This technique braids together bits from multiple columns to form an index that preserves locality. The closer the index values, the closer the original rows were in value space, making it easier for the engine to skip irrelevant data.&lt;/p&gt;
&lt;p&gt;By clustering non-partitioned tables, Dremio can dramatically reduce the number of data files and row groups scanned during queries. The result: faster performance without the rigidity or complexity of traditional partitioning.&lt;/p&gt;
&lt;p&gt;This process is incremental and adaptive. Dremio monitors data file overlap (measured via clustering depth) and selectively rewrites files to restore efficient layout. You don’t have to re-cluster everything or worry about perfect partition granularity - Dremio handles it dynamically and intelligently.&lt;/p&gt;
&lt;h2&gt;Autonomous Reflections: AI for Query Optimization&lt;/h2&gt;
&lt;p&gt;Materialized views are great - until you have to decide which ones to create, maintain, and drop. Dremio automates this process with Autonomous Reflections, which monitor your workloads, identify performance bottlenecks, and generate pre-aggregated or pre-filtered views to accelerate queries.&lt;/p&gt;
&lt;p&gt;The system analyzes usage patterns and query plans, scores potential reflections based on estimated time savings, and creates only those that deliver meaningful impact. It even keeps them up to date using live metadata refresh and incremental updates, ensuring performance gains without sacrificing freshness.&lt;/p&gt;
&lt;p&gt;Reflections are created, scored, and dropped automatically based on cost-benefit analysis, with strict guardrails to avoid wasting resources. This isn’t just automation - it’s intelligent, usage-aware optimization.&lt;/p&gt;
&lt;p&gt;With Dremio’s Autonomous Reflections, query acceleration becomes invisible to the user. Queries run faster, dashboards load quicker, and teams no longer need to guess which workloads justify a materialized view. The platform adapts as your usage changes.&lt;/p&gt;
&lt;h2&gt;Governance and Discoverability with Polaris&lt;/h2&gt;
&lt;p&gt;Managing Iceberg tables at scale requires more than just metadata tracking - it requires unified governance. Dremio’s integration with Apache Polaris gives teams a central catalog that enforces access controls, tracks lineage, and supports multi-engine access through open REST protocols.&lt;/p&gt;
&lt;p&gt;Whether you’re using Spark, Trino, Flink, or Dremio itself, Polaris provides a consistent layer for managing catalogs, namespaces, and Iceberg tables. Service principals and RBAC ensure secure access, while credential vending allows query engines to read data without exposing cloud credentials.&lt;/p&gt;
&lt;p&gt;By offering a unified metastore for all your Iceberg assets, Polaris makes it easier to scale governance and integrate with diverse compute engines, all while maintaining data sovereignty and visibility.&lt;/p&gt;
&lt;h2&gt;AI-Ready Data, Out of the Box&lt;/h2&gt;
&lt;p&gt;As data volumes soar and AI workloads increase, organizations need data platforms that deliver speed and clarity - not maintenance overhead. Dremio’s new features don’t just optimize query performance; they also support AI and analytics with intelligent automation, semantic search, and unified metadata.&lt;/p&gt;
&lt;p&gt;AI-Enabled Semantic Search lets users discover datasets using plain language, not SQL. This reduces time spent hunting for data and accelerates exploration for analysts and data scientists alike. Combined with reflections and clustering, the platform ensures these queries return results fast.&lt;/p&gt;
&lt;p&gt;And because Dremio is built on open standards: Iceberg, Arrow, and Polaris, you can trust that your data architecture will remain portable, interoperable, and vendor-neutral.&lt;/p&gt;
&lt;h2&gt;Real-World Results&lt;/h2&gt;
&lt;p&gt;Dremio has already demonstrated the power of this approach internally. After deploying clustering and autonomous reflections across its own internal lakehouse, Dremio saw:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;80% of dashboards accelerated automatically&lt;/li&gt;
&lt;li&gt;10x reduction in 90th percentile query times&lt;/li&gt;
&lt;li&gt;30x improvement in CPU efficiency per query&lt;/li&gt;
&lt;li&gt;Substantial infrastructure savings by right-sizing compute resources&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These improvements weren’t the result of hand-tuning or custom engineering. They were achieved through intelligent automation - something every team can now access.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Data lakehouses offer unmatched flexibility, but performance and manageability have long remained pain points. With features like Iceberg Clustering, Autonomous Reflections, and Polaris Catalog, Dremio turns the lakehouse into a high-performance, governed, and self-optimizing platform.&lt;/p&gt;
&lt;p&gt;For data engineers, this means fewer manual interventions, faster time-to-insight, and greater confidence in how data is delivered. For analysts and AI teams, it means sub-second queries and easy access to the data they need: no pipeline delays, no tuning required.&lt;/p&gt;
&lt;p&gt;As the final stop in this series, Dremio represents the culmination of modern data engineering principles: openness, automation, and efficiency. If you&apos;re building on Iceberg and want to unlock its full potential, Dremio offers a platform designed not just to support your architecture, but to elevate it.&lt;/p&gt;
&lt;p&gt;To see it in action, try Dremio for free or explore the latest launch to learn how these capabilities can help your team build a faster, smarter lakehouse.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Journey from AI to LLMs and MCP - 10 - Sampling and Prompts in MCP  – Making Agent Workflows Smarter and Safer</title><link>https://iceberglakehouse.com/posts/2025-04-sampling-and-prompts-in-mcp/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-04-sampling-and-prompts-in-mcp/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2025-04-AI-Agents-MCP-10/).

## Free Res...</description><pubDate>Mon, 14 Apr 2025 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2025-04-AI-Agents-MCP-10/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We’ve now seen how the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; allows LLMs to read resources and call tools, giving them access to both data and action.&lt;/p&gt;
&lt;p&gt;But what if your &lt;strong&gt;MCP server&lt;/strong&gt; needs the LLM to make a decision?&lt;/p&gt;
&lt;p&gt;What if it needs to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Analyze a file before running a tool?&lt;/li&gt;
&lt;li&gt;Draft a message for approval?&lt;/li&gt;
&lt;li&gt;Ask the model to choose between options?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That’s where &lt;strong&gt;Sampling&lt;/strong&gt; comes in.&lt;/p&gt;
&lt;p&gt;And what if you want to give the user: or the LLM, reusable, structured prompt templates for common workflows?&lt;/p&gt;
&lt;p&gt;That’s where &lt;strong&gt;Prompts&lt;/strong&gt; come in.&lt;/p&gt;
&lt;p&gt;In this final post of the series, we’ll explore:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How &lt;strong&gt;sampling&lt;/strong&gt; allows servers to request completions from LLMs&lt;/li&gt;
&lt;li&gt;How &lt;strong&gt;prompts&lt;/strong&gt; enable reusable, guided AI interactions&lt;/li&gt;
&lt;li&gt;Best practices for both features&lt;/li&gt;
&lt;li&gt;Real-world use cases that combine everything we’ve covered so far&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What Is Sampling in MCP?&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Sampling&lt;/strong&gt; is the ability for an MCP server to ask the host to run an LLM completion - on behalf of a tool, prompt, or workflow.&lt;/p&gt;
&lt;p&gt;It lets your server say:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“Hey, LLM, here’s a prompt and some context. Please respond.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Why is this useful?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;You can &lt;strong&gt;generate intermediate reasoning steps&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Let the model &lt;strong&gt;propose actions&lt;/strong&gt; before executing them&lt;/li&gt;
&lt;li&gt;Create more natural &lt;strong&gt;multi-turn agent workflows&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Maintain human-in-the-loop &lt;strong&gt;approval and visibility&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Sampling Flow&lt;/h2&gt;
&lt;p&gt;Here’s the typical lifecycle:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The server sends a &lt;code&gt;sampling/createMessage&lt;/code&gt; request&lt;/li&gt;
&lt;li&gt;The host (Claude Desktop, etc.) can &lt;strong&gt;review or modify&lt;/strong&gt; the prompt&lt;/li&gt;
&lt;li&gt;The host runs the LLM completion&lt;/li&gt;
&lt;li&gt;The result is sent back to the server&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;p&gt;This architecture puts &lt;strong&gt;control and visibility in the hands of the user&lt;/strong&gt;, even when the agent logic runs server-side.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;✉️ Message Format&lt;/h2&gt;
&lt;p&gt;Here’s an example &lt;code&gt;sampling/createMessage&lt;/code&gt; request:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;messages&amp;quot;: [
    {
      &amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;,
      &amp;quot;content&amp;quot;: {
        &amp;quot;type&amp;quot;: &amp;quot;text&amp;quot;,
        &amp;quot;text&amp;quot;: &amp;quot;Please summarize this log file.&amp;quot;
      }
    }
  ],
  &amp;quot;systemPrompt&amp;quot;: &amp;quot;You are a helpful developer assistant.&amp;quot;,
  &amp;quot;includeContext&amp;quot;: &amp;quot;thisServer&amp;quot;,
  &amp;quot;maxTokens&amp;quot;: 300
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The host chooses which model to use, what context to include, and whether to show the prompt to the user for confirmation.&lt;/p&gt;
&lt;p&gt;Response:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;model&amp;quot;: &amp;quot;claude-3-sonnet&amp;quot;,
  &amp;quot;role&amp;quot;: &amp;quot;assistant&amp;quot;,
  &amp;quot;content&amp;quot;: {
    &amp;quot;type&amp;quot;: &amp;quot;text&amp;quot;,
    &amp;quot;text&amp;quot;: &amp;quot;The log file contains several timeout errors and warnings related to database connections.&amp;quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now the server can act on that response - log it, return it as tool output, or chain it into another step.&lt;/p&gt;
&lt;h3&gt;Best Practices for Sampling&lt;/h3&gt;
&lt;h4&gt;Best Practice Why It Matters&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Use clear system prompts Guides model behavior contextually&lt;/li&gt;
&lt;li&gt;Limit tokens Prevent runaway completions&lt;/li&gt;
&lt;li&gt;Structure responses Enables downstream parsing (e.g. JSON, bullets)&lt;/li&gt;
&lt;li&gt;Include only relevant context Keep prompts focused and cost-effective&lt;/li&gt;
&lt;li&gt;Respect user control The host mediates the actual LLM call&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;What Are Prompts in MCP?&lt;/h3&gt;
&lt;p&gt;Prompts are reusable, structured templates that servers can expose to clients.&lt;/p&gt;
&lt;p&gt;Think of them like slash commands or predefined workflows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Pre-filled with helpful defaults&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Accept arguments (e.g. &amp;quot;project name&amp;quot;, &amp;quot;file path&amp;quot;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Optionally include embedded resources&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Surface in the client UI&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Prompts help users and LLMs collaborate efficiently by standardizing useful tasks.&lt;/p&gt;
&lt;h3&gt;✨ Prompt Structure&lt;/h3&gt;
&lt;p&gt;Prompts have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A name (identifier)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A description (for discovery)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A list of arguments (optional)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A template for generating messages&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;name&amp;quot;: &amp;quot;explain-code&amp;quot;,
  &amp;quot;description&amp;quot;: &amp;quot;Explain how this code works&amp;quot;,
  &amp;quot;arguments&amp;quot;: [
    {
      &amp;quot;name&amp;quot;: &amp;quot;language&amp;quot;,
      &amp;quot;description&amp;quot;: &amp;quot;Programming language&amp;quot;,
      &amp;quot;required&amp;quot;: true
    },
    {
      &amp;quot;name&amp;quot;: &amp;quot;code&amp;quot;,
      &amp;quot;description&amp;quot;: &amp;quot;The code to analyze&amp;quot;,
      &amp;quot;required&amp;quot;: true
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Clients use:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;prompts/list&lt;/code&gt; to discover prompts&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;prompts/get&lt;/code&gt; to resolve a prompt and arguments into messages&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Dynamic Prompt Example&lt;/h3&gt;
&lt;p&gt;A server might expose:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;name&amp;quot;: &amp;quot;analyze-logs&amp;quot;,
  &amp;quot;description&amp;quot;: &amp;quot;Summarize recent logs and detect anomalies&amp;quot;,
  &amp;quot;arguments&amp;quot;: [
    {
      &amp;quot;name&amp;quot;: &amp;quot;timeframe&amp;quot;,
      &amp;quot;required&amp;quot;: true
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When the user (or LLM) runs it with:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;timeframe&amp;quot;: &amp;quot;1h&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The resolved prompt could include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A message like: &lt;code&gt;“Please summarize the following logs from the past hour.”&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;An embedded resource (e.g. &lt;code&gt;logs://recent?timeframe=1h&lt;/code&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Output ready for sampling&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Sampling + Prompts = Dynamic Workflows&lt;/h3&gt;
&lt;p&gt;When you combine prompts + sampling + tools, you unlock real agent behavior.&lt;/p&gt;
&lt;p&gt;Example Workflow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;User selects prompt: &amp;quot;Analyze logs and suggest next steps&amp;quot;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Server resolves the prompt and calls sampling/createMessage&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;LLM returns: “The logs show repeated auth failures. Suggest checking OAuth config.”&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Server calls tools/call to run check_auth_config&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;LLM reviews the result and writes a summary&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All controlled via:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Standardized MCP messages&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;User-visible approvals&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Modular server logic&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;🔐 Security and Control&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;How It&apos;s Handled&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prompt visibility&lt;/td&gt;
&lt;td&gt;Clients decide which prompts to expose&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sampling review&lt;/td&gt;
&lt;td&gt;Hosts can show/reject sampling requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Input validation&lt;/td&gt;
&lt;td&gt;Servers validate prompt arguments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model usage control&lt;/td&gt;
&lt;td&gt;Hosts select models and limit token costs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt injection risks&lt;/td&gt;
&lt;td&gt;Validate user inputs, escape content if needed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h3&gt;🧠 Why These Matter for AI Agents&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Sampling Provides&lt;/th&gt;
&lt;th&gt;Prompts Provide&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Decision-making&lt;/td&gt;
&lt;td&gt;Dynamic LLM completions&lt;/td&gt;
&lt;td&gt;Guided, structured input&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flexibility&lt;/td&gt;
&lt;td&gt;Server can request help anytime&lt;/td&gt;
&lt;td&gt;Users can run reusable workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interactivity&lt;/td&gt;
&lt;td&gt;Chain actions with feedback&lt;/td&gt;
&lt;td&gt;Improve LLM collaboration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Composability&lt;/td&gt;
&lt;td&gt;Mix prompts + tools + resources&lt;/td&gt;
&lt;td&gt;Enable custom interfaces&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h3&gt;🧩 Wrapping It All Together&lt;/h3&gt;
&lt;p&gt;Over this 10-part series, we’ve explored the full landscape of AI agent development using &lt;strong&gt;MCP&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;✅ LLMs and how they work&lt;br&gt;
✅ Fine-tuning, prompting, and RAG&lt;br&gt;
✅ Agent frameworks and limitations&lt;br&gt;
✅ MCP’s architecture and interoperability&lt;br&gt;
✅ Resources and tools&lt;br&gt;
✅ Prompts and sampling&lt;/p&gt;
&lt;p&gt;MCP gives us standardized, modular building blocks for creating AI agents that are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Portable across environments&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Decoupled from model providers&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secure, observable, and controlled&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Journey from AI to LLMs and MCP - 9 - Tools in MCP  – Giving LLMs the Power to Act</title><link>https://iceberglakehouse.com/posts/2025-04-tools-in-mcp/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-04-tools-in-mcp/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2025-04-AI-Agents-MCP-09/).

## Free Res...</description><pubDate>Sun, 13 Apr 2025 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2025-04-AI-Agents-MCP-09/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the previous post, we looked at &lt;strong&gt;Resources&lt;/strong&gt; in the Model Context Protocol (MCP): how LLMs can securely access real-world data to ground their understanding. But sometimes, &lt;em&gt;reading&lt;/em&gt; isn’t enough.&lt;/p&gt;
&lt;p&gt;Sometimes, you want the model to &lt;strong&gt;do something&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;That’s where &lt;strong&gt;Tools&lt;/strong&gt; in MCP come in.&lt;/p&gt;
&lt;p&gt;In this post, we’ll explore:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What tools are in MCP&lt;/li&gt;
&lt;li&gt;How tools are discovered and invoked&lt;/li&gt;
&lt;li&gt;How LLMs can use tools (with user control)&lt;/li&gt;
&lt;li&gt;Common tool patterns and security practices&lt;/li&gt;
&lt;li&gt;Real-world examples: from file system commands to API wrappers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let’s dive in.&lt;/p&gt;
&lt;h2&gt;What Are Tools in MCP?&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Tools&lt;/strong&gt; are executable functions that an LLM (or the user) can call via the MCP client. Unlike resources: which are passive data, &lt;strong&gt;tools are active operations&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Examples include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Running a shell command&lt;/li&gt;
&lt;li&gt;Calling a REST API&lt;/li&gt;
&lt;li&gt;Summarizing a document&lt;/li&gt;
&lt;li&gt;Posting a GitHub issue&lt;/li&gt;
&lt;li&gt;Triggering a build process&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each tool includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;name&lt;/strong&gt; (unique identifier)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;description&lt;/strong&gt; (for UI/model understanding)&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;input schema&lt;/strong&gt; (JSON schema describing expected parameters)&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;Tools allow models to interact with the world beyond natural language - under user oversight.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Discovering Tools&lt;/h2&gt;
&lt;p&gt;Clients can list available tools via:
&lt;code&gt;tools/list&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Example response:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;tools&amp;quot;: [
    {
      &amp;quot;name&amp;quot;: &amp;quot;calculate_sum&amp;quot;,
      &amp;quot;description&amp;quot;: &amp;quot;Add two numbers together&amp;quot;,
      &amp;quot;inputSchema&amp;quot;: {
        &amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,
        &amp;quot;properties&amp;quot;: {
          &amp;quot;a&amp;quot;: { &amp;quot;type&amp;quot;: &amp;quot;number&amp;quot; },
          &amp;quot;b&amp;quot;: { &amp;quot;type&amp;quot;: &amp;quot;number&amp;quot; }
        },
        &amp;quot;required&amp;quot;: [&amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;]
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This allows clients (and LLMs) to decide which tools are available and how to call them properly.&lt;/p&gt;
&lt;h2&gt;⚙️ Calling a Tool&lt;/h2&gt;
&lt;p&gt;To execute a tool, the client sends:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;tools/call
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this payload:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;name&amp;quot;: &amp;quot;calculate_sum&amp;quot;,
  &amp;quot;arguments&amp;quot;: {
    &amp;quot;a&amp;quot;: 3,
    &amp;quot;b&amp;quot;: 5
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The server responds with:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;content&amp;quot;: [
    {
      &amp;quot;type&amp;quot;: &amp;quot;text&amp;quot;,
      &amp;quot;text&amp;quot;: &amp;quot;8&amp;quot;
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That’s it! The LLM can now use this output in a multi-step reasoning chain.&lt;/p&gt;
&lt;h3&gt;Model-Controlled Tool Use&lt;/h3&gt;
&lt;p&gt;Tools are designed to be invoked by models automatically. The host mediates this interaction with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Approval flows (user-in-the-loop)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Permission gating&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Logging and auditing&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is what enables “agentic behavior.” For example:&lt;/p&gt;
&lt;p&gt;Claude sees a CSV file and decides to call analyze_csv to compute averages - without a user explicitly requesting it.&lt;/p&gt;
&lt;h3&gt;Tool Design Patterns&lt;/h3&gt;
&lt;p&gt;Let’s look at some common and powerful tool types:&lt;/p&gt;
&lt;h4&gt;System Tools&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;name&amp;quot;: &amp;quot;run_command&amp;quot;,
  &amp;quot;description&amp;quot;: &amp;quot;Execute a shell command&amp;quot;,
  &amp;quot;inputSchema&amp;quot;: {
    &amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,
    &amp;quot;properties&amp;quot;: {
      &amp;quot;command&amp;quot;: { &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot; },
      &amp;quot;args&amp;quot;: {
        &amp;quot;type&amp;quot;: &amp;quot;array&amp;quot;,
        &amp;quot;items&amp;quot;: { &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot; }
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Use case: Let the LLM grep a log file, or check system uptime.&lt;/p&gt;
&lt;h4&gt;API Integrations&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;name&amp;quot;: &amp;quot;create_github_issue&amp;quot;,
  &amp;quot;description&amp;quot;: &amp;quot;Open a new issue on GitHub&amp;quot;,
  &amp;quot;inputSchema&amp;quot;: {
    &amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,
    &amp;quot;properties&amp;quot;: {
      &amp;quot;repo&amp;quot;: { &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot; },
      &amp;quot;title&amp;quot;: { &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot; },
      &amp;quot;body&amp;quot;: { &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot; }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Use case: Let an AI dev assistant file bugs or suggest changes.&lt;/p&gt;
&lt;h4&gt;Data Analysis&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;name&amp;quot;: &amp;quot;summarize_csv&amp;quot;,
  &amp;quot;description&amp;quot;: &amp;quot;Summarize a CSV file&amp;quot;,
  &amp;quot;inputSchema&amp;quot;: {
    &amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,
    &amp;quot;properties&amp;quot;: {
      &amp;quot;filepath&amp;quot;: { &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot; }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Use case: Let the LLM analyze performance metrics or user data.&lt;/p&gt;
&lt;h4&gt;Security Best Practices&lt;/h4&gt;
&lt;p&gt;Giving LLMs the ability to take action means security is critical. Here’s how to stay safe:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Validate all input&lt;/strong&gt;
Use detailed JSON schemas&lt;/p&gt;
&lt;p&gt;Sanitize input (e.g., file paths, commands)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use access controls&lt;/strong&gt;
Gate sensitive tools behind roles&lt;/p&gt;
&lt;p&gt;Allow user opt-in or approval&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Log and monitor usage&lt;/strong&gt;
Track which tools are used, with what arguments&lt;/p&gt;
&lt;p&gt;Log errors and output for audit trails&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Handle errors gracefully&lt;/strong&gt;
Return structured errors inside the result, not just raw exceptions. This helps the LLM adapt.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;isError&amp;quot;: true,
  &amp;quot;content&amp;quot;: [
    {
      &amp;quot;type&amp;quot;: &amp;quot;text&amp;quot;,
      &amp;quot;text&amp;quot;: &amp;quot;Error: File not found.&amp;quot;
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Example: Implementing a Tool Server in Python&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;@mcp.tool()
async def get_weather(city: str) -&amp;gt; str:
    &amp;quot;&amp;quot;&amp;quot;Return current weather for a city.&amp;quot;&amp;quot;&amp;quot;
    data = await fetch_weather(city)
    return f&amp;quot;The temperature in {city} is {data[&apos;temp&apos;]}°C.&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This tool will automatically appear in the tools/list response and can be invoked by the LLM or user.&lt;/p&gt;
&lt;h3&gt;Why Tools Matter for Agents&lt;/h3&gt;
&lt;p&gt;Agents aren’t just chatbots - they&apos;re interactive systems. Tools give them the ability to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Take real-world actions&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Build dynamic workflows&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Chain reasoning across multiple steps&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Drive automation in safe, auditable ways&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Combined with resources, prompts, and sampling, tools make LLMs feel like collaborative assistants, not just text predictors.&lt;/p&gt;
&lt;h3&gt;Recap: Tools in MCP&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Concept Description&lt;/li&gt;
&lt;li&gt;Tool definition Name, description, and input schema&lt;/li&gt;
&lt;li&gt;Invocation tools/call with arguments&lt;/li&gt;
&lt;li&gt;Output Text or structured response&lt;/li&gt;
&lt;li&gt;Use case examples Shell commands, API calls, code generation, analysis&lt;/li&gt;
&lt;li&gt;Security guidelines Validate input, log usage, gate sensitive actions&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Coming Up Next: Sampling and Prompts : Letting the Server Ask the Model for Help&lt;/h3&gt;
&lt;p&gt;In the final two posts of this series, we’ll explore:&lt;/p&gt;
&lt;p&gt;✅ Sampling : How servers can request completions from the LLM during workflows
✅ Prompts : Reusable templates for user-driven or model-driven actions&lt;/p&gt;
&lt;p&gt;Tools give LLMs the power to act. With proper controls and schemas, they become safe, composable building blocks for real-world automation.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Journey from AI to LLMs and MCP - 8 - Resources in MCP  – Serving Relevant Data Securely to LLMs</title><link>https://iceberglakehouse.com/posts/2025-04-resources-in-mcp/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-04-resources-in-mcp/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2025-04-AI-Agents-MCP-08/).

## Free Res...</description><pubDate>Sat, 12 Apr 2025 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2025-04-AI-Agents-MCP-08/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the previous post, we explored the architecture of the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt;: a flexible, standardized way to connect LLMs to tools, data, and workflows. One of MCP’s most powerful capabilities is its ability to expose &lt;strong&gt;resources&lt;/strong&gt; to language models in a structured, secure, and controllable way.&lt;/p&gt;
&lt;p&gt;In this post, we’ll dive into:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What MCP resources are&lt;/li&gt;
&lt;li&gt;How they’re discovered and accessed&lt;/li&gt;
&lt;li&gt;Text vs binary resources&lt;/li&gt;
&lt;li&gt;Dynamic templates and subscriptions&lt;/li&gt;
&lt;li&gt;Best practices for implementation and security&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you want to give LLMs real, relevant context from your systems: without compromising safety or control, &lt;strong&gt;resources&lt;/strong&gt; are the foundation.&lt;/p&gt;
&lt;h2&gt;What Are Resources in MCP?&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt; represent data that a model or client can read.&lt;/p&gt;
&lt;p&gt;This might include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Local files (e.g. &lt;code&gt;file:///logs/server.log&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Database records (e.g. &lt;code&gt;postgres://db/customers&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Web content (e.g. &lt;code&gt;https://api.example.com/data&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Images or screenshots (e.g. &lt;code&gt;screen://localhost/monitor1&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Structured system data (e.g. logs, metrics, config files)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each resource is identified by a &lt;strong&gt;URI&lt;/strong&gt;, and can be &lt;strong&gt;read&lt;/strong&gt;, &lt;strong&gt;discovered&lt;/strong&gt;, and optionally &lt;strong&gt;subscribed to&lt;/strong&gt; for updates.&lt;/p&gt;
&lt;h2&gt;Resource Discovery&lt;/h2&gt;
&lt;p&gt;Clients can ask a server to list available resources using:
&lt;code&gt;resources/list&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;The server responds with an array of structured metadata:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;resources&amp;quot;: [
    {
      &amp;quot;uri&amp;quot;: &amp;quot;file:///logs/app.log&amp;quot;,
      &amp;quot;name&amp;quot;: &amp;quot;Application Logs&amp;quot;,
      &amp;quot;description&amp;quot;: &amp;quot;Recent server logs&amp;quot;,
      &amp;quot;mimeType&amp;quot;: &amp;quot;text/plain&amp;quot;
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Clients (or users) can browse these like a menu, selecting what context to send to the model.&lt;/p&gt;
&lt;h3&gt;Resource Templates&lt;/h3&gt;
&lt;p&gt;In addition to static lists, servers can expose URI templates using RFC 6570 syntax:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;uriTemplate&amp;quot;: &amp;quot;file:///logs/{date}.log&amp;quot;,
  &amp;quot;name&amp;quot;: &amp;quot;Log by Date&amp;quot;,
  &amp;quot;description&amp;quot;: &amp;quot;Access logs by date (e.g., 2024-04-01)&amp;quot;,
  &amp;quot;mimeType&amp;quot;: &amp;quot;text/plain&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This allows dynamic access to parameterized content - great for APIs, time-based logs, or file hierarchies.&lt;/p&gt;
&lt;h3&gt;Reading a Resource&lt;/h3&gt;
&lt;p&gt;To retrieve the content of a resource, clients use:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;resources/read&lt;/code&gt; With a payload like:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;uri&amp;quot;: &amp;quot;file:///logs/app.log&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The server responds with the content in one of two formats:&lt;/p&gt;
&lt;h4&gt;Text Resource&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;contents&amp;quot;: [
    {
      &amp;quot;uri&amp;quot;: &amp;quot;file:///logs/app.log&amp;quot;,
      &amp;quot;mimeType&amp;quot;: &amp;quot;text/plain&amp;quot;,
      &amp;quot;text&amp;quot;: &amp;quot;Error: Timeout on request...\n&amp;quot;
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Binary Resource (e.g. image, PDF)&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;contents&amp;quot;: [
    {
      &amp;quot;uri&amp;quot;: &amp;quot;screen://localhost/display1&amp;quot;,
      &amp;quot;mimeType&amp;quot;: &amp;quot;image/png&amp;quot;,
      &amp;quot;blob&amp;quot;: &amp;quot;iVBORw0KGgoAAAANSUhEUgAAA...&amp;quot;
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Clients can choose how and when to inject these into the model’s prompt, depending on MIME type and length.&lt;/p&gt;
&lt;h3&gt;Real-Time Updates&lt;/h3&gt;
&lt;p&gt;Resources aren’t static - they can change. MCP supports subscriptions to keep context fresh.&lt;/p&gt;
&lt;h4&gt;List Updates&lt;/h4&gt;
&lt;p&gt;If the list of resources changes, the server can notify the client with:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;notifications/resources/list_changed
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Useful when new logs, files, or endpoints become available.&lt;/p&gt;
&lt;h4&gt;Content Updates&lt;/h4&gt;
&lt;p&gt;Clients can subscribe to specific resource URIs:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;resources/subscribe
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When the resource changes, the server sends:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;notifications/resources/updated
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is ideal for live logs, dashboards, or real-time documents.&lt;/p&gt;
&lt;h3&gt;Security Best Practices&lt;/h3&gt;
&lt;p&gt;Exposing resources to models requires careful control. MCP includes flexible patterns for securing access:&lt;/p&gt;
&lt;h4&gt;Best Practices for Server Developers&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Validate all URIs: No open file reads!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Whitelist paths or endpoints for file access&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use descriptive names and MIME types to help clients filter content&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Provide helpful descriptions for the LLM and user&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Support URI templates for scalable access&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Audit access and subscriptions&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Avoid leaking secrets in content or metadata&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example: Safe Log Server&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-ts&quot;&gt;server.setRequestHandler(ListResourcesRequestSchema, async () =&amp;gt; {
  return {
    resources: [
      {
        uri: &amp;quot;file:///logs/app.log&amp;quot;,
        name: &amp;quot;App Logs&amp;quot;,
        mimeType: &amp;quot;text/plain&amp;quot;,
      },
    ],
  };
});

server.setRequestHandler(ReadResourceRequestSchema, async request =&amp;gt; {
  const uri = request.params.uri;

  if (!uri.startsWith(&amp;quot;file:///logs/&amp;quot;)) {
    throw new Error(&amp;quot;Access denied&amp;quot;);
  }

  const content = await readFile(uri); // Add sanitization here
  return {
    contents: [
      {
        uri,
        mimeType: &amp;quot;text/plain&amp;quot;,
        text: content,
      },
    ],
  };
});
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Why Resources Matter for AI Agents&lt;/h3&gt;
&lt;p&gt;LLMs are context-hungry. They reason better when they have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Real-time logs&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Source code&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;System metrics&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;API responses&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By serving these as resources, MCP gives agents the data they need - on demand, with full user control, and without bloating prompt templates.&lt;/p&gt;
&lt;h3&gt;Recap: Resources at a Glance&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Feature Description&lt;/li&gt;
&lt;li&gt;URI-based identifiers Unique path to each piece of content&lt;/li&gt;
&lt;li&gt;Text &amp;amp; binary support Suitable for logs, images, PDFs, etc.&lt;/li&gt;
&lt;li&gt;Dynamic templates Construct URIs on the fly&lt;/li&gt;
&lt;li&gt;Real-time updates Subscriptions for changing content&lt;/li&gt;
&lt;li&gt;Secure access patterns URI validation, MIME filtering, whitelisting&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Coming Up Next: Tools in MCP : Giving LLMs the Power to Act&lt;/h3&gt;
&lt;p&gt;So far, we’ve shown how MCP feeds models with data. But what if we want the model to take action?&lt;/p&gt;
&lt;p&gt;In the next post, we’ll explore tools in MCP:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;How LLMs call functions safely&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Tool schemas and invocation patterns&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Real-world examples: shell commands, API calls, and more&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Journey from AI to LLMs and MCP - 7 - Under the Hood  – The Architecture of MCP and Its Core Components</title><link>https://iceberglakehouse.com/posts/2025-04-under-the-hood-of-mcp/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-04-under-the-hood-of-mcp/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2025-04-AI-Agents-MCP-07/).

# A Journey...</description><pubDate>Fri, 11 Apr 2025 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2025-04-AI-Agents-MCP-07/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1&gt;A Journey from AI to LLMs and MCP - 7 - Under the Hood : The Architecture of MCP and Its Core Components&lt;/h1&gt;
&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In our last post, we introduced the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; as a standard way to connect AI models and agents to tools, data, and workflows - much like how the Apache Iceberg REST protocol brings interoperability to data engines.&lt;/p&gt;
&lt;p&gt;Now it’s time to open the black box.&lt;/p&gt;
&lt;p&gt;In this post, we’ll break down:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The architecture of MCP&lt;/li&gt;
&lt;li&gt;The responsibilities of hosts, clients, and servers&lt;/li&gt;
&lt;li&gt;The message lifecycle and transport layers&lt;/li&gt;
&lt;li&gt;How tools, resources, and prompts plug into the system&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By the end, you’ll understand &lt;strong&gt;how MCP enables secure, modular communication between LLMs and the systems they need to work with.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;Big Picture: How MCP Fits Together&lt;/h2&gt;
&lt;p&gt;MCP follows a &lt;strong&gt;client-server architecture&lt;/strong&gt; that enables many-to-many connections between models and systems.&lt;/p&gt;
&lt;p&gt;Here’s the high-level setup:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;+------------------------+      +--------------------+
|    Claude Desktop      |      |      Web IDE       |
| (Host + MCP Client)    |      | (Host + MCP Client)|
+------------------------+      +--------------------+
             |                         |
             |     MCP Protocol        |
             |                         |
             v                         v
+------------------------+    +---------------------------+
|   Local Tool Server    |    |     Cloud API Server      |
| (Exposes tools/resources)|  | (Exposes prompts/tools)   |
+------------------------+    +---------------------------+

&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each &lt;strong&gt;host&lt;/strong&gt; runs one or more &lt;strong&gt;clients&lt;/strong&gt;, which connect to independent &lt;strong&gt;MCP servers&lt;/strong&gt; exposing functionality in a standardized format.&lt;/p&gt;
&lt;h2&gt;Key Concepts&lt;/h2&gt;
&lt;p&gt;Let’s look at the core components that make this work.&lt;/p&gt;
&lt;h3&gt;1. Hosts&lt;/h3&gt;
&lt;p&gt;Hosts are the applications that run the LLM (e.g. Claude Desktop, VS Code extension, custom browser app). They manage:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The model interaction (LLM prompts and completions)&lt;/li&gt;
&lt;li&gt;UI and user input&lt;/li&gt;
&lt;li&gt;A registry of connected clients&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;A host might display tools in a sidebar, allow users to pick files (resources), or visualize prompts in a command palette.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;2. Clients&lt;/h3&gt;
&lt;p&gt;An &lt;strong&gt;MCP client&lt;/strong&gt; lives inside a host and connects to a single MCP server. It handles:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Transport layer setup (e.g. stdio or HTTP/SSE)&lt;/li&gt;
&lt;li&gt;Message exchange (requests, notifications, etc.)&lt;/li&gt;
&lt;li&gt;Proxying server capabilities to the host/model&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each client maintains a &lt;strong&gt;1:1 connection with one server&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;3. Servers&lt;/h3&gt;
&lt;p&gt;Servers expose real-world capabilities using the MCP spec. They can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Serve &lt;strong&gt;resources&lt;/strong&gt; (files, logs, database records)&lt;/li&gt;
&lt;li&gt;Define and execute &lt;strong&gt;tools&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Offer reusable &lt;strong&gt;prompts&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Request &lt;strong&gt;sampling&lt;/strong&gt; (LLM completions)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Servers can run locally (e.g. on your machine) or remotely (e.g. in a cloud API gateway), and can be implemented in any language (Python, TypeScript, C#, etc.).&lt;/p&gt;
&lt;h2&gt;Message Lifecycle in MCP&lt;/h2&gt;
&lt;p&gt;MCP uses a &lt;strong&gt;JSON-RPC 2.0 message format&lt;/strong&gt; to communicate between clients and servers. All communication flows through a structured lifecycle:&lt;/p&gt;
&lt;h3&gt;1. Initialization&lt;/h3&gt;
&lt;p&gt;Before communication starts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Client sends an &lt;code&gt;initialize&lt;/code&gt; request&lt;/li&gt;
&lt;li&gt;Server responds with capabilities&lt;/li&gt;
&lt;li&gt;Client sends an &lt;code&gt;initialized&lt;/code&gt; notification&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;This sets up feature negotiation and version compatibility.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;2. Message Types&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Request&lt;/td&gt;
&lt;td&gt;A message expecting a response (e.g. &lt;code&gt;tools/call&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Response&lt;/td&gt;
&lt;td&gt;Result from a request (e.g. tool output)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Notification&lt;/td&gt;
&lt;td&gt;One-way message with no response expected&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error&lt;/td&gt;
&lt;td&gt;Sent when a request fails or is invalid&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Each message is wrapped in a &lt;strong&gt;transport layer&lt;/strong&gt; (more on that next).&lt;/p&gt;
&lt;h2&gt;Transport Layer : How Messages Move&lt;/h2&gt;
&lt;p&gt;MCP supports multiple transport mechanisms:&lt;/p&gt;
&lt;h3&gt;Stdio Transport&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Uses standard input/output&lt;/li&gt;
&lt;li&gt;Ideal for local tools and scripts&lt;/li&gt;
&lt;li&gt;Simple, reliable, and works well with command-line tools&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;HTTP + SSE Transport&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Uses HTTP POST for client-to-server messages&lt;/li&gt;
&lt;li&gt;Uses &lt;strong&gt;Server-Sent Events (SSE)&lt;/strong&gt; for real-time server-to-client updates&lt;/li&gt;
&lt;li&gt;Useful for remote or cloud-based servers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All transports carry JSON-RPC messages and follow the same protocol semantics.&lt;/p&gt;
&lt;h2&gt;MCP Capabilities&lt;/h2&gt;
&lt;p&gt;MCP defines a small number of &lt;strong&gt;core capabilities&lt;/strong&gt;, each with its own request/response patterns.&lt;/p&gt;
&lt;h3&gt;Resources&lt;/h3&gt;
&lt;p&gt;Servers can expose structured data like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Files&lt;/li&gt;
&lt;li&gt;Logs&lt;/li&gt;
&lt;li&gt;API responses&lt;/li&gt;
&lt;li&gt;Screenshots or binary data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Clients can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;List available resources&lt;/li&gt;
&lt;li&gt;Read their contents&lt;/li&gt;
&lt;li&gt;Subscribe to updates (e.g. file changes)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Tools&lt;/h3&gt;
&lt;p&gt;Servers define &lt;strong&gt;callable functions&lt;/strong&gt; that agents can invoke. Each tool has:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A name&lt;/li&gt;
&lt;li&gt;Description&lt;/li&gt;
&lt;li&gt;JSON schema for inputs&lt;/li&gt;
&lt;li&gt;Output format (text or structured)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Tools are &lt;strong&gt;model-controlled&lt;/strong&gt;, meaning the LLM can decide which tool to use based on context.&lt;/p&gt;
&lt;h3&gt;Prompts&lt;/h3&gt;
&lt;p&gt;Servers can expose &lt;strong&gt;reusable prompt templates&lt;/strong&gt; with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Named arguments&lt;/li&gt;
&lt;li&gt;Context bindings (e.g. resources)&lt;/li&gt;
&lt;li&gt;Multi-step workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Prompts are &lt;strong&gt;user-controlled&lt;/strong&gt;, meaning users select when to run them.&lt;/p&gt;
&lt;h3&gt;Sampling&lt;/h3&gt;
&lt;p&gt;Servers can &lt;strong&gt;ask&lt;/strong&gt; the host model for completions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Specify conversation history and preferences&lt;/li&gt;
&lt;li&gt;Include system prompt and context&lt;/li&gt;
&lt;li&gt;Receive structured completions (text, image, etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This allows &lt;strong&gt;server-side workflows&lt;/strong&gt; to request natural language responses from the model in real time.&lt;/p&gt;
&lt;h2&gt;Security and Isolation&lt;/h2&gt;
&lt;p&gt;MCP provides strong boundaries between components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hosts control what clients and models can see&lt;/li&gt;
&lt;li&gt;Servers expose only the capabilities they choose&lt;/li&gt;
&lt;li&gt;Clients can sandbox or restrict tool access&lt;/li&gt;
&lt;li&gt;Sampling keeps users in control of what prompts and completions occur&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;This makes MCP suitable for sensitive environments like IDEs, enterprise apps, and privacy-conscious tools.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Why This Architecture Matters&lt;/h2&gt;
&lt;p&gt;By standardizing communication between LLMs and tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You can plug a new tool into your environment without modifying your agent&lt;/li&gt;
&lt;li&gt;You can build servers once and use them across different LLM clients (Claude, custom, etc.)&lt;/li&gt;
&lt;li&gt;You get &lt;strong&gt;clear separation of concerns&lt;/strong&gt;: tools, data, and models are independently managed&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;🔮 Coming Up Next: Resources in MCP : Serving Relevant Data Securely&lt;/h2&gt;
&lt;p&gt;In the next post, we’ll zoom in on the &lt;strong&gt;Resources&lt;/strong&gt; capability:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How to structure resources&lt;/li&gt;
&lt;li&gt;How models use them&lt;/li&gt;
&lt;li&gt;Real-world use cases: logs, code, documents, screenshots&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Journey from AI to LLMs and MCP - 6 - Enter the Model Context Protocol (MCP)  – The Interoperability Layer for AI Agents</title><link>https://iceberglakehouse.com/posts/2025-04-model-context-protocol/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-04-model-context-protocol/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2025-04-AI-Agents-MCP-06/).

## Free Res...</description><pubDate>Thu, 10 Apr 2025 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2025-04-AI-Agents-MCP-06/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We’ve spent the last few posts exploring the growing power of AI agents - how they can reason, plan, and take actions across complex tasks. And we’ve looked at the frameworks that help us build these agents. But if you’ve worked with them, you’ve likely hit a wall:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hardcoded toolchains&lt;/li&gt;
&lt;li&gt;Limited to a specific LLM provider&lt;/li&gt;
&lt;li&gt;No easy way to share tools or data between agents&lt;/li&gt;
&lt;li&gt;No consistent interface across clients&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What if we had a &lt;strong&gt;standard&lt;/strong&gt; that let &lt;strong&gt;any agent talk to any data source or tool&lt;/strong&gt;, regardless of where it lives or what it’s built with?&lt;/p&gt;
&lt;p&gt;That’s exactly what the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; brings to the table.&lt;/p&gt;
&lt;p&gt;And if you’re from the data engineering world, MCP is to AI agents what the &lt;strong&gt;Apache Iceberg REST protocol&lt;/strong&gt; is to analytics:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A universal, pluggable interface that enables many clients to interact with many servers - without tight coupling.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;What Is the Model Context Protocol (MCP)?&lt;/h2&gt;
&lt;p&gt;MCP is an &lt;strong&gt;open protocol&lt;/strong&gt; that defines how LLM-powered applications (like agents, IDEs, or copilots) can access &lt;strong&gt;context, tools, and actions&lt;/strong&gt; in a standardized way.&lt;/p&gt;
&lt;p&gt;Think of it as the &amp;quot;interface layer&amp;quot; between:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Clients&lt;/strong&gt;: LLMs or AI agents that need context and capabilities&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Servers&lt;/strong&gt;: Local or remote services that expose data, tools, or prompts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hosts&lt;/strong&gt;: The environment where the LLM runs (e.g., Claude Desktop, a browser extension, or an IDE plugin)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It defines a &lt;strong&gt;common language&lt;/strong&gt; for exchanging:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Resources&lt;/strong&gt; (data the model can read)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tools&lt;/strong&gt; (functions the model can invoke)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prompts&lt;/strong&gt; (templates the user or model can reuse)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sampling&lt;/strong&gt; (ways servers can request completions from the model)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This allows you to &lt;strong&gt;plug in new capabilities without rearchitecting your agent or retraining your model&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;🧱 How MCP Mirrors Apache Iceberg’s REST Protocol&lt;/h2&gt;
&lt;p&gt;Let’s draw the parallel:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;Apache Iceberg REST&lt;/th&gt;
&lt;th&gt;Model Context Protocol (MCP)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Standardized API&lt;/td&gt;
&lt;td&gt;REST endpoints for table ops&lt;/td&gt;
&lt;td&gt;JSON-RPC messages for context/tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decouples client/server&lt;/td&gt;
&lt;td&gt;Any engine ↔ any Iceberg catalog&lt;/td&gt;
&lt;td&gt;Any LLM/agent ↔ any tool or data backend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-client support&lt;/td&gt;
&lt;td&gt;Spark, Trino, Flink, Dremio&lt;/td&gt;
&lt;td&gt;Claude, custom agents, IDEs, terminals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pluggable backends&lt;/td&gt;
&lt;td&gt;S3, HDFS, Minio, Pure Storage, GCS&lt;/td&gt;
&lt;td&gt;Filesystem, APIs, databases, web services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interoperable tooling&lt;/td&gt;
&lt;td&gt;REST = portable across ecosystems&lt;/td&gt;
&lt;td&gt;MCP = portable across LLM environments&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Just as Iceberg REST made it possible for &lt;strong&gt;Dremio&lt;/strong&gt; to talk to a table created in &lt;strong&gt;Snowflake&lt;/strong&gt;, MCP allows a tool exposed in &lt;strong&gt;Python on your laptop&lt;/strong&gt; to be used by an LLM in &lt;strong&gt;Claude Desktop&lt;/strong&gt;, a VS Code agent, or even a web-based chatbot.&lt;/p&gt;
&lt;h2&gt;🔁 MCP in Action : A Real-World Use Case&lt;/h2&gt;
&lt;p&gt;Imagine this workflow:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;You’re coding in an IDE powered by an AI assistant&lt;/li&gt;
&lt;li&gt;The model wants to read your logs and run some shell scripts&lt;/li&gt;
&lt;li&gt;Your data lives locally, and your tools are custom-built in Python&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;With MCP:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The IDE (host) runs an &lt;strong&gt;MCP client&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Your Python tool is exposed via an &lt;strong&gt;MCP server&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;The AI assistant (client) calls your custom “tail logs” tool&lt;/li&gt;
&lt;li&gt;The results are streamed back, all through the &lt;strong&gt;standardized protocol&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And tomorrow, you could replace that assistant with a different model or switch to a browser-based environment - and everything would still work.&lt;/p&gt;
&lt;h2&gt;The Core Components of MCP&lt;/h2&gt;
&lt;p&gt;Let’s break down the architecture:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Hosts&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;These are environments where the LLM application lives (e.g., Claude Desktop, your IDE). They manage connections to MCP clients.&lt;/p&gt;
&lt;h3&gt;2. &lt;strong&gt;Clients&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Embedded in the host, each client maintains a connection to a specific server. It speaks MCP’s message protocol and exposes capabilities upstream to the model.&lt;/p&gt;
&lt;h3&gt;3. &lt;strong&gt;Servers&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Programs that expose capabilities like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;resources/list&lt;/code&gt; and &lt;code&gt;resources/read&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tools/list&lt;/code&gt; and &lt;code&gt;tools/call&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;prompts/list&lt;/code&gt; and &lt;code&gt;prompts/get&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;sampling/createMessage&lt;/code&gt; (to request completions from the model)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Servers can live anywhere: locally on your machine, behind an API, or running in a cloud environment.&lt;/p&gt;
&lt;h2&gt;What Can MCP Servers Do?&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Expose local or remote files&lt;/strong&gt; (logs, documents, screenshots, live data)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Define tools&lt;/strong&gt; for executing business logic, running commands, or calling APIs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Provide reusable prompt templates&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Request completions from the host model&lt;/strong&gt; (sampling)&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;And all of this is done in a protocol-agnostic, secure, pluggable format.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Why This Matters&lt;/h2&gt;
&lt;p&gt;With MCP, we finally get &lt;strong&gt;interoperability in the AI stack&lt;/strong&gt;: a shared interface layer between:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;LLMs and tools&lt;/li&gt;
&lt;li&gt;Agents and environments&lt;/li&gt;
&lt;li&gt;Models and real-world data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It gives us:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Modularity&lt;/strong&gt;: Swap out components without breaking workflows&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reusability&lt;/strong&gt;: Build once, use everywhere&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security&lt;/strong&gt;: Limit what models can see and do through capabilities&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Observability&lt;/strong&gt;: Track how tools are used and what context is passed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Language-agnostic integration&lt;/strong&gt;: Servers can be written in Python, JavaScript, C#, and more&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In short, MCP helps you go from &lt;strong&gt;monolithic, tangled agents&lt;/strong&gt; to &lt;strong&gt;modular, composable AI systems&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;What’s Next: Diving Deeper into MCP Internals&lt;/h2&gt;
&lt;p&gt;In the next few posts, we’ll dig into each part of MCP:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Message formats and lifecycle&lt;/li&gt;
&lt;li&gt;How resources and tools are structured&lt;/li&gt;
&lt;li&gt;Sampling, prompts, and real-time feedback loops&lt;/li&gt;
&lt;li&gt;Best practices for building your own MCP server&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Journey from AI to LLMs and MCP - 5 - AI Agent Frameworks  – Benefits and Limitations</title><link>https://iceberglakehouse.com/posts/2025-04-ai-agent-frameworks/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-04-ai-agent-frameworks/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2025-04-AI-Agents-MCP-05/).

## Free Res...</description><pubDate>Wed, 09 Apr 2025 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2025-04-AI-Agents-MCP-05/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In our last post, we explored what makes an &lt;strong&gt;AI agent&lt;/strong&gt; different from a traditional LLM - memory, tools, reasoning, and autonomy. These agents are the foundation of a new generation of intelligent applications.&lt;/p&gt;
&lt;p&gt;But how are these agents built today?&lt;/p&gt;
&lt;p&gt;Enter &lt;strong&gt;agent frameworks&lt;/strong&gt; - open-source libraries and developer toolkits that let you create goal-driven AI systems by wiring together models, memory, tools, and logic. These frameworks are enabling some of the most exciting innovations in the AI space... but they also come with trade-offs.&lt;/p&gt;
&lt;p&gt;In this post, we’ll dive into:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What AI agent frameworks are&lt;/li&gt;
&lt;li&gt;The most popular frameworks available today&lt;/li&gt;
&lt;li&gt;The benefits they offer&lt;/li&gt;
&lt;li&gt;Where they fall short&lt;/li&gt;
&lt;li&gt;Why we need something more modular and flexible (spoiler: MCP)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What Is an AI Agent Framework?&lt;/h2&gt;
&lt;p&gt;An AI agent framework is a development toolkit that simplifies the process of building &lt;strong&gt;LLM-powered systems&lt;/strong&gt; capable of reasoning, acting, and learning in real time. These frameworks abstract away much of the complexity involved in working with large language models (LLMs) by bundling together key components like memory, tools, task planning, and context management.&lt;/p&gt;
&lt;p&gt;Agent frameworks shift the focus from &amp;quot;generating text&amp;quot; to &amp;quot;completing goals.&amp;quot; They let developers orchestrate multi-step workflows where an LLM isn&apos;t just answering questions but taking action, executing logic, and retrieving relevant data.&lt;/p&gt;
&lt;h3&gt;Memory&lt;/h3&gt;
&lt;p&gt;Memory in AI agents refers to how information from past interactions is stored, retrieved, and reused. This can be split into two primary types:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Short-term memory&lt;/strong&gt;: Keeps track of the current conversation or task state. Usually implemented as a conversation history buffer or rolling context window.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Long-term memory&lt;/strong&gt;: Stores past interactions, facts, or discoveries for reuse across sessions. Typically backed by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;vector database&lt;/strong&gt; (e.g., Pinecone, FAISS, Weaviate)&lt;/li&gt;
&lt;li&gt;Embedding models that turn text into numerical vectors&lt;/li&gt;
&lt;li&gt;A retrieval layer that finds the most relevant memories using similarity search&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Under the hood:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Text is embedded into a vector representation (via models like OpenAI’s &lt;code&gt;text-embedding-ada-002&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;These vectors are stored in a database&lt;/li&gt;
&lt;li&gt;When new input arrives, it’s embedded and compared to stored vectors&lt;/li&gt;
&lt;li&gt;Top matches are fetched and injected into the LLM prompt as background context&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Tools&lt;/h3&gt;
&lt;p&gt;Tools are external functions that the agent can invoke to perform actions or retrieve live information. These can include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Calling an API (e.g., weather, GitHub, SQL query)&lt;/li&gt;
&lt;li&gt;Executing a shell command or script&lt;/li&gt;
&lt;li&gt;Reading a file or database&lt;/li&gt;
&lt;li&gt;Sending a message or triggering an automation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Frameworks like &lt;strong&gt;LangChain&lt;/strong&gt;, &lt;strong&gt;AutoGPT&lt;/strong&gt;, and &lt;strong&gt;Semantic Kernel&lt;/strong&gt; often use JSON schemas to define tool inputs and outputs. LLMs &amp;quot;see&amp;quot; tool descriptions and decide when and how to invoke them.&lt;/p&gt;
&lt;p&gt;Under the hood:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each tool is registered with a name, description, and parameter schema&lt;/li&gt;
&lt;li&gt;The LLM is given a list of available tools and their specs&lt;/li&gt;
&lt;li&gt;When the LLM &amp;quot;decides&amp;quot; to use a tool, it returns a structured tool call (e.g., &lt;code&gt;{&amp;quot;name&amp;quot;: &amp;quot;search_docs&amp;quot;, &amp;quot;args&amp;quot;: {&amp;quot;query&amp;quot;: &amp;quot;sales trends&amp;quot;}}&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;The framework intercepts the call, executes the corresponding function, and feeds the result back to the model&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This allows the agent to &amp;quot;act&amp;quot; on the world, not just describe it.&lt;/p&gt;
&lt;h3&gt;🧠 Reasoning and Planning&lt;/h3&gt;
&lt;p&gt;Reasoning is what enables agents to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Decompose goals into steps&lt;/li&gt;
&lt;li&gt;Decide what tools or memory to use&lt;/li&gt;
&lt;li&gt;Track intermediate results&lt;/li&gt;
&lt;li&gt;Adjust their strategy based on feedback&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Frameworks often support:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;React-style loops&lt;/strong&gt;: Reasoning + action → observation → repeat&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Planner-executor separation&lt;/strong&gt;: One model plans, another carries out steps&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Task graphs&lt;/strong&gt;: Nodes (LLM calls, tools, decisions) arranged in a DAG&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Under the hood:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The LLM is prompted to plan tasks using a scratchpad (e.g., &amp;quot;Thought → Action → Observation&amp;quot;)&lt;/li&gt;
&lt;li&gt;The agent parses the output to decide the next step&lt;/li&gt;
&lt;li&gt;Control flow logic (loops, retries, branches) is often implemented in code, not by the model&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This turns the agent into a &lt;strong&gt;semi-autonomous problem-solver&lt;/strong&gt;, not just a one-shot prompt engine.&lt;/p&gt;
&lt;h3&gt;🧾 Context Management&lt;/h3&gt;
&lt;p&gt;Context management is about deciding &lt;strong&gt;what information gets passed into the LLM prompt&lt;/strong&gt; at any given time. This is critical because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Token limits constrain how much data can be included&lt;/li&gt;
&lt;li&gt;Irrelevant information can degrade model performance&lt;/li&gt;
&lt;li&gt;Sensitive data must be filtered for security and compliance&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Frameworks handle context by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Selecting relevant memory or documents via vector search&lt;/li&gt;
&lt;li&gt;Condensing history into summaries&lt;/li&gt;
&lt;li&gt;Prioritizing inputs (e.g., task instructions, user preferences, retrieved data)&lt;/li&gt;
&lt;li&gt;Inserting only high-signal content into the prompt&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Under the hood:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Context is assembled as structured messages (usually in OpenAI or Anthropic chat formats)&lt;/li&gt;
&lt;li&gt;Some frameworks dynamically prune, summarize, or chunk data to fit within model limits&lt;/li&gt;
&lt;li&gt;Smart caching or pagination may be used to maintain continuity across long sessions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Agent frameworks abstract complex functionality into composable components:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;How It Works Under the Hood&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;Recalls past interactions and facts&lt;/td&gt;
&lt;td&gt;Vector embeddings, similarity search, context injection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tools&lt;/td&gt;
&lt;td&gt;Executes real-world actions&lt;/td&gt;
&lt;td&gt;Function schemas, LLM tool calls, output feedback loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning&lt;/td&gt;
&lt;td&gt;Plans steps, decides next action&lt;/td&gt;
&lt;td&gt;Thought-action-observation loops, scratchpads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context Mgmt&lt;/td&gt;
&lt;td&gt;Curates what the model sees&lt;/td&gt;
&lt;td&gt;Dynamic prompt construction, summarization, filtering&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Together, these allow developers to build &lt;strong&gt;goal-seeking agents&lt;/strong&gt; that work across domains - analytics, support, operations, creative work, and more.&lt;/p&gt;
&lt;p&gt;Agent frameworks provide the scaffolding. LLMs provide the intelligence.&lt;/p&gt;
&lt;h2&gt;Popular AI Agent Frameworks&lt;/h2&gt;
&lt;p&gt;Let’s look at some of the leading options:&lt;/p&gt;
&lt;h3&gt;LangChain&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Language&lt;/strong&gt;: Python, JavaScript&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Large ecosystem of components&lt;/li&gt;
&lt;li&gt;Support for chains, tools, memory, agents&lt;/li&gt;
&lt;li&gt;Integrates with most major LLMs, vector DBs, and APIs&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Limitations&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Can become overly complex&lt;/li&gt;
&lt;li&gt;Boilerplate-heavy for simple tasks&lt;/li&gt;
&lt;li&gt;Hard to reason about internal agent state&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;AutoGPT / BabyAGI&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Language&lt;/strong&gt;: Python&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Fully autonomous task execution loops&lt;/li&gt;
&lt;li&gt;Goal-first architecture (recursive reasoning)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Limitations&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Unpredictable behavior (&amp;quot;runaway agents&amp;quot;)&lt;/li&gt;
&lt;li&gt;Tooling and error handling are immature&lt;/li&gt;
&lt;li&gt;Not production-grade (yet)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Semantic Kernel (Microsoft)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Language&lt;/strong&gt;: C#, Python&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Enterprise-ready tooling&lt;/li&gt;
&lt;li&gt;Strong integration with Microsoft ecosystems&lt;/li&gt;
&lt;li&gt;Planner APIs and plugin system&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Limitations&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Steeper learning curve&lt;/li&gt;
&lt;li&gt;Limited community and examples&lt;/li&gt;
&lt;li&gt;More opinionated structure&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;CrewAI / MetaGPT&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Language&lt;/strong&gt;: Python&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Multi-agent collaboration&lt;/li&gt;
&lt;li&gt;Role-based task assignment&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Limitations&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Heavy on orchestration&lt;/li&gt;
&lt;li&gt;Still early in maturity&lt;/li&gt;
&lt;li&gt;Debugging agent interactions is hard&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Benefits of Using an Agent Framework&lt;/h2&gt;
&lt;p&gt;These tools have unlocked new possibilities for developers building AI-powered workflows. Let’s summarize the major benefits:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benefit&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Abstractions for Tools&lt;/td&gt;
&lt;td&gt;Call APIs or local functions directly from within agent flows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built-in Memory&lt;/td&gt;
&lt;td&gt;Manage short-term context and long-term recall without manual prompt engineering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Modular Design&lt;/td&gt;
&lt;td&gt;Compose systems using interchangeable components&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Planning + Looping&lt;/td&gt;
&lt;td&gt;Support multi-step task execution with feedback loops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rapid Prototyping&lt;/td&gt;
&lt;td&gt;Build functional AI assistants quickly with reusable components&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In short: &lt;strong&gt;agent frameworks supercharge developer productivity&lt;/strong&gt; when working with LLMs.&lt;/p&gt;
&lt;h2&gt;Where Agent Frameworks Fall Short&lt;/h2&gt;
&lt;p&gt;Despite all their strengths, modern agent frameworks share some core limitations:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Tight Coupling to Models and Providers&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Most frameworks are tightly bound to OpenAI, Anthropic, or Hugging Face models. Switching providers: or supporting multiple, is complex and risky.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Want to try Claude instead of GPT-4? You might need to refactor your entire chain.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;2. &lt;strong&gt;Context Management Is Manual and Error-Prone&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Choosing what context to pass to the LLM (memory, docs, prior results) is often left to the developer. It’s:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hard to debug&lt;/li&gt;
&lt;li&gt;Easy to overrun token limits&lt;/li&gt;
&lt;li&gt;Non-standardized&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. &lt;strong&gt;Lack of Interoperability&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Most frameworks don’t play well together. Tools, memory stores, and prompt logic often live in their own silos.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You can’t easily plug a LangChain tool into a Semantic Kernel workflow.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;4. &lt;strong&gt;Hard to Secure and Monitor&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Giving agents tool access (e.g., shell commands, APIs) is powerful but risky:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No standard for input validation&lt;/li&gt;
&lt;li&gt;No logging/auditing for tool usage&lt;/li&gt;
&lt;li&gt;Few controls for human-in-the-loop approvals&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. &lt;strong&gt;Opaque Agent Logic&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Agents often make decisions that are hard to trace or debug. Why did the agent call that tool? Why did it loop forever?&lt;/p&gt;
&lt;h2&gt;The Missing Layer: Standardized Context + Tool Protocols&lt;/h2&gt;
&lt;p&gt;We need a better abstraction layer - something that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Decouples LLMs from the tools and data they use&lt;/li&gt;
&lt;li&gt;Allows agents to access secure, structured resources&lt;/li&gt;
&lt;li&gt;Enables modular, composable agents across languages and platforms&lt;/li&gt;
&lt;li&gt;Works with any client, model, or provider&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That’s where the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; comes in.&lt;/p&gt;
&lt;h2&gt;What’s Next: Introducing the Model Context Protocol (MCP)&lt;/h2&gt;
&lt;p&gt;In the next post, we’ll explore:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What MCP is&lt;/li&gt;
&lt;li&gt;How it enables secure, flexible agent architectures&lt;/li&gt;
&lt;li&gt;Why it&apos;s the “USB-C port” for LLMs and tools&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We’ll walk through the architecture and show how MCP solves many of the problems outlined in this post.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Journey from AI to LLMs and MCP - 4 - What Are AI Agents  – And Why They&apos;re the Future of LLM Applications</title><link>https://iceberglakehouse.com/posts/2025-04-what-are-ai-agents/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-04-what-are-ai-agents/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2025-04-AI-Agents-MCP-04/).

## Free Res...</description><pubDate>Tue, 08 Apr 2025 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2025-04-AI-Agents-MCP-04/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We’ve explored how Large Language Models (LLMs) work, and how we can improve their performance with fine-tuning, prompt engineering, and retrieval-augmented generation (RAG). These enhancements are powerful - but they’re still fundamentally &lt;em&gt;stateless&lt;/em&gt; and reactive.&lt;/p&gt;
&lt;p&gt;To build systems that act with purpose, adapt over time, and accomplish multi-step goals, we need something more.&lt;/p&gt;
&lt;p&gt;That “something” is the &lt;strong&gt;AI Agent&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In this post, we’ll explore:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What AI agents are&lt;/li&gt;
&lt;li&gt;How they differ from LLMs&lt;/li&gt;
&lt;li&gt;What components make up an agent&lt;/li&gt;
&lt;li&gt;Real-world examples of agent use&lt;/li&gt;
&lt;li&gt;Why agents are a crucial next step for AI&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What Is an AI Agent?&lt;/h2&gt;
&lt;p&gt;At a high level, an &lt;strong&gt;AI agent&lt;/strong&gt; is an autonomous or semi-autonomous system built around an LLM, capable of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Observing its environment (inputs, tools, data)&lt;/li&gt;
&lt;li&gt;Reasoning or planning&lt;/li&gt;
&lt;li&gt;Taking actions&lt;/li&gt;
&lt;li&gt;Learning or adapting over time&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;LLMs generate responses, but &lt;strong&gt;agents make decisions&lt;/strong&gt;. They don’t just answer; they &lt;em&gt;think, decide, and act&lt;/em&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Think of the difference between a calculator and a virtual assistant. One gives answers. The other &lt;em&gt;gets things done&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;The Core Ingredients of an AI Agent&lt;/h2&gt;
&lt;p&gt;Let’s break down what typically makes up an agentic system:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;LLM Core&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The brain of the operation. Handles natural language understanding and generation.&lt;/p&gt;
&lt;h3&gt;2. &lt;strong&gt;Tools / Actions&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Agents can execute external commands, like calling APIs, querying databases, or running code.&lt;/p&gt;
&lt;h3&gt;3. &lt;strong&gt;Memory&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Persistent memory lets agents recall previous interactions, facts, or task states.&lt;/p&gt;
&lt;h3&gt;4. &lt;strong&gt;Planner / Executor Logic&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;This is where agents shine. They can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Break down complex goals into subtasks&lt;/li&gt;
&lt;li&gt;Decide which tools or steps to take&lt;/li&gt;
&lt;li&gt;Loop, retry, or adapt based on results&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. &lt;strong&gt;Context Manager&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Decides what information (memory, documents, tool results) gets included in each LLM prompt.&lt;/p&gt;
&lt;h2&gt;LLM vs AI Agent : Key Differences&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;LLM&lt;/th&gt;
&lt;th&gt;AI Agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Input&lt;/td&gt;
&lt;td&gt;Prompt&lt;/td&gt;
&lt;td&gt;Prompt + tools + state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;Ephemeral (context)&lt;/td&gt;
&lt;td&gt;Persistent (via external memory)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning&lt;/td&gt;
&lt;td&gt;Single-shot&lt;/td&gt;
&lt;td&gt;Multi-step planning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Action-taking&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (tools, APIs, workflows)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Autonomy&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Optional (user- or goal-directed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adaptability&lt;/td&gt;
&lt;td&gt;Static behavior&lt;/td&gt;
&lt;td&gt;Dynamic, can learn from feedback&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;LLMs are the engine. Agents are the vehicle.&lt;/p&gt;
&lt;h2&gt;Examples of AI Agents in the Wild&lt;/h2&gt;
&lt;p&gt;Let’s explore how AI agents are already showing up in real-world applications:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Developer Copilots&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Tools like GitHub Copilot or Cursor act as coding assistants, not just autocomplete engines. They:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Read your project files&lt;/li&gt;
&lt;li&gt;Ask clarifying questions&lt;/li&gt;
&lt;li&gt;Suggest multi-line changes&lt;/li&gt;
&lt;li&gt;Run code against test cases&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. &lt;strong&gt;Document Q&amp;amp;A Assistants&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Instead of just answering questions, agents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Search relevant documents&lt;/li&gt;
&lt;li&gt;Summarize findings&lt;/li&gt;
&lt;li&gt;Ask follow-up questions&lt;/li&gt;
&lt;li&gt;Offer next actions (e.g., generate reports)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. &lt;strong&gt;Research Agents&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Given a broad prompt like &lt;em&gt;“summarize recent news on AI regulation,”&lt;/em&gt; agents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Plan a research strategy&lt;/li&gt;
&lt;li&gt;Browse the web or internal data&lt;/li&gt;
&lt;li&gt;Synthesize and refine results&lt;/li&gt;
&lt;li&gt;Ask for confirmation before continuing&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;🔄 Agents Enable Autonomy and Feedback Loops&lt;/h2&gt;
&lt;p&gt;Unlike plain LLMs, agents can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;tools&lt;/strong&gt; to gather more info&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Loop&lt;/strong&gt; on tasks until a condition is met&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Store and recall&lt;/strong&gt; what they’ve seen&lt;/li&gt;
&lt;li&gt;Chain multiple steps together&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;For example:&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Task:&lt;/strong&gt; Schedule a meeting with Alice&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Agent:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Search calendar availability&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Find Alice’s preferred times&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Draft an email proposal&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Wait for response&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Reschedule if needed&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That’s not a single LLM prompt, that’s an intelligent system managing an evolving task.&lt;/p&gt;
&lt;h2&gt;How Are Agents Built Today?&lt;/h2&gt;
&lt;p&gt;A number of popular &lt;strong&gt;AI agent frameworks&lt;/strong&gt; have emerged:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;LangChain&lt;/strong&gt;: Modular orchestration of LLMs, tools, and memory&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AutoGPT&lt;/strong&gt;: Autonomous task completion with iterative planning&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Semantic Kernel&lt;/strong&gt;: Microsoft’s framework for embedding LLMs into software&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CrewAI / MetaGPT&lt;/strong&gt;: Multi-agent systems with defined roles&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These frameworks let developers prototype powerful workflows, but they come with challenges - especially around complexity, tool integration, and portability.&lt;/p&gt;
&lt;p&gt;We’ll explore those challenges in the next post.&lt;/p&gt;
&lt;h2&gt;Limitations of Today’s Agent Implementations&lt;/h2&gt;
&lt;p&gt;While agents are promising, current frameworks have some limitations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tight coupling&lt;/strong&gt; to specific models or tools&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Difficult interoperability&lt;/strong&gt; between agent components&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context juggling&lt;/strong&gt;: hard to manage what the model sees&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security and control&lt;/strong&gt;: risk of unsafe tool access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hard to debug&lt;/strong&gt;: agents can go rogue or get stuck in loops&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To address these, we need &lt;strong&gt;standardization&lt;/strong&gt;: a modular way to plug in data, tools, and models securely and flexibly.&lt;/p&gt;
&lt;p&gt;That’s where the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; enters the picture.&lt;/p&gt;
&lt;h2&gt;Coming Up Next: AI Agent Frameworks : Benefits and Limitations&lt;/h2&gt;
&lt;p&gt;In our next post, we’ll explore:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How modern agent frameworks work&lt;/li&gt;
&lt;li&gt;What they enable (and where they fall short)&lt;/li&gt;
&lt;li&gt;The missing layer that MCP provides&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Journey from AI to LLMs and MCP - 3 - Boosting LLM Performance  – Fine-Tuning, Prompt Engineering, and RAG</title><link>https://iceberglakehouse.com/posts/2025-04-boosting-llm-performance/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-04-boosting-llm-performance/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2025-04-AI-Agents-MCP-03/).

## Free Res...</description><pubDate>Mon, 07 Apr 2025 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2025-04-AI-Agents-MCP-03/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In our last post, we explored how LLMs process text using embeddings and vector spaces within limited context windows. While LLMs are powerful out-of-the-box, they aren’t perfect - and in many real-world scenarios, we need to push them further.&lt;/p&gt;
&lt;p&gt;That’s where enhancement techniques come in.&lt;/p&gt;
&lt;p&gt;In this post, we’ll walk through the three most popular and practical ways to &lt;strong&gt;boost the performance of Large Language Models (LLMs)&lt;/strong&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Fine-tuning&lt;/li&gt;
&lt;li&gt;Prompt engineering&lt;/li&gt;
&lt;li&gt;Retrieval-Augmented Generation (RAG)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each approach has its strengths, trade-offs, and ideal use cases. By the end, you’ll know when to use each - and how they work under the hood.&lt;/p&gt;
&lt;h2&gt;1. Fine-Tuning : Teaching the Model New Tricks&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Fine-tuning&lt;/strong&gt; is the process of training an existing LLM on custom datasets to improve its behavior on specific tasks.&lt;/p&gt;
&lt;h3&gt;How it works:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;You take a pre-trained model (like GPT or LLaMA).&lt;/li&gt;
&lt;li&gt;You feed it new examples in a structured format (instructions + completions).&lt;/li&gt;
&lt;li&gt;The model updates its internal weights based on this new data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Think of it like giving the model a focused education after it’s graduated from a general AI university.&lt;/p&gt;
&lt;h3&gt;When to use it:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;You want a custom assistant that uses your company’s voice&lt;/li&gt;
&lt;li&gt;You need the model to perform a specialized task (e.g., legal analysis, medical diagnostics)&lt;/li&gt;
&lt;li&gt;You have recurring, structured inputs that aren’t handled well with prompting alone&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Trade-offs:&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Highly accurate for specific tasks&lt;/td&gt;
&lt;td&gt;Expensive (compute + time)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reduces prompt complexity&lt;/td&gt;
&lt;td&gt;Risk of overfitting or forgetting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works well offline or locally&lt;/td&gt;
&lt;td&gt;Not ideal for frequently changing data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;Fine-tuning is powerful, but it’s not always the first choice - especially when you need flexibility or real-time knowledge.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;2. Prompt Engineering : Speaking the Model’s Language&lt;/h2&gt;
&lt;p&gt;Sometimes, you don’t need to retrain the model - you just need to &lt;em&gt;talk to it better&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prompt engineering&lt;/strong&gt; is the art of crafting inputs that guide the model to behave the way you want. It’s fast, flexible, and doesn’t require model access.&lt;/p&gt;
&lt;h3&gt;Prompting patterns:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Zero-shot prompting&lt;/strong&gt;: Just ask a question
&lt;blockquote&gt;
&lt;p&gt;“Summarize this article.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Few-shot prompting&lt;/strong&gt;: Show examples
&lt;blockquote&gt;
&lt;p&gt;“Here’s how I want you to respond…”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chain-of-Thought (CoT)&lt;/strong&gt;: Encourage reasoning
&lt;blockquote&gt;
&lt;p&gt;“Let’s think step by step…”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Tools and techniques:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Templates: Reusable format strings with variables&lt;/li&gt;
&lt;li&gt;Constraints: “Answer in JSON” or “Limit to 100 words”&lt;/li&gt;
&lt;li&gt;Personas: “You are a helpful legal assistant...”&lt;/li&gt;
&lt;li&gt;System prompts (where supported): Define role and tone&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;When to use it:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;You’re working with a hosted LLM (OpenAI, Anthropic, etc.)&lt;/li&gt;
&lt;li&gt;You want to avoid infrastructure and cost overhead&lt;/li&gt;
&lt;li&gt;You need to quickly iterate and improve outcomes&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Trade-offs:&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fast to test and implement&lt;/td&gt;
&lt;td&gt;Sensitive to wording&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Doesn’t require model access&lt;/td&gt;
&lt;td&gt;Can be brittle or unpredictable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Great for prototyping&lt;/td&gt;
&lt;td&gt;Doesn’t scale well for complex logic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;Prompt engineering is like UX for AI - small changes in input can completely change the output.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;3. Retrieval-Augmented Generation (RAG) : Give the Model Real-Time Knowledge&lt;/h2&gt;
&lt;p&gt;RAG is a game-changer for context-aware applications.&lt;/p&gt;
&lt;p&gt;Instead of cramming all your knowledge into a model, &lt;strong&gt;RAG retrieves relevant information at runtime&lt;/strong&gt; and includes it in the prompt.&lt;/p&gt;
&lt;h3&gt;How it works:&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;User sends a query&lt;/li&gt;
&lt;li&gt;System runs a &lt;strong&gt;semantic search&lt;/strong&gt; over a vector database&lt;/li&gt;
&lt;li&gt;Top-matching documents are inserted into the prompt&lt;/li&gt;
&lt;li&gt;The LLM generates a response using both query + retrieved context&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This gives you &lt;strong&gt;dynamic, real-time access&lt;/strong&gt; to external knowledge - without retraining.&lt;/p&gt;
&lt;h3&gt;Typical RAG architecture:&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;User → Query → Vector Search (Embeddings) → Top K Documents → LLM Prompt → Response
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Use case examples:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Chatbots that answer questions from company docs&lt;/li&gt;
&lt;li&gt;Developer copilots that can search codebases&lt;/li&gt;
&lt;li&gt;LLMs that read log files, support tickets, or PDFs&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Trade-offs:&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Real-time access to changing data&lt;/td&gt;
&lt;td&gt;Adds latency due to search layer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No need to retrain the model&lt;/td&gt;
&lt;td&gt;Requires infrastructure (DB + search)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Keeps context windows lean&lt;/td&gt;
&lt;td&gt;Needs good chunking &amp;amp; ranking logic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;With RAG, your LLM becomes a smart interface to &lt;em&gt;your&lt;/em&gt; data - not just the internet.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Choosing the Right Enhancement Technique&lt;/h2&gt;
&lt;p&gt;Here’s a quick cheat sheet to help you choose:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Goal&lt;/th&gt;
&lt;th&gt;Best Technique&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Specialize a model on internal tasks&lt;/td&gt;
&lt;td&gt;Fine-tuning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guide output or behavior flexibly&lt;/td&gt;
&lt;td&gt;Prompt engineering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inject dynamic, real-time knowledge&lt;/td&gt;
&lt;td&gt;Retrieval-Augmented Gen&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Often, the best systems &lt;strong&gt;combine&lt;/strong&gt; these techniques:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fine-tuned base model&lt;/li&gt;
&lt;li&gt;With prompt templates&lt;/li&gt;
&lt;li&gt;And external knowledge via RAG&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is exactly what advanced AI agent systems are starting to do - and it’s where we’re heading next.&lt;/p&gt;
&lt;h2&gt;Recap: Boosting LLMs Is All About Context and Control&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technique&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;Ideal For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fine-Tuning&lt;/td&gt;
&lt;td&gt;Teaches model new behavior&lt;/td&gt;
&lt;td&gt;Repetitive, specialized tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt Engineering&lt;/td&gt;
&lt;td&gt;Crafts effective inputs&lt;/td&gt;
&lt;td&gt;Fast prototyping, hosted models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG&lt;/td&gt;
&lt;td&gt;Adds knowledge dynamically at runtime&lt;/td&gt;
&lt;td&gt;Large, evolving, external datasets&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2&gt;Up Next: What Are AI Agents : And Why They’re the Future&lt;/h2&gt;
&lt;p&gt;Now that we’ve learned how to enhance individual LLMs, the next evolution is combining them with tools, memory, and logic to create &lt;strong&gt;AI Agents&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll explore:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What makes something an AI agent&lt;/li&gt;
&lt;li&gt;How agents orchestrate LLMs + tools&lt;/li&gt;
&lt;li&gt;Why they’re essential for real-world use&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Journey from AI to LLMs and MCP - 2 - How LLMs Work  – Embeddings, Vectors, and Context Windows</title><link>https://iceberglakehouse.com/posts/2025-04-how-llms-work/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-04-how-llms-work/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2025-04-AI-Agents-MCP-02/).

## Free Res...</description><pubDate>Sun, 06 Apr 2025 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2025-04-AI-Agents-MCP-02/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In our last post, we explored the evolution of AI: from rule-based systems to deep learning, and how &lt;strong&gt;Large Language Models (LLMs)&lt;/strong&gt; like GPT-4 and Claude represent a transformative leap in capability.&lt;/p&gt;
&lt;p&gt;But how do these models &lt;em&gt;actually&lt;/em&gt; work?&lt;/p&gt;
&lt;p&gt;In this post, we’ll peel back the curtain on the inner workings of LLMs. We’ll explore the fundamental concepts that make these models tick: &lt;strong&gt;embeddings&lt;/strong&gt;, &lt;strong&gt;vector spaces&lt;/strong&gt;, and &lt;strong&gt;context windows&lt;/strong&gt;. You’ll walk away with a clearer understanding of how LLMs “understand” language - and what their limits are.&lt;/p&gt;
&lt;h2&gt;How LLMs Think: It’s All Math Underneath&lt;/h2&gt;
&lt;p&gt;Despite their fluent text output, LLMs don’t truly &amp;quot;understand&amp;quot; language in the human sense. Instead, they operate on numerical representations of text, using vast networks of mathematical weights to predict the next word in a sequence.&lt;/p&gt;
&lt;p&gt;The key mechanism behind this: &lt;strong&gt;transformers&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Transformers revolutionized NLP by allowing models to weigh the relevance of each word in a sentence: &lt;strong&gt;attention mechanisms&lt;/strong&gt;, instead of processing words one-by-one like RNNs.&lt;/p&gt;
&lt;p&gt;Here’s the simplified flow:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Text is &lt;strong&gt;tokenized&lt;/strong&gt; (split into chunks)&lt;/li&gt;
&lt;li&gt;Tokens are converted into &lt;strong&gt;embeddings&lt;/strong&gt; (vectors)&lt;/li&gt;
&lt;li&gt;Those vectors pass through &lt;strong&gt;layers of attention&lt;/strong&gt; to capture meaning&lt;/li&gt;
&lt;li&gt;The model generates the next token based on probability&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;But what are these &lt;strong&gt;embeddings&lt;/strong&gt; and why do they matter?&lt;/p&gt;
&lt;h2&gt;Embeddings: From Words to Numbers&lt;/h2&gt;
&lt;p&gt;Before an LLM can do anything with language, it must convert words into numbers it can operate on.&lt;/p&gt;
&lt;p&gt;That’s where &lt;strong&gt;embeddings&lt;/strong&gt; come in.&lt;/p&gt;
&lt;h3&gt;What is an embedding?&lt;/h3&gt;
&lt;p&gt;An embedding is a &lt;strong&gt;high-dimensional vector&lt;/strong&gt; (think: a long list of numbers) that represents the meaning of a word or phrase.&lt;/p&gt;
&lt;p&gt;Words with similar meanings have &lt;strong&gt;similar embeddings&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Embedding(&amp;quot;dog&amp;quot;) ≈ Embedding(&amp;quot;puppy&amp;quot;) Embedding(&amp;quot;Paris&amp;quot;) ≈ Embedding(&amp;quot;London&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These vectors live in an abstract &lt;strong&gt;vector space&lt;/strong&gt;, where distance encodes similarity.&lt;/p&gt;
&lt;p&gt;LLMs use embeddings not just for input, but throughout every layer of their neural network to understand relationships, context, and meaning.&lt;/p&gt;
&lt;h2&gt;Vector Search and Semantic Understanding&lt;/h2&gt;
&lt;p&gt;Because embeddings encode meaning, they’re also incredibly useful for &lt;strong&gt;semantic search&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Instead of matching exact words (like keyword search), vector search compares embeddings to find text that’s &lt;em&gt;conceptually&lt;/em&gt; similar.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Query: &amp;quot;How do I fix a leaking pipe?&amp;quot;&lt;/li&gt;
&lt;li&gt;Match: &amp;quot;Plumbing repair for minor water leaks&amp;quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Even though the words don’t overlap, the &lt;strong&gt;meaning&lt;/strong&gt; does - and that’s what embeddings capture.&lt;/p&gt;
&lt;p&gt;This is the foundation for many powerful AI techniques like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Document similarity&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retrieval-Augmented Generation (RAG)&lt;/strong&gt; (more on this in Blog 3)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context injection from external data sources&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Context Windows: The Model’s Working Memory&lt;/h2&gt;
&lt;p&gt;Another crucial concept in LLMs is the &lt;strong&gt;context window&lt;/strong&gt;: the maximum number of tokens the model can “see” at once.&lt;/p&gt;
&lt;p&gt;Every input to an LLM gets broken into &lt;strong&gt;tokens&lt;/strong&gt;, and the model has a limited capacity for how many tokens it can process per request.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Max Context Window&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-3.5&lt;/td&gt;
&lt;td&gt;4,096 tokens (~3,000 words)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4 Turbo&lt;/td&gt;
&lt;td&gt;Up to 128,000 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 3 Opus&lt;/td&gt;
&lt;td&gt;Up to 200,000 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you go over the limit, you’ll need to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Truncate input (losing information)&lt;/li&gt;
&lt;li&gt;Summarize&lt;/li&gt;
&lt;li&gt;Use techniques like RAG or memory management&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: The larger the context window, the more the model can “remember” during a conversation or task.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Limitations of Embeddings and Context Windows&lt;/h2&gt;
&lt;p&gt;Even though LLMs are powerful, they come with trade-offs:&lt;/p&gt;
&lt;h3&gt;Embedding limitations:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Don’t always reflect &lt;strong&gt;nuanced context&lt;/strong&gt; (e.g., sarcasm, tone)&lt;/li&gt;
&lt;li&gt;Fixed dimensionality: can’t represent &lt;em&gt;everything&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Require separate handling for different modalities (text vs images)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Context window limitations:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Long documents may get truncated or ignored&lt;/li&gt;
&lt;li&gt;Memory is &lt;em&gt;not&lt;/em&gt; persistent - everything resets after a session unless you manually re-include previous context&lt;/li&gt;
&lt;li&gt;More tokens = higher latency and cost&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These limits are precisely why so much effort goes into &lt;strong&gt;enhancing&lt;/strong&gt; LLMs through fine-tuning, retrieval systems, and smarter prompt engineering.&lt;/p&gt;
&lt;p&gt;We’ll dive into that next.&lt;/p&gt;
&lt;h2&gt;Recap: Key Concepts from This Post&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;What It Is&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Embeddings&lt;/td&gt;
&lt;td&gt;Vector representations of tokens/text&lt;/td&gt;
&lt;td&gt;Enable semantic understanding &amp;amp; search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector Space&lt;/td&gt;
&lt;td&gt;Mathematical space where embeddings live&lt;/td&gt;
&lt;td&gt;Allows similarity comparison &amp;amp; clustering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context Window&lt;/td&gt;
&lt;td&gt;Max token size per LLM input&lt;/td&gt;
&lt;td&gt;Defines how much the model can “see”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Attention&lt;/td&gt;
&lt;td&gt;Weighs token relationships dynamically&lt;/td&gt;
&lt;td&gt;Enables context awareness in LLMs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;🔮 Up Next: Making LLMs Smarter with Fine-Tuning, Prompt Engineering, and RAG&lt;/h2&gt;
&lt;p&gt;In our next post, we’ll show how to &lt;strong&gt;enhance LLM performance&lt;/strong&gt; using proven techniques:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fine-tuning&lt;/li&gt;
&lt;li&gt;Prompt engineering&lt;/li&gt;
&lt;li&gt;Retrieval-Augmented Generation (RAG)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These strategies help you move beyond limitations - and get the most out of your models.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Journey from AI to LLMs and MCP - 1 - What Is AI and How It Evolved Into LLMs</title><link>https://iceberglakehouse.com/posts/2025-04-What-is-AI-and-How-It-Evolved-Into-LLMs/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-04-What-is-AI-and-How-It-Evolved-Into-LLMs/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2025-04-AI-Agents-MCP-01/).

## Free Res...</description><pubDate>Sat, 05 Apr 2025 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2025-04-AI-Agents-MCP-01/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Artificial Intelligence (AI) has become the defining technology of the decade. From chatbots to code generators, from self-driving cars to predictive text - AI systems are everywhere. But before we dive into the cutting-edge world of large language models (LLMs), let’s rewind and understand where this all began.&lt;/p&gt;
&lt;p&gt;This post kicks off our 10-part series exploring how AI evolved into LLMs, how to enhance their capabilities, and how the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; is shaping the future of intelligent, modular agents.&lt;/p&gt;
&lt;h2&gt;🧠 A Brief History of AI&lt;/h2&gt;
&lt;p&gt;The term &amp;quot;Artificial Intelligence&amp;quot; was coined in 1956, but the idea has been around even longer - think mechanical automatons and Alan Turing’s famous question: &lt;em&gt;&amp;quot;Can machines think?&amp;quot;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;AI development has gone through several distinct waves:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Symbolic AI (1950s–1980s)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Also known as &amp;quot;Good Old-Fashioned AI,&amp;quot; symbolic systems were rule-based. Think expert systems, logic programming, and hand-coded decision trees. These systems could play chess or diagnose medical conditions - if you wrote enough rules.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;: Rigid, brittle, and poor at handling ambiguity.&lt;/p&gt;
&lt;h3&gt;2. &lt;strong&gt;Machine Learning (1990s–2010s)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Instead of coding rules manually, we trained models to recognize patterns from data. Algorithms like decision trees, support vector machines, and early neural networks emerged.&lt;/p&gt;
&lt;p&gt;This era gave us:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Spam filters&lt;/li&gt;
&lt;li&gt;Fraud detection&lt;/li&gt;
&lt;li&gt;Recommendation engines&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But while powerful, these models still had a hard time with natural language and context.&lt;/p&gt;
&lt;h3&gt;3. &lt;strong&gt;Deep Learning (2010s–Now)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;With more data, better algorithms, and stronger GPUs, neural networks started outperforming traditional methods. Deep learning led to breakthroughs in:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Image recognition (CNNs)&lt;/li&gt;
&lt;li&gt;Speech recognition (RNNs, LSTMs)&lt;/li&gt;
&lt;li&gt;Language understanding (Transformers)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And that brings us to the latest evolution...&lt;/p&gt;
&lt;h2&gt;🧬 Enter LLMs: The Rise of Language-First AI&lt;/h2&gt;
&lt;p&gt;Large Language Models (LLMs) like GPT-4, Claude, and Gemini aren’t just another step in AI - they represent a leap. Trained on massive text corpora using &lt;strong&gt;transformer architectures&lt;/strong&gt;, these models can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Write essays and poems&lt;/li&gt;
&lt;li&gt;Generate and debug code&lt;/li&gt;
&lt;li&gt;Translate between languages&lt;/li&gt;
&lt;li&gt;Answer complex questions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All by predicting the next word in a sentence.&lt;/p&gt;
&lt;p&gt;But what makes LLMs so powerful?&lt;/p&gt;
&lt;h2&gt;🏗️ LLMs Are More Than Just Big Neural Nets&lt;/h2&gt;
&lt;p&gt;At their core, LLMs are massive deep learning models that turn &lt;strong&gt;tokens (words/pieces of words)&lt;/strong&gt; into &lt;strong&gt;vectors (mathematical representations)&lt;/strong&gt;. Through billions of parameters, they learn the structure of language and the latent meaning within it.&lt;/p&gt;
&lt;p&gt;Key components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tokenization&lt;/strong&gt;: Breaking input into chunks the model can process&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Embeddings&lt;/strong&gt;: Mapping tokens to vector space&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Attention Mechanisms&lt;/strong&gt;: Letting the model focus on relevant parts of the input&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context Window&lt;/strong&gt;: A memory buffer for how much input the model can “see”&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Popular LLMs:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Context Window&lt;/th&gt;
&lt;th&gt;Notable Feature&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Up to 128k&lt;/td&gt;
&lt;td&gt;Code + natural language synergy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 3&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;Up to 200k&lt;/td&gt;
&lt;td&gt;Strong at instruction following&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;Google DeepMind&lt;/td&gt;
&lt;td&gt;~32k+&lt;/td&gt;
&lt;td&gt;Multimodal capabilities&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;🧩 What LLMs Can (and Can’t) Do&lt;/h2&gt;
&lt;p&gt;LLMs are versatile and impressive - but they&apos;re not magic. Their strengths come with real limitations:&lt;/p&gt;
&lt;h3&gt;✅ What they’re great at:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Text generation and summarization&lt;/li&gt;
&lt;li&gt;Conversational interfaces&lt;/li&gt;
&lt;li&gt;Programming assistance&lt;/li&gt;
&lt;li&gt;Knowledge retrieval from training data&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;❌ What they struggle with:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Memory&lt;/strong&gt;: No persistent memory across sessions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context limits&lt;/strong&gt;: Can only “see” a fixed number of tokens&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reasoning&lt;/strong&gt;: Struggles with complex multi-step logic&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-time data&lt;/strong&gt;: Can’t access up-to-date or private information&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action-taking&lt;/strong&gt;: Can&apos;t interact with tools or APIs by default&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is where the next evolution comes in: &lt;strong&gt;augmenting LLMs&lt;/strong&gt; with context, tools, and workflows.&lt;/p&gt;
&lt;h2&gt;🔮 The Road Ahead: From Models to Modular AI Agents&lt;/h2&gt;
&lt;p&gt;We’ve gone from rules to learning, from deep learning to LLMs - but we’re not done yet. The future of AI lies in making LLMs &lt;em&gt;do more than just talk&lt;/em&gt;. We need to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Give them memory&lt;/li&gt;
&lt;li&gt;Let them interact with data&lt;/li&gt;
&lt;li&gt;Enable them to call tools, services, and APIs&lt;/li&gt;
&lt;li&gt;Help them make decisions and reason through complex tasks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This brings us to the idea of &lt;strong&gt;AI Agents&lt;/strong&gt; - autonomous systems built on LLMs that can perceive, decide, and act.&lt;/p&gt;
&lt;h3&gt;🧭 Coming Up Next&lt;/h3&gt;
&lt;p&gt;In our next post, we’ll explore &lt;strong&gt;how LLMs actually work&lt;/strong&gt; under the hood - digging into embeddings, vector spaces, and how models “understand” language.&lt;/p&gt;
&lt;p&gt;Stay tuned.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Building a Basic MCP Server with Python</title><link>https://iceberglakehouse.com/posts/2025-04-basics-of-making-mcp-server/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-04-basics-of-making-mcp-server/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Fri, 04 Apr 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=mcp_basic&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=mcp_basic&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you’ve ever wished you could ask an AI model like Claude to interact with your local files or run custom code - good news: &lt;strong&gt;you can.&lt;/strong&gt; That’s exactly what the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; makes possible.&lt;/p&gt;
&lt;p&gt;In this tutorial, we’ll walk you through building a beginner-friendly &lt;strong&gt;MCP server&lt;/strong&gt; that acts as a simple template for future projects. You don’t need to be an expert in AI or server development - we’ll explain each part as we go.&lt;/p&gt;
&lt;p&gt;Here’s what we’ll build:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A small server using Python and the &lt;strong&gt;MCP SDK&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Two useful &lt;strong&gt;tools&lt;/strong&gt; that read data from:
&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;CSV file&lt;/strong&gt; (great for spreadsheets and tabular data)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Parquet file&lt;/strong&gt; (a format often used in data engineering and analytics)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;A clean folder structure that makes it easy to add new tools or features later&lt;/li&gt;
&lt;li&gt;A working connection to &lt;strong&gt;Claude for Desktop&lt;/strong&gt;, so you can ask things like:
&lt;blockquote&gt;
&lt;p&gt;“Summarize the contents of my data file”&lt;br&gt;
“How many rows and columns are in this CSV?”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Why Start Here?&lt;/h3&gt;
&lt;p&gt;This blog is perfect for you if:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You’ve heard about Claude and want to connect it to your own tools or data&lt;/li&gt;
&lt;li&gt;You’re curious about MCP and want to see how it works in practice&lt;/li&gt;
&lt;li&gt;You’d like a solid starting point for building more advanced tool servers later&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We’ll use plain Python and some common libraries like &lt;code&gt;pandas&lt;/code&gt;, with no web frameworks or deployment complexity. Everything will run locally on your machine.&lt;/p&gt;
&lt;p&gt;By the end, you’ll have a fully working &lt;strong&gt;local MCP server&lt;/strong&gt; and a better understanding of how to make AI tools that go beyond text prediction - and actually do useful work.&lt;/p&gt;
&lt;p&gt;Let’s get started!&lt;/p&gt;
&lt;h2&gt;What Is MCP (and Why Should You Care)?&lt;/h2&gt;
&lt;p&gt;Let’s break this down before we start writing code.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MCP&lt;/strong&gt; stands for &lt;strong&gt;Model Context Protocol&lt;/strong&gt;. It’s a way to let apps like Claude for Desktop securely interact with &lt;strong&gt;external data&lt;/strong&gt; and &lt;strong&gt;custom tools&lt;/strong&gt; that you define.&lt;/p&gt;
&lt;p&gt;Think of it like building your own mini API - but instead of exposing it to the whole internet, you’re exposing it to an AI assistant on your machine.&lt;/p&gt;
&lt;p&gt;With MCP, you can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Let Claude read a file or query a database&lt;/li&gt;
&lt;li&gt;Create tools that do useful things (like summarize a dataset or fetch an API)&lt;/li&gt;
&lt;li&gt;Add reusable prompts to guide how Claude behaves in certain tasks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For this project, we’re focusing on &lt;strong&gt;tools&lt;/strong&gt;: the part of MCP that lets you write small Python functions the AI can call.&lt;/p&gt;
&lt;h3&gt;What We’re Building&lt;/h3&gt;
&lt;p&gt;Here’s a quick preview of what you’ll end up with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A local MCP server called &lt;code&gt;mix_server&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Two tools: one that reads a CSV file, and one that reads a Parquet file&lt;/li&gt;
&lt;li&gt;A clean, modular folder layout so you can keep adding more tools later&lt;/li&gt;
&lt;li&gt;A working connection to Claude for Desktop so you can talk to your tools through natural language&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let’s start by setting up your project.&lt;/p&gt;
&lt;h2&gt;Project Setup (Step-by-Step)&lt;/h2&gt;
&lt;p&gt;We’ll use &lt;a href=&quot;https://github.com/astral-sh/uv&quot;&gt;&lt;strong&gt;uv&lt;/strong&gt;&lt;/a&gt;: a fast, modern Python project manager, to create and manage our environment. It handles dependencies, virtual environments, and script execution, all in one place.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you’ve used &lt;code&gt;pip&lt;/code&gt; or &lt;code&gt;virtualenv&lt;/code&gt; before, uv is like both of those combined - but much faster and more ergonomic.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Step 1: Install &lt;code&gt;uv&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;To install &lt;code&gt;uv&lt;/code&gt;, run this in your terminal:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -LsSf https://astral.sh/uv/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then restart your terminal so the uv command is available.&lt;/p&gt;
&lt;p&gt;You can check that it&apos;s working with:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;uv --version
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 2: Create the Project&lt;/h3&gt;
&lt;p&gt;Let’s make a new folder for our MCP server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;uv init mix_server
cd mix_server
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates a basic Python project with a pyproject.toml file to manage dependencies.&lt;/p&gt;
&lt;h3&gt;Step 3: Set Up a Virtual Environment&lt;/h3&gt;
&lt;p&gt;We’ll now create a virtual environment for our project and activate it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;uv venv
source .venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This keeps your dependencies isolated from the rest of your system.&lt;/p&gt;
&lt;h3&gt;Step 4: Add Required Dependencies&lt;/h3&gt;
&lt;p&gt;We’re going to install three key packages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;mcp[cli]&lt;/code&gt;: The official MCP SDK and command-line tools&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;pandas&lt;/code&gt;: For reading CSV and Parquet files&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;pyarrow&lt;/code&gt;: Adds support for reading Parquet files via Pandas&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Install them using:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;uv add &amp;quot;mcp[cli]&amp;quot; pandas pyarrow
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This updates your pyproject.toml and installs the packages into your environment.&lt;/p&gt;
&lt;h3&gt;Step 5: Create a Clean Folder Structure&lt;/h3&gt;
&lt;p&gt;We’ll use the following layout to stay organized:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;mix_server/
│
├── data/                 # Sample CSV and Parquet files
│
├── tools/                # MCP tool definitions
│
├── utils/                # Reusable file reading logic
│
├── server.py             # Creates the Server
├── main.py             # Entry point for the MCP server
└── README.md             # Optional documentation
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create the folders:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;mkdir data tools utils
touch server.py
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Your environment is now ready. In the next section, we’ll create a couple of small data files to work with: a CSV and a Parquet file, and use them to power our tools.&lt;/p&gt;
&lt;h2&gt;Creating Sample Data Files&lt;/h2&gt;
&lt;p&gt;To build our first tools, we need something for them to work with. In this section, we’ll create two simple files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;CSV file&lt;/strong&gt; (great for spreadsheets and tabular data)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Parquet file&lt;/strong&gt; (a more efficient format used in data engineering)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Both files will contain the same mock dataset: a short list of users. You’ll use these files later when building tools that summarize their contents.&lt;/p&gt;
&lt;h3&gt;Step 1: Create the &lt;code&gt;data/&lt;/code&gt; Folder&lt;/h3&gt;
&lt;p&gt;If you haven’t already created the folder for our data, do it now from your project root:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;mkdir data
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 2: Create a Sample CSV File&lt;/h3&gt;
&lt;p&gt;Now let’s add a sample CSV file with some fake user data.&lt;/p&gt;
&lt;p&gt;Create a new file called sample.csv inside the data/ folder:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;data/sample.csv
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And paste the following into it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-csv&quot;&gt;id,name,email,signup_date
1,Alice Johnson,alice@example.com,2023-01-15
2,Bob Smith,bob@example.com,2023-02-22
3,Carol Lee,carol@example.com,2023-03-10
4,David Wu,david@example.com,2023-04-18
5,Eva Brown,eva@example.com,2023-05-30
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This file gives us structured, readable data - perfect for a tool to analyze.&lt;/p&gt;
&lt;h3&gt;Step 3: Convert the CSV to Parquet&lt;/h3&gt;
&lt;p&gt;We’ll now create a Parquet version of the same data using Python. This shows how easily you can support both file types in your tools.&lt;/p&gt;
&lt;p&gt;Create a short script in the root of your project called generate_parquet.py:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# generate_parquet.py

import pandas as pd

# Read the CSV
df = pd.read_csv(&amp;quot;data/sample.csv&amp;quot;)

# Save as Parquet
df.to_parquet(&amp;quot;data/sample.parquet&amp;quot;, index=False)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the script:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;uv run generate_parquet.py
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After this, your data/ folder should look like:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;data/
├── sample.csv
└── sample.parquet
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;What’s the Difference Between CSV and Parquet?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;CSV:&lt;/strong&gt; Simple, human-readable text file. Great for small datasets and quick inspection.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Parquet:&lt;/strong&gt; A binary, column-based format. Much faster for large datasets and common in analytics pipelines (e.g. with Apache Spark or Dremio).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Supporting both formats makes your tools more flexible, and this example shows how little extra effort it takes.&lt;/p&gt;
&lt;p&gt;Next, we’ll write some reusable utility functions that can read these files and return a quick summary of their contents - ready to be wrapped as MCP tools.&lt;/p&gt;
&lt;h2&gt;Writing Utility Functions to Read CSV and Parquet Files&lt;/h2&gt;
&lt;p&gt;Now that we have some data to work with, let’s write the core logic to read those files and return a basic summary.&lt;/p&gt;
&lt;p&gt;We’re going to put this logic in a separate Python file under a folder called &lt;code&gt;utils/&lt;/code&gt;. This makes it easy to reuse across different tools without duplicating code.&lt;/p&gt;
&lt;h3&gt;Step 1: Create the Utility Module&lt;/h3&gt;
&lt;p&gt;If you haven’t already created the &lt;code&gt;utils/&lt;/code&gt; folder, do it now:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;mkdir utils
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now create a new Python file inside it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;touch utils/file_reader.py
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 2: Add File Reading Functions&lt;/h3&gt;
&lt;p&gt;Open utils/file_reader.py and paste in the following code:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# utils/file_reader.py

import pandas as pd
from pathlib import Path

# Base directory where our data lives
DATA_DIR = Path(__file__).resolve().parent.parent / &amp;quot;data&amp;quot;

def read_csv_summary(filename: str) -&amp;gt; str:
    &amp;quot;&amp;quot;&amp;quot;
    Read a CSV file and return a simple summary.

    Args:
        filename: Name of the CSV file (e.g. &apos;sample.csv&apos;)

    Returns:
        A string describing the file&apos;s contents.
    &amp;quot;&amp;quot;&amp;quot;
    file_path = DATA_DIR / filename
    df = pd.read_csv(file_path)
    return f&amp;quot;CSV file &apos;{filename}&apos; has {len(df)} rows and {len(df.columns)} columns.&amp;quot;

def read_parquet_summary(filename: str) -&amp;gt; str:
    &amp;quot;&amp;quot;&amp;quot;
    Read a Parquet file and return a simple summary.

    Args:
        filename: Name of the Parquet file (e.g. &apos;sample.parquet&apos;)

    Returns:
        A string describing the file&apos;s contents.
    &amp;quot;&amp;quot;&amp;quot;
    file_path = DATA_DIR / filename
    df = pd.read_parquet(file_path)
    return f&amp;quot;Parquet file &apos;{filename}&apos; has {len(df)} rows and {len(df.columns)} columns.&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;How This Works&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;We’re using &lt;code&gt;pandas&lt;/code&gt; to read both &lt;code&gt;CSV&lt;/code&gt; and &lt;code&gt;Parquet&lt;/code&gt; files. It’s a well-known data analysis library in Python.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;pathlib.Path&lt;/code&gt; helps us safely construct file paths across operating systems.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Both functions return a simple string like:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;CSV file &apos;sample.csv&apos; has 5 rows and 4 columns.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is all the logic our tools will need to start with. Later, if you want to add more advanced summaries: like listing column names or detecting null values, you can expand these functions.&lt;/p&gt;
&lt;p&gt;With our utilities ready, we can now expose them as MCP tools - so Claude can actually use them!&lt;/p&gt;
&lt;h2&gt;Wrapping File Readers as MCP Tools&lt;/h2&gt;
&lt;p&gt;Now that we’ve written the logic to read and summarize our data files, it’s time to make those functions available to Claude through &lt;strong&gt;MCP tools&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;What’s an MCP Tool?&lt;/h3&gt;
&lt;p&gt;An &lt;strong&gt;MCP tool&lt;/strong&gt; is a Python function you register with your MCP server that the AI can call when it needs to take action - like reading a file, querying an API, or performing a calculation.&lt;/p&gt;
&lt;p&gt;To register a tool, you decorate the function with &lt;code&gt;@mcp.tool()&lt;/code&gt;. Behind the scenes, MCP generates a definition that the AI can see and interact with.&lt;/p&gt;
&lt;p&gt;But before we do that, let’s follow a best practice: &lt;strong&gt;we’ll define our MCP server instance in one central place&lt;/strong&gt;, then import it into each file that defines tools. This ensures everything stays clean and consistent.&lt;/p&gt;
&lt;h3&gt;Step 1: Define the MCP Server Instance&lt;/h3&gt;
&lt;p&gt;Open your &lt;code&gt;server.py&lt;/code&gt; and &lt;code&gt;main.py&lt;/code&gt; files (or create it if you haven’t already), and add the following:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# server.py

from mcp.server.fastmcp import FastMCP

# This is the shared MCP server instance
mcp = FastMCP(&amp;quot;mix_server&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from server import mcp

# Entry point to run the server
if __name__ == &amp;quot;__main__&amp;quot;:
    mcp.run()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates a named server called &amp;quot;mix_server&amp;quot; and exposes a simple run command.&lt;/p&gt;
&lt;h3&gt;Step 2: Create the CSV Tool&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Let’s now define our first tool:&lt;/strong&gt; one that summarizes a CSV file.&lt;/p&gt;
&lt;p&gt;Create a new file called &lt;code&gt;csv_tools.py&lt;/code&gt; inside the &lt;code&gt;tools/&lt;/code&gt; folder:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;touch tools/csv_tools.py
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then add the following:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# tools/csv_tools.py

from server import mcp
from utils.file_reader import read_csv_summary

@mcp.tool()
def summarize_csv_file(filename: str) -&amp;gt; str:
    &amp;quot;&amp;quot;&amp;quot;
    Summarize a CSV file by reporting its number of rows and columns.

    Args:
        filename: Name of the CSV file in the /data directory (e.g., &apos;sample.csv&apos;)

    Returns:
        A string describing the file&apos;s dimensions.
    &amp;quot;&amp;quot;&amp;quot;
    return read_csv_summary(filename)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 3: Create the Parquet Tool&lt;/h3&gt;
&lt;p&gt;Now let’s do the same for a Parquet file.&lt;/p&gt;
&lt;p&gt;Create a file called &lt;code&gt;parquet_tools.py&lt;/code&gt; inside the &lt;code&gt;tools/&lt;/code&gt; folder:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;touch tools/parquet_tools.py
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And add:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# tools/parquet_tools.py

from server import mcp
from utils.file_reader import read_parquet_summary

@mcp.tool()
def summarize_parquet_file(filename: str) -&amp;gt; str:
    &amp;quot;&amp;quot;&amp;quot;
    Summarize a Parquet file by reporting its number of rows and columns.

    Args:
        filename: Name of the Parquet file in the /data directory (e.g., &apos;sample.parquet&apos;)

    Returns:
        A string describing the file&apos;s dimensions.
    &amp;quot;&amp;quot;&amp;quot;
    return read_parquet_summary(filename)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 4: Register the Tools&lt;/h3&gt;
&lt;p&gt;Since the tools are registered via decorators at import time, we just need to make sure the server.py file imports the tool modules. Add these lines at the top of server.py:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# main.py

from server import mcp

# Import tools so they get registered via decorators
import tools.csv_tools
import tools.parquet_tools

# Entry point to run the server
if __name__ == &amp;quot;__main__&amp;quot;:
    mcp.run()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, whenever the server runs, it automatically registers all tools via the @mcp.tool() decorators.&lt;/p&gt;
&lt;p&gt;Your tools are now live! In the next section, we’ll walk through how to run the server and connect it to Claude for Desktop so you can test them out in natural language.&lt;/p&gt;
&lt;h2&gt;Running and Testing Your MCP Server with Claude for Desktop&lt;/h2&gt;
&lt;p&gt;At this point, you’ve built a functional MCP server with two tools: one for reading CSV files and another for Parquet. Now it’s time to bring it to life and connect it to &lt;strong&gt;Claude for Desktop&lt;/strong&gt;, so you can start running your tools using plain English.&lt;/p&gt;
&lt;h3&gt;Step 1: Run the Server&lt;/h3&gt;
&lt;p&gt;Let’s start your server locally.&lt;/p&gt;
&lt;p&gt;In your project root (where &lt;code&gt;server.py&lt;/code&gt; lives), run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;uv run main.py
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This starts your MCP server using the tools you defined. You won’t see much output in the terminal just yet, that’s normal. Your server is now waiting for a connection from a client like Claude.&lt;/p&gt;
&lt;h3&gt;Step 2: Install Claude for Desktop (If You Haven’t Already)&lt;/h3&gt;
&lt;p&gt;You’ll need Claude for Desktop installed to connect to your server.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Download it here:&lt;/strong&gt; https://www.anthropic.com/claude&lt;/p&gt;
&lt;p&gt;Follow the installation instructions for your operating system&lt;/p&gt;
&lt;p&gt;Note: As of now, Claude for Desktop is not available on Linux. If you’re on Linux, skip ahead to the section on building your own MCP client.&lt;/p&gt;
&lt;h3&gt;Step 3: Configure Claude to Use Your Server&lt;/h3&gt;
&lt;p&gt;Claude needs to know where to find your MCP server. You’ll do this by editing a small config file on your system.&lt;/p&gt;
&lt;h4&gt;MacOS / Linux:&lt;/h4&gt;
&lt;p&gt;Open this file in your code editor (create it if it doesn’t exist):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;code ~/Library/Application\ Support/Claude/claude_desktop_config.json
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Windows:&lt;/h4&gt;
&lt;p&gt;The config file is located here:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shell&quot;&gt;%APPDATA%\Claude\claude_desktop_config.json
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 4: Add Your Server to the Config&lt;/h3&gt;
&lt;p&gt;Paste the following JSON into the file, replacing the &amp;quot;/ABSOLUTE/PATH/...&amp;quot; with the actual full path to your mix_server project folder:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;mix_server&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
      &amp;quot;args&amp;quot;: [
        &amp;quot;--directory&amp;quot;,
        &amp;quot;/ABSOLUTE/PATH/TO/mix_server&amp;quot;,
        &amp;quot;run&amp;quot;,
        &amp;quot;main.py&amp;quot;
      ]
    }
  }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tip: To find the absolute path:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;On Mac/Linux:&lt;/strong&gt; Run pwd in your terminal&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;On Windows:&lt;/strong&gt; Use cd and copy the full path from File Explorer&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Make sure uv is in your system PATH, or replace &amp;quot;command&amp;quot;:&lt;/strong&gt; &amp;quot;uv&amp;quot; with the full path to the uv executable.&lt;/p&gt;
&lt;h3&gt;Step 5: Restart Claude for Desktop&lt;/h3&gt;
&lt;p&gt;Restart the app, and you should see a new tool icon (hammer) appear in the interface. Click it, and you’ll see your registered tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;summarize_csv_file&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;summarize_parquet_file&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These can now be called directly by the AI!&lt;/p&gt;
&lt;h3&gt;Step 6: Try It Out&lt;/h3&gt;
&lt;p&gt;Now try asking Claude something like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;quot;Summarize the CSV file named sample.csv.&amp;quot;&lt;/li&gt;
&lt;li&gt;&amp;quot;How many rows are in sample.parquet?&amp;quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Claude will detect the appropriate tool, call your server, and respond with the results - powered by the very Python code you wrote.&lt;/p&gt;
&lt;h3&gt;Troubleshooting Tips&lt;/h3&gt;
&lt;p&gt;If things don’t work right away, here are a few things to check:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Make sure your &lt;code&gt;uv run main.py&lt;/code&gt; process is running and hasn&apos;t crashed&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Ensure the file paths in your config JSON are correct&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Confirm that your data files (&lt;code&gt;sample.csv&lt;/code&gt;, &lt;code&gt;sample.parquet&lt;/code&gt;) exist in the &lt;code&gt;/data&lt;/code&gt; directory&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Check the Claude UI for error messages or tool-loading indicators&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You now have a working local AI toolchain powered by MCP! In the final section, we’ll do a quick recap and show how you can build on this template for more powerful tools.&lt;/p&gt;
&lt;h2&gt;Recap and Next Steps&lt;/h2&gt;
&lt;p&gt;Congratulations - you just built your first MCP server!&lt;/p&gt;
&lt;p&gt;Let’s take a moment to review what you’ve accomplished.&lt;/p&gt;
&lt;h3&gt;What You Built&lt;/h3&gt;
&lt;p&gt;By following this guide, you now have a fully working &lt;strong&gt;MCP server&lt;/strong&gt; that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Uses Python and the official &lt;code&gt;mcp&lt;/code&gt; SDK&lt;/li&gt;
&lt;li&gt;Reads real data from both &lt;strong&gt;CSV&lt;/strong&gt; and &lt;strong&gt;Parquet&lt;/strong&gt; files&lt;/li&gt;
&lt;li&gt;Exposes two custom &lt;strong&gt;MCP tools&lt;/strong&gt; that Claude for Desktop can call:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;summarize_csv_file&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;summarize_parquet_file&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Follows a clean, modular folder structure&lt;/li&gt;
&lt;li&gt;Runs locally using &lt;code&gt;uv&lt;/code&gt; and connects seamlessly to Claude for natural language interaction&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;You also learned how to:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Set up your Python project with &lt;code&gt;uv&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Manage dependencies cleanly&lt;/li&gt;
&lt;li&gt;Register and expose tools using the &lt;code&gt;@mcp.tool()&lt;/code&gt; decorator&lt;/li&gt;
&lt;li&gt;Wire everything together with Claude through a simple config file&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Where to Go From Here&lt;/h3&gt;
&lt;p&gt;This project was intentionally simple so you could focus on learning the structure and flow of an MCP server. But this is just the beginning.&lt;/p&gt;
&lt;p&gt;Here are a few ideas for extending this template:&lt;/p&gt;
&lt;h4&gt;1. &lt;strong&gt;Add More Advanced Tools&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Try building tools that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Filter rows based on a column value&lt;/li&gt;
&lt;li&gt;Return column names or data types&lt;/li&gt;
&lt;li&gt;Calculate statistics (mean, median, etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;2. &lt;strong&gt;Use Resources&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Use &lt;code&gt;@mcp.resource()&lt;/code&gt; to expose static or dynamic data that Claude can pull into its context before making a decision.&lt;/p&gt;
&lt;h4&gt;3. &lt;strong&gt;Explore Prompts&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Create reusable interaction templates with &lt;code&gt;@mcp.prompt()&lt;/code&gt; to guide how Claude asks or responds.&lt;/p&gt;
&lt;h4&gt;4. &lt;strong&gt;Add Async Logic&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;If you’re pulling data from APIs or databases, consider making your tools async using &lt;code&gt;async def&lt;/code&gt; - fully supported by FastMCP.&lt;/p&gt;
&lt;h4&gt;5. &lt;strong&gt;Build Your Own Client&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Not using Claude? You can write your own MCP-compatible client using the SDK’s &lt;code&gt;ClientSession&lt;/code&gt; interface.&lt;/p&gt;
&lt;h3&gt;Share and Reuse&lt;/h3&gt;
&lt;p&gt;You now have a &lt;strong&gt;template&lt;/strong&gt; you can reuse for future projects. If you publish it on GitHub, others can fork it, extend it, and learn from it too.&lt;/p&gt;
&lt;p&gt;This isn’t just a demo - it’s the foundation of a toolchain where you can define your own AI-powered workflows and expose them to LLMs in a controlled, modular way.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Using Helm with Kubernetes - A Guide to Helm Charts and Their Implementation</title><link>https://iceberglakehouse.com/posts/2025-02-using-helm-with-kubernetes/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-02-using-helm-with-kubernetes/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Wed, 19 Feb 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=using_helm_charts&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse-benefits-solu&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Managing applications in Kubernetes can be complex, requiring multiple YAML files to define resources such as Deployments, Services, ConfigMaps, and Secrets. As applications scale, maintaining and updating these configurations manually becomes cumbersome and error-prone. This is where &lt;strong&gt;Helm&lt;/strong&gt; comes in.&lt;/p&gt;
&lt;p&gt;Helm is a &lt;strong&gt;package manager for Kubernetes&lt;/strong&gt; that simplifies deployment by bundling application configurations into reusable, version-controlled &lt;strong&gt;Helm charts&lt;/strong&gt;. With Helm, you can deploy applications with a single command, manage updates seamlessly, and roll back to previous versions if needed.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Why Use Helm?&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Simplifies Deployments&lt;/strong&gt; – Deploy complex applications with a single command instead of managing multiple YAML files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parameterization &amp;amp; Reusability&lt;/strong&gt; – Configure deployments dynamically using &lt;code&gt;values.yaml&lt;/code&gt;, making it easy to manage multiple environments (dev, staging, prod).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Version Control &amp;amp; Rollbacks&lt;/strong&gt; – Helm tracks deployments, allowing you to roll back to previous versions in case of failures.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dependency Management&lt;/strong&gt; – Install and manage application dependencies effortlessly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Integration with CI/CD &amp;amp; GitOps&lt;/strong&gt; – Automate deployments with tools like &lt;strong&gt;ArgoCD&lt;/strong&gt;, &lt;strong&gt;FluxCD&lt;/strong&gt;, and &lt;strong&gt;GitHub Actions&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;What You&apos;ll Learn in This Guide&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;In this blog, we’ll cover:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;What Helm is and how it works&lt;/strong&gt; – Understanding its architecture and components.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Installing and configuring Helm&lt;/strong&gt; – Setting up Helm for your Kubernetes cluster.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Understanding Helm charts&lt;/strong&gt; – Exploring chart structure, templates, and values.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Writing your own Helm chart&lt;/strong&gt; – Step-by-step guide to creating a custom chart.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deploying applications with Helm&lt;/strong&gt; – Installing, upgrading, and rolling back releases.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Best practices for Helm in production&lt;/strong&gt; – Security, GitOps integration, and monitoring.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By the end of this guide, you&apos;ll have a strong foundation in Helm and be able to deploy, manage, and scale Kubernetes applications efficiently.&lt;/p&gt;
&lt;h2&gt;Understanding Helm: The Package Manager for Kubernetes&lt;/h2&gt;
&lt;h3&gt;What is Helm?&lt;/h3&gt;
&lt;p&gt;Helm is a &lt;strong&gt;package manager for Kubernetes&lt;/strong&gt; that helps deploy, configure, and manage applications in a Kubernetes cluster. Instead of manually writing and applying multiple Kubernetes YAML manifests, Helm allows you to package them into reusable &lt;strong&gt;Helm Charts&lt;/strong&gt;, simplifying deployment and maintenance.&lt;/p&gt;
&lt;h3&gt;Why Use Helm?&lt;/h3&gt;
&lt;p&gt;Managing Kubernetes resources can become complex, especially when deploying applications with multiple components (Deployments, Services, ConfigMaps, Secrets, etc.). Helm provides several advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Simplifies Deployments&lt;/strong&gt; – Automates the process of applying multiple YAML files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Versioning &amp;amp; Rollbacks&lt;/strong&gt; – Tracks different versions of deployments and allows rollback if necessary.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parameterization &amp;amp; Reusability&lt;/strong&gt; – Uses a templating system (&lt;code&gt;values.yaml&lt;/code&gt;) to customize deployments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dependency Management&lt;/strong&gt; – Simplifies installing and upgrading application dependencies.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistent Configuration Across Environments&lt;/strong&gt; – Makes it easy to manage different configurations for dev, staging, and production.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;How Does Helm Compare to Traditional Kubernetes Manifests?&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Kubernetes YAML Manifests&lt;/th&gt;
&lt;th&gt;Helm Charts&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Management&lt;/td&gt;
&lt;td&gt;Requires manually applying multiple YAML files&lt;/td&gt;
&lt;td&gt;Uses a single Helm command&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration&lt;/td&gt;
&lt;td&gt;Static YAML definitions&lt;/td&gt;
&lt;td&gt;Dynamic templating via &lt;code&gt;values.yaml&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Version Control&lt;/td&gt;
&lt;td&gt;Difficult to track changes manually&lt;/td&gt;
&lt;td&gt;Built-in versioning &amp;amp; rollback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reusability&lt;/td&gt;
&lt;td&gt;Limited; each deployment needs its own YAML&lt;/td&gt;
&lt;td&gt;Reusable and configurable charts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependencies&lt;/td&gt;
&lt;td&gt;Managed manually&lt;/td&gt;
&lt;td&gt;Handled via &lt;code&gt;requirements.yaml&lt;/code&gt; (deprecated) or &lt;code&gt;Chart.yaml&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;How Helm Works&lt;/h2&gt;
&lt;h3&gt;Helm Components and Architecture&lt;/h3&gt;
&lt;p&gt;Helm follows a client-only architecture in &lt;strong&gt;Helm v3&lt;/strong&gt;, where it directly interacts with the Kubernetes API server without requiring a backend component like Tiller (which was used in Helm v2). Below are the core components of Helm:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Helm CLI&lt;/strong&gt; – The command-line interface used to manage Helm charts, releases, and repositories.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Helm Charts&lt;/strong&gt; – Packaged Kubernetes applications that define resources like Deployments, Services, ConfigMaps, and Secrets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Helm Repository&lt;/strong&gt; – A collection of Helm charts stored in a remote or local location (e.g., &lt;a href=&quot;https://artifacthub.io/&quot;&gt;Artifact Hub&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Helm Release&lt;/strong&gt; – A deployed instance of a Helm chart, stored as metadata inside the Kubernetes cluster.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kubernetes API Server&lt;/strong&gt; – Helm interacts with the Kubernetes API to apply resources as defined in the chart.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Helm Workflow: How Helm Manages Deployments&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Fetching Charts&lt;/strong&gt; – Helm can pull pre-built charts from repositories using &lt;code&gt;helm repo add&lt;/code&gt; and &lt;code&gt;helm search repo&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Templating and Rendering&lt;/strong&gt; – Helm dynamically replaces values in the YAML templates using the &lt;code&gt;values.yaml&lt;/code&gt; file before applying them.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Creating a Release&lt;/strong&gt; – When a Helm chart is installed, Helm assigns it a unique &lt;strong&gt;release name&lt;/strong&gt; and applies the rendered templates to the Kubernetes cluster.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Versioning and Rollbacks&lt;/strong&gt; – Helm maintains a history of releases, allowing easy upgrades (&lt;code&gt;helm upgrade&lt;/code&gt;) and rollbacks (&lt;code&gt;helm rollback&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Uninstalling Releases&lt;/strong&gt; – Helm can remove all associated Kubernetes resources using &lt;code&gt;helm uninstall&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Helm Command Lifecycle&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;helm repo add &amp;lt;repo-name&amp;gt; &amp;lt;repo-url&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Adds a Helm chart repository&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;helm search repo &amp;lt;keyword&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Searches for a chart in repositories&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;helm install &amp;lt;release-name&amp;gt; &amp;lt;chart-name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Installs a Helm chart and creates a release&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;helm list&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Lists all active Helm releases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;helm status &amp;lt;release-name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Shows details of a deployed release&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;helm upgrade &amp;lt;release-name&amp;gt; &amp;lt;chart-name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Upgrades an existing release to a new chart version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;helm rollback &amp;lt;release-name&amp;gt; &amp;lt;revision&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Rolls back a release to a previous version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;helm uninstall &amp;lt;release-name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Deletes a release and removes associated resources&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Helm in Action: A Simple Example&lt;/h3&gt;
&lt;p&gt;Let&apos;s say you want to deploy &lt;strong&gt;NGINX&lt;/strong&gt; using Helm. You can do this with a single command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-nginx bitnami/nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Adds the Bitnami Helm repository.&lt;/li&gt;
&lt;li&gt;Installs the NGINX Helm chart from the Bitnami repository.&lt;/li&gt;
&lt;li&gt;Creates a Helm release named my-nginx in the cluster.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To check the status of the deployment:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm list
helm status my-nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To uninstall the release:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm uninstall my-nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Installing and Configuring Helm&lt;/h2&gt;
&lt;p&gt;Before using Helm, you need to install it on your local machine and configure it to work with your Kubernetes cluster. This section will walk through the installation process and initial setup.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;Kubernetes cluster&lt;/strong&gt; running locally (e.g., Minikube, Kind) or in the cloud (e.g., AKS, GKE, EKS).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kubectl&lt;/code&gt; installed and configured to communicate with your cluster.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;Installing Helm&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Helm can be installed on macOS, Linux, and Windows using various package managers.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;macOS (Using Homebrew)&lt;/strong&gt;&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;brew install helm
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Linux (Using Script)&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Windows (Using Chocolatey)&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;choco install kubernetes-helm
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verifying the Installation&lt;/h3&gt;
&lt;p&gt;After installation, verify that Helm is installed correctly by running:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm version
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You should see output similar to:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-js&quot;&gt;version.BuildInfo{Version:&amp;quot;v3.x.x&amp;quot;, GitCommit:&amp;quot;...&amp;quot;, GitTreeState:&amp;quot;clean&amp;quot;, GoVersion:&amp;quot;...&amp;quot;}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configuring Helm&lt;/h3&gt;
&lt;h4&gt;Adding a Helm Repository&lt;/h4&gt;
&lt;p&gt;Helm uses repositories to store charts. You can add a popular repository, such as the Bitnami Helm charts, using:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm repo add bitnami https://charts.bitnami.com/bitnami
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To confirm the repository has been added:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm repo list
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Updating Helm Repositories&lt;/h4&gt;
&lt;p&gt;To fetch the latest charts from all added repositories, run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm repo update
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Searching for Helm Charts&lt;/h4&gt;
&lt;p&gt;To search for a specific application within your configured repositories:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm search repo nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Installing a Helm Chart&lt;/h4&gt;
&lt;p&gt;Once Helm is set up, you can deploy an application. For example, to deploy NGINX using the Bitnami Helm chart:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm install my-nginx bitnami/nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download the NGINX chart.&lt;/li&gt;
&lt;li&gt;Deploy the necessary Kubernetes resources.&lt;/li&gt;
&lt;li&gt;Assign the release name my-nginx.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Checking the Installation&lt;/h4&gt;
&lt;p&gt;List all active Helm releases:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm list
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Check the status of a specific release:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm status my-nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Uninstalling a Helm Release&lt;/h4&gt;
&lt;p&gt;To remove the my-nginx release and all associated resources:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm uninstall my-nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Understanding Helm Charts&lt;/h2&gt;
&lt;h3&gt;What is a Helm Chart?&lt;/h3&gt;
&lt;p&gt;A &lt;strong&gt;Helm chart&lt;/strong&gt; is a packaged application definition that contains Kubernetes resource templates and default configuration values. It allows you to deploy complex applications with a single command while keeping configurations modular and reusable.&lt;/p&gt;
&lt;p&gt;Each chart defines:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;What Kubernetes resources to deploy&lt;/strong&gt; (e.g., Deployments, Services, ConfigMaps).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How those resources should be configured&lt;/strong&gt; using a parameterized values file (&lt;code&gt;values.yaml&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dependencies and metadata&lt;/strong&gt; required for installation.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Structure of a Helm Chart&lt;/h3&gt;
&lt;p&gt;When you create a Helm chart, it follows a specific directory structure:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mychart/
│── charts/           # Directory for chart dependencies (other charts)
│── templates/        # Contains Kubernetes YAML templates
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   ├── _helpers.tpl  # Contains reusable template functions
│── Chart.yaml        # Metadata about the chart (name, version, description)
│── values.yaml       # Default configuration values for the chart
│── README.md         # Documentation about the chart

&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each file in this structure serves a specific purpose:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;Chart.yaml&lt;/code&gt;&lt;/strong&gt; – Contains metadata such as chart name, version, and description.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;values.yaml&lt;/code&gt;&lt;/strong&gt; – Defines default values that can be overridden during installation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;templates/&lt;/code&gt;&lt;/strong&gt; – Holds Kubernetes manifest templates using Helm’s templating syntax.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;charts/&lt;/code&gt;&lt;/strong&gt; – Stores dependencies (other charts required for deployment).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;README.md&lt;/code&gt;&lt;/strong&gt; – Documents how to use the chart.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Example: &lt;code&gt;Chart.yaml&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;Chart.yaml&lt;/code&gt; file provides information about the chart:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;apiVersion: v2
name: mychart
description: A sample Helm chart for Kubernetes
type: application
version: 1.0.0
appVersion: 1.16.0
name: The chart&apos;s name.
description: A brief description of what the chart does.
version: The chart version (used for versioning updates).
appVersion: The application version the chart deploys.
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Example: &lt;code&gt;values.yaml&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The values.yaml file defines default configuration values:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;replicaCount: 2

image:
  repository: nginx
  tag: latest
  pullPolicy: IfNotPresent

service:
  type: ClusterIP
  port: 80
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These values can be overridden when installing the chart using the &lt;code&gt;--set&lt;/code&gt; flag or a custom values file.&lt;/p&gt;
&lt;h3&gt;Example: &lt;code&gt;templates/deployment.yaml&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;A sample Kubernetes Deployment template using Helm&apos;s templating syntax:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-nginx
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: &amp;quot;{{ .Values.image.repository }}:{{ .Values.image.tag }}&amp;quot;
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - containerPort: {{ .Values.service.port }}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this template:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;{{ .Release.Name }}&lt;/code&gt; dynamically sets the release name.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{{ .Values.replicaCount }}&lt;/code&gt; pulls values from values.yaml.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{{ .Values.image.repository }}:{{ .Values.image.tag }}&lt;/code&gt; sets the container image dynamically.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Rendering Helm Templates&lt;/h3&gt;
&lt;p&gt;Before applying a Helm chart, you can preview how the templates will render using:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm template mychart
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Writing Your Own Helm Chart&lt;/h2&gt;
&lt;p&gt;Now that we understand Helm charts and their structure, let’s walk through the process of creating a custom Helm chart from scratch.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Step 1: Create a New Helm Chart&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;To generate a new Helm chart, use the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm create mychart
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command creates a new directory mychart/ with the standard Helm chart structure.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Step 2: Modify values.yaml&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Open &lt;code&gt;values.yaml&lt;/code&gt; and update it with custom values. Let’s modify it to deploy an NGINX web server with a LoadBalancer service:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;replicaCount: 3

image:
  repository: nginx
  tag: latest
  pullPolicy: IfNotPresent

service:
  type: LoadBalancer
  port: 80
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;replicaCount:&lt;/strong&gt; Defines how many replicas the deployment will create.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;image:&lt;/strong&gt; Configures the container image.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;service:&lt;/strong&gt; Sets the service type and port.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 3: Customize Deployment Template&lt;/h3&gt;
&lt;p&gt;Edit &lt;code&gt;templates/deployment.yaml&lt;/code&gt; to use Helm’s templating syntax:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-nginx
  labels:
    app: nginx
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: &amp;quot;{{ .Values.image.repository }}:{{ .Values.image.tag }}&amp;quot;
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - containerPort: {{ .Values.service.port }}
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;{{ .Release.Name }}&lt;/code&gt; dynamically assigns the release name.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{{ .Values.replicaCount }}&lt;/code&gt; references values from values.yaml.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{{ .Values.image.repository }}:{{ .Values.image.tag }}&lt;/code&gt; configures the image dynamically.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 4: Customize the Service Template&lt;/h3&gt;
&lt;p&gt;Edit &lt;code&gt;templates/service.yaml&lt;/code&gt; to configure the service:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;apiVersion: v1
kind: Service
metadata:
  name: {{ .Release.Name }}-nginx
spec:
  type: {{ .Values.service.type }}
  selector:
    app: nginx
  ports:
    - protocol: TCP
      port: {{ .Values.service.port }}
      targetPort: {{ .Values.service.port }}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 5: Package the Helm Chart&lt;/h3&gt;
&lt;p&gt;Once you&apos;ve modified the necessary files, package the chart:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm package mychart
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates a &lt;code&gt;.tgz&lt;/code&gt; archive of the chart, making it ready for distribution.&lt;/p&gt;
&lt;h3&gt;Step 6: Install the Chart&lt;/h3&gt;
&lt;p&gt;Deploy the chart to your Kubernetes cluster:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm install my-nginx ./mychart
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Parses templates.&lt;/li&gt;
&lt;li&gt;Replaces placeholders with values from values.yaml.&lt;/li&gt;
&lt;li&gt;Applies the resources to Kubernetes.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 7: Verify the Deployment&lt;/h3&gt;
&lt;p&gt;Check the deployed resources:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm list
kubectl get pods
kubectl get svc
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 8: Uninstall the Chart&lt;/h3&gt;
&lt;p&gt;To remove the deployment, use:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm uninstall my-nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Deploying Applications with Helm&lt;/h2&gt;
&lt;p&gt;Once you&apos;ve created or downloaded a Helm chart, you can use Helm to deploy and manage applications in your Kubernetes cluster. This section will walk through the deployment process, including installation, upgrades, rollbacks, and uninstallation.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Step 1: Installing a Helm Chart&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;To deploy an application using Helm, use the &lt;code&gt;helm install&lt;/code&gt; command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm install my-nginx ./mychart
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;my-nginx&lt;/code&gt; is the release name (a unique identifier for this deployment).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;./mychart&lt;/code&gt; is the path to the Helm chart.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you are installing a chart from a repository, such as Bitnami, use:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-nginx bitnami/nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Pulls the nginx chart from the Bitnami repository.&lt;/li&gt;
&lt;li&gt;Deploys NGINX to the Kubernetes cluster.&lt;/li&gt;
&lt;li&gt;Creates a Helm release named my-nginx.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 2: Verifying the Deployment&lt;/h3&gt;
&lt;p&gt;Once the chart is installed, verify that the release is active:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm list
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will output something like:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;NAME        NAMESPACE   REVISION    UPDATED                  STATUS      CHART        APP VERSION
my-nginx    default     1           2024-02-16 10:00:00     deployed    nginx-1.2.3  1.21.6
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can check the detailed status of a release:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm status my-nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To view the created Kubernetes resources:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;kubectl get pods
kubectl get svc
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 3: Customizing Helm Releases&lt;/h3&gt;
&lt;p&gt;Helm allows you to override default values using the &lt;code&gt;--set&lt;/code&gt; flag or a custom values file.&lt;/p&gt;
&lt;h4&gt;Using the &lt;code&gt;--set&lt;/code&gt; Flag&lt;/h4&gt;
&lt;p&gt;You can override individual values like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm install my-nginx bitnami/nginx --set replicaCount=3
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Using a Custom values.yaml File&lt;/h4&gt;
&lt;p&gt;To provide multiple custom values, create a &lt;code&gt;my-values.yaml&lt;/code&gt; file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;replicaCount: 3
service:
  type: LoadBalancer
  port: 8080
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, deploy the chart with:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm install my-nginx bitnami/nginx -f my-values.yaml
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 4: Upgrading a Helm Release&lt;/h3&gt;
&lt;p&gt;If you need to modify a running deployment, use the helm upgrade command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm upgrade my-nginx bitnami/nginx --set replicaCount=5
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To upgrade using a modified values file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm upgrade my-nginx bitnami/nginx -f my-values.yaml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This updates the deployment while keeping existing resources intact.&lt;/p&gt;
&lt;h3&gt;Step 5: Rolling Back to a Previous Version&lt;/h3&gt;
&lt;p&gt;Helm maintains a history of releases, allowing you to roll back if needed.&lt;/p&gt;
&lt;p&gt;List the release history:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm history my-nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Roll back to a specific revision:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm rollback my-nginx 1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 6: Uninstalling a Helm Release&lt;/h3&gt;
&lt;p&gt;To remove a Helm deployment and all its associated resources, run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm uninstall my-nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To confirm deletion:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm list
kubectl get all
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Helm Best Practices&lt;/h2&gt;
&lt;p&gt;Using Helm effectively requires following best practices to ensure maintainability, security, and scalability of deployments. This section outlines key strategies for optimizing Helm usage in production environments.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;1. Organizing Values in &lt;code&gt;values.yaml&lt;/code&gt; for Clarity&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;A well-structured &lt;code&gt;values.yaml&lt;/code&gt; file improves readability and maintainability.&lt;/p&gt;
&lt;h4&gt;✅ &lt;strong&gt;Good Example: Structured and Documented&lt;/strong&gt;&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;replicaCount: 3 # Number of replicas for high availability

image:
  repository: nginx
  tag: latest
  pullPolicy: IfNotPresent # Pull policy to optimize image fetching

service:
  type: LoadBalancer
  port: 80 # Publicly exposed service port

resources:
  limits:
    cpu: 500m
    memory: 256Mi
  requests:
    cpu: 250m
    memory: 128Mi
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;❌ Bad Example: Unstructured and Unclear&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;replicaCount: 3
image: nginx:latest
serviceType: LoadBalancer
port: 80
cpu: 500m
memory: 256Mi
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;No clear nesting.&lt;/li&gt;
&lt;li&gt;Missing descriptions for future maintainers.&lt;/li&gt;
&lt;li&gt;Harder to override values at a granular level.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. Using helm dependency for Managing Dependencies&lt;/h3&gt;
&lt;p&gt;If your chart depends on other charts (e.g., a database), declare them in Chart.yaml:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;dependencies:
  - name: postgresql
    version: &amp;quot;12.1.3&amp;quot;
    repository: &amp;quot;https://charts.bitnami.com/bitnami&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, update dependencies before installing:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm dependency update
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This ensures that all required subcharts are installed and properly versioned.&lt;/p&gt;
&lt;h3&gt;3. Leveraging helm secrets for Sensitive Values&lt;/h3&gt;
&lt;p&gt;Avoid storing credentials in values.yaml. Instead, use Helm Secrets to encrypt sensitive values.&lt;/p&gt;
&lt;p&gt;Install the Helm Secrets plugin:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm plugin install https://github.com/zachomedia/helm-secrets
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Encrypt sensitive values using SOPS:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;sops --encrypt --in-place my-values.yaml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install a chart using encrypted values:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm install my-app ./mychart -f my-values.yaml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This ensures secrets are not stored in plaintext inside version control.&lt;/p&gt;
&lt;h3&gt;4. Automating Helm Deployments in CI/CD Pipelines&lt;/h3&gt;
&lt;p&gt;Integrate Helm with CI/CD tools like GitHub Actions, GitLab CI/CD, or ArgoCD to automate deployments.&lt;/p&gt;
&lt;h4&gt;Example GitHub Actions Workflow for Helm&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;name: Deploy Helm Chart

on:
  push:
    branches:
      - main

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3

      - name: Install Helm
        run: |
          curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

      - name: Deploy to Kubernetes
        run: |
          helm upgrade --install my-app ./mychart --namespace prod
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This automates deployments whenever code is pushed to the main branch.&lt;/p&gt;
&lt;h3&gt;5. Keeping Charts Versioned and Documented&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Use semantic versioning in &lt;code&gt;Chart.yaml&lt;/code&gt; (version: 1.2.0).&lt;/li&gt;
&lt;li&gt;Document all available values in &lt;code&gt;README.md&lt;/code&gt;.
Maintain a &lt;code&gt;CHANGELOG.md&lt;/code&gt; to track modifications.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;#6. Managing Multiple Environments (Dev, Staging, Prod)&lt;/h2&gt;
&lt;p&gt;Helm allows environment-specific values with separate values files:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm install my-app ./mychart -f values-dev.yaml
helm install my-app ./mychart -f values-prod.yaml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This ensures different configurations for testing and production.&lt;/p&gt;
&lt;h3&gt;7. Helm Security Considerations&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Avoid running Helm with cluster-wide privileges.&lt;/li&gt;
&lt;li&gt;Restrict Helm Release Names to prevent namespace conflicts.&lt;/li&gt;
&lt;li&gt;Use RBAC policies to limit Helm access.
Regularly update Helm and chart dependencies to patch vulnerabilities.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Organize values.yaml clearly for maintainability.&lt;/li&gt;
&lt;li&gt;Use helm dependency to manage subcharts.&lt;/li&gt;
&lt;li&gt;Secure sensitive values with helm secrets and encryption.&lt;/li&gt;
&lt;li&gt;Automate Helm deployments using CI/CD.&lt;/li&gt;
&lt;li&gt;Maintain versioning, documentation, and separate environments.&lt;/li&gt;
&lt;li&gt;Follow security best practices to protect Kubernetes resources.&lt;/li&gt;
&lt;li&gt;In the next section, we’ll discuss Helm’s role in large-scale production deployments and how to integrate it with GitOps tools like ArgoCD and Flux.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Helm in Production: Managing Complexity at Scale&lt;/h2&gt;
&lt;p&gt;As organizations scale their Kubernetes deployments, managing Helm charts effectively in production becomes crucial. This section explores how Helm integrates with GitOps tools, supports multi-environment management, and follows best practices for high availability and security.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;1. Using GitOps with Helm (ArgoCD &amp;amp; Flux)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;GitOps&lt;/strong&gt; enables declarative infrastructure management, where Helm charts are stored in Git repositories and automatically deployed using tools like &lt;strong&gt;ArgoCD&lt;/strong&gt; and &lt;strong&gt;Flux&lt;/strong&gt;.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Deploying Helm Charts with ArgoCD&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;ArgoCD monitors a Git repository and applies changes automatically.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Install ArgoCD&lt;/strong&gt;:&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Deploy a Helm Chart with ArgoCD:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-helm-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/helm-charts.git
    targetRevision: main
    path: mychart
    helm:
      valueFiles:
        - values-prod.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: prod
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Apply the application manifest:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;kubectl apply -f my-helm-app.yaml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;ArgoCD will now continuously sync the Helm chart with the Kubernetes cluster.&lt;/p&gt;
&lt;h4&gt;Using FluxCD for Helm Deployments&lt;/h4&gt;
&lt;p&gt;FluxCD can also automate Helm deployments:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;flux create source git my-helm-repo \
  --url=https://github.com/my-org/helm-charts.git \
  --branch=main

flux create helmrelease my-app \
  --source=GitRepository/my-helm-repo \
  --chart=mychart \
  --namespace=prod
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;GitOps&lt;/strong&gt; ensures:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Automated rollouts &amp;amp; rollbacks when changes are pushed to Git.&lt;/li&gt;
&lt;li&gt;Version-controlled infrastructure for reproducibility.&lt;/li&gt;
&lt;li&gt;Improved collaboration by managing Helm charts as code.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. Managing Multi-Cluster Deployments&lt;/h3&gt;
&lt;p&gt;For enterprises running multiple Kubernetes clusters (e.g., dev, staging, prod), Helm enables consistent deployments across environments.&lt;/p&gt;
&lt;h4&gt;Option 1: Context Switching with kubectl&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;kubectl config use-context dev-cluster
helm install my-app ./mychart --namespace dev

kubectl config use-context prod-cluster
helm install my-app ./mychart --namespace prod
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Option 2: Using Helmfile for Multi-Cluster Deployments&lt;/h4&gt;
&lt;p&gt;Helmfile allows managing multiple Helm releases in a declarative format.&lt;/p&gt;
&lt;p&gt;Example helmfile.yaml:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;releases:
  - name: my-app-dev
    namespace: dev
    chart: ./mychart
    values:
      - values-dev.yaml

  - name: my-app-prod
    namespace: prod
    chart: ./mychart
    values:
      - values-prod.yaml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Deploy all environments at once:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helmfile apply
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Ensuring High Availability and Reliability&lt;/h3&gt;
&lt;p&gt;Use Helm Hooks: Automate pre-install and post-install tasks.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;annotations:
  &amp;quot;helm.sh/hook&amp;quot;: pre-install
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Enable Readiness and Liveness Probes to ensure application health:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;readinessProbe:
  httpGet:
    path: /
    port: 80
  initialDelaySeconds: 5
  periodSeconds: 10
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Use Rolling Updates with strategy to prevent downtime:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Helm Security Best Practices for Production&lt;/h3&gt;
&lt;p&gt;Restrict Helm Permissions using Role-Based Access Control (RBAC):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: prod
  name: helm-user
rules:
  - apiGroups: [&amp;quot;*&amp;quot;]
    resources: [&amp;quot;deployments&amp;quot;, &amp;quot;services&amp;quot;]
    verbs: [&amp;quot;get&amp;quot;, &amp;quot;list&amp;quot;, &amp;quot;create&amp;quot;, &amp;quot;update&amp;quot;, &amp;quot;delete&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Avoid Storing Secrets in values.yaml:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Use Kubernetes Secrets and refer to them in Helm templates.
En- crypt secrets with SOPS or use External Secrets Operator.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Implement Image Scanning:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Use tools like Trivy or Anchore to scan Helm charts and container images.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Regularly Update Helm and Charts:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Ensure Helm CLI and chart dependencies are up to date.&lt;/li&gt;
&lt;li&gt;Use helm dependency update to pull the latest versions.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. Monitoring and Logging Helm Deployments&lt;/h3&gt;
&lt;p&gt;Track Helm Releases:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm list --all-namespaces
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Monitor Deployments with Prometheus &amp;amp; Grafana:&lt;/p&gt;
&lt;p&gt;Install Prometheus using Helm:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/prometheus
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Integrate with Grafana for dashboard visualization.&lt;/h4&gt;
&lt;p&gt;Use Helm Logs to Debug Issues:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm get manifest my-app
helm get values my-app
helm get notes my-app
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;GitOps tools (ArgoCD, Flux) enable automated Helm deployments.&lt;/li&gt;
&lt;li&gt;Multi-cluster management can be streamlined with Helmfile or Helm contexts.&lt;/li&gt;
&lt;li&gt;High availability practices ensure smooth rolling updates and failovers.&lt;/li&gt;
&lt;li&gt;Security best practices include using RBAC, encrypted secrets, and image scanning.&lt;/li&gt;
&lt;li&gt;Monitoring tools like Prometheus and Grafana help track Helm deployments.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;9. Conclusion and Next Steps&lt;/h2&gt;
&lt;p&gt;Helm simplifies Kubernetes application deployment, making it easier to manage complex workloads with reusable, version-controlled charts. By leveraging Helm, teams can standardize configurations, automate deployments, and integrate with GitOps workflows to achieve reliable and scalable Kubernetes operations.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Key Takeaways&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Helm is the Kubernetes Package Manager&lt;/strong&gt; – It streamlines application deployments by packaging Kubernetes resources into reusable Helm charts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Charts Provide Flexibility&lt;/strong&gt; – Using &lt;code&gt;values.yaml&lt;/code&gt;, teams can easily override configurations without modifying templates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Helm Supports Versioning &amp;amp; Rollbacks&lt;/strong&gt; – The ability to upgrade and roll back releases ensures stability and rapid recovery.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automation &amp;amp; CI/CD Integration&lt;/strong&gt; – Helm works seamlessly with GitOps tools like &lt;strong&gt;ArgoCD&lt;/strong&gt; and &lt;strong&gt;FluxCD&lt;/strong&gt; to automate deployments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security &amp;amp; Best Practices Matter&lt;/strong&gt; – Implement &lt;strong&gt;RBAC&lt;/strong&gt;, use &lt;strong&gt;secrets management&lt;/strong&gt;, and ensure &lt;strong&gt;chart dependencies&lt;/strong&gt; are up to date to maintain a secure and efficient Helm workflow.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitoring &amp;amp; Debugging Are Essential&lt;/strong&gt; – Use &lt;strong&gt;Prometheus&lt;/strong&gt;, &lt;strong&gt;Grafana&lt;/strong&gt;, and Helm’s built-in commands (&lt;code&gt;helm list&lt;/code&gt;, &lt;code&gt;helm get&lt;/code&gt;) to track deployments and troubleshoot issues.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Next Steps: Continue Learning Helm&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Now that you understand Helm’s capabilities, here are some next steps to deepen your knowledge and practical experience:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Explore Official Helm Documentation&lt;/strong&gt;&lt;br&gt;
📌 &lt;a href=&quot;https://helm.sh/docs/&quot;&gt;Helm Docs&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Deploy Real-World Applications with Helm&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Try deploying &lt;strong&gt;WordPress&lt;/strong&gt;, &lt;strong&gt;PostgreSQL&lt;/strong&gt;, or &lt;strong&gt;Redis&lt;/strong&gt; with Helm charts from &lt;a href=&quot;https://artifacthub.io/&quot;&gt;Artifact Hub&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Example:&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-wordpress bitnami/wordpress
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Experiment with Custom Helm Charts&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Modify an existing chart or build one from scratch.&lt;/li&gt;
&lt;li&gt;Deploy it to different environments using separate &lt;code&gt;values.yaml&lt;/code&gt; files.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Integrate Helm with a CI/CD Pipeline&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Set up GitHub Actions, GitLab CI/CD, or Jenkins to automate Helm deployments.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Learn Advanced Helm Features&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Helm Hooks&lt;/strong&gt;: Automate tasks before/after deployments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Helm Subcharts&lt;/strong&gt;: Manage dependencies efficiently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Helm Secrets&lt;/strong&gt;: Encrypt sensitive configurations.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Follow Helm &amp;amp; Kubernetes Communities&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Join the &lt;strong&gt;CNCF Slack&lt;/strong&gt; (#helm-users channel).&lt;/li&gt;
&lt;li&gt;Follow Kubernetes and Helm GitHub discussions for the latest updates.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Helm is an essential tool for Kubernetes administrators and DevOps teams looking to optimize deployment workflows. Whether you are deploying simple microservices or complex cloud-native applications, Helm provides the flexibility, automation, and reliability needed to scale efficiently.&lt;/p&gt;
&lt;p&gt;Start experimenting with Helm today and take your Kubernetes skills to the next level!&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Additional Resources&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Helm Charts Repository&lt;/strong&gt;: &lt;a href=&quot;https://artifacthub.io/&quot;&gt;Artifact Hub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kubernetes Documentation&lt;/strong&gt;: &lt;a href=&quot;https://kubernetes.io/&quot;&gt;Kubernetes.io&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ArgoCD for Helm&lt;/strong&gt;: &lt;a href=&quot;https://argo-cd.readthedocs.io/&quot;&gt;ArgoCD Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;FluxCD for Helm&lt;/strong&gt;: &lt;a href=&quot;https://fluxcd.io/&quot;&gt;FluxCD Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Helm Security Best Practices&lt;/strong&gt;: &lt;a href=&quot;https://helm.sh/docs/topics/security/&quot;&gt;Helm Security Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Crash Course on Developing AI Applications with LangChain</title><link>https://iceberglakehouse.com/posts/2025-02-crash-course-on-langchain/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-02-crash-course-on-langchain/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Sat, 01 Feb 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_langchain&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse-benefits-solu&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Large Language Models (LLMs) have revolutionized the way developers build AI-powered applications, from chatbots to intelligent search systems. However, managing LLM interactions effectively: structuring prompts, handling memory, and integrating external tools, can be complex. This is where &lt;strong&gt;LangChain&lt;/strong&gt; comes in.&lt;/p&gt;
&lt;p&gt;LangChain is an open-source framework designed to simplify working with LLMs, enabling developers to create powerful AI applications with ease. By providing a modular approach, LangChain allows you to compose &lt;strong&gt;prompt templates, chains, memory, and agents&lt;/strong&gt; to build flexible and scalable solutions.&lt;/p&gt;
&lt;p&gt;In this guide, we&apos;ll introduce you to &lt;strong&gt;LangChain&lt;/strong&gt; and its companion libraries, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;langchain_community&lt;/code&gt;: A collection of core integrations and utilities.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;langchain_openai&lt;/code&gt;: A dedicated library for working with OpenAI models.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We&apos;ll walk you through key LangChain concepts, installation steps, and practical code examples to help you get started. Whether you&apos;re looking to build chatbots, AI-powered search engines, or decision-making agents, this guide will give you the foundation you need to start developing with LangChain.&lt;/p&gt;
&lt;h2&gt;What is LangChain?&lt;/h2&gt;
&lt;p&gt;LangChain is an open-source framework that simplifies building applications powered by Large Language Models (LLMs). Instead of manually handling prompts, API calls, and responses, LangChain provides a structured way to &lt;strong&gt;chain together different components&lt;/strong&gt; such as prompts, memory, and external tools.&lt;/p&gt;
&lt;h3&gt;Why Use LangChain?&lt;/h3&gt;
&lt;p&gt;Without LangChain, interacting with an LLM typically involves:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Formatting a prompt manually.&lt;/li&gt;
&lt;li&gt;Sending the request to an API (e.g., OpenAI, Cohere).&lt;/li&gt;
&lt;li&gt;Parsing the response and deciding the next action.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;LangChain automates and streamlines these steps, making it easier to build complex AI applications with minimal effort.&lt;/p&gt;
&lt;h3&gt;Key Use Cases&lt;/h3&gt;
&lt;p&gt;LangChain is widely used for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Chatbots &amp;amp; Virtual Assistants&lt;/strong&gt; – Retaining conversation context and improving responses.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retrieval-Augmented Generation (RAG)&lt;/strong&gt; – Enhancing LLM responses by fetching external data sources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Processing &amp;amp; Summarization&lt;/strong&gt; – Analyzing and summarizing large documents.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI Agents&lt;/strong&gt; – Creating autonomous agents that interact with external APIs and databases.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By leveraging LangChain’s modular architecture, you can integrate various &lt;strong&gt;models, tools, and memory mechanisms&lt;/strong&gt; to build dynamic AI-driven applications.&lt;/p&gt;
&lt;h2&gt;Core Concepts in LangChain&lt;/h2&gt;
&lt;p&gt;LangChain is built around a modular architecture that allows developers to compose different components into a pipeline. Here are some of the key concepts you need to understand when working with LangChain:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Prompt Templates&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Prompt templates help structure the input given to an LLM. Instead of writing static prompts, you can create dynamic templates that format user inputs into well-structured queries.&lt;/p&gt;
&lt;h4&gt;Example:&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.prompts import PromptTemplate

template = PromptTemplate(
    input_variables=[&amp;quot;topic&amp;quot;],
    template=&amp;quot;Explain {topic} in simple terms.&amp;quot;
)

formatted_prompt = template.format(topic=&amp;quot;LangChain&amp;quot;)
print(formatted_prompt)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This ensures that every input follows a structured format before being passed to the model.&lt;/p&gt;
&lt;h3&gt;2. LLMs and Model Wrappers&lt;/h3&gt;
&lt;p&gt;LangChain provides an easy way to interface with different LLM providers like OpenAI, Hugging Face, and more.&lt;/p&gt;
&lt;h4&gt;Example:&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain_openai import OpenAI

llm = OpenAI(api_key=&amp;quot;your_api_key&amp;quot;)
response = llm(&amp;quot;What is LangChain?&amp;quot;)
print(response)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This allows you to seamlessly query the LLM without worrying about API details.&lt;/p&gt;
&lt;h3&gt;3. Chains&lt;/h3&gt;
&lt;p&gt;Chains allow you to combine multiple components (e.g., a prompt template and an LLM) into a single workflow.&lt;/p&gt;
&lt;h4&gt;Example:&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.chains import LLMChain

llm_chain = LLMChain(llm=llm, prompt=template)
response = llm_chain.run(&amp;quot;machine learning&amp;quot;)
print(response)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, the prompt is formatted and automatically passed to the LLM, reducing boilerplate code.&lt;/p&gt;
&lt;h3&gt;4. Memory&lt;/h3&gt;
&lt;p&gt;Memory allows your application to retain context between interactions, which is crucial for chatbots and multi-turn conversations.&lt;/p&gt;
&lt;h4&gt;Example:&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()
memory.save_context({&amp;quot;input&amp;quot;: &amp;quot;Hello&amp;quot;}, {&amp;quot;output&amp;quot;: &amp;quot;Hi, how can I help you?&amp;quot;})
print(memory.load_memory_variables({}))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With memory, LangChain can track past interactions and use them to generate more coherent responses.&lt;/p&gt;
&lt;h3&gt;5. Agents and Tools&lt;/h3&gt;
&lt;p&gt;Agents allow an LLM to make decisions dynamically. Instead of following a predefined sequence, an agent determines which tool to call based on the user’s query.&lt;/p&gt;
&lt;h4&gt;Example:&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.agents import initialize_agent, AgentType
from langchain.tools import Tool

def add_numbers(a, b):
    return a + b

tool = Tool(
    name=&amp;quot;Calculator&amp;quot;,
    func=add_numbers,
    description=&amp;quot;Adds two numbers.&amp;quot;
)

agent = initialize_agent(
    tools=[tool],
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION
)

response = agent.run(&amp;quot;What is 3 + 5?&amp;quot;)
print(response)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This enables an LLM to call functions, fetch data, or interact with APIs to generate more intelligent responses.&lt;/p&gt;
&lt;p&gt;By understanding these core concepts, you can start building more structured and powerful AI applications with LangChain. In the next section, we’ll set up LangChain and its companion libraries to start developing real-world applications.&lt;/p&gt;
&lt;h2&gt;Installing LangChain and Companion Libraries&lt;/h2&gt;
&lt;p&gt;Before we start building with LangChain, we need to install the necessary packages. LangChain is modular, meaning that different functionalities are split across separate libraries. The main ones you&apos;ll need are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;langchain&lt;/code&gt;&lt;/strong&gt; – The core LangChain library.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;langchain_community&lt;/code&gt;&lt;/strong&gt; – A collection of integrations for third-party tools and services.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;langchain_openai&lt;/code&gt;&lt;/strong&gt; – A dedicated package for working with OpenAI models.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;openai&lt;/code&gt;&lt;/strong&gt; – The OpenAI Python SDK for API access.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;1. Installing LangChain and Dependencies&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;You can install the required libraries using &lt;code&gt;pip&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;pip install langchain langchain_community langchain_openai openai
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command will install the core LangChain framework along with the OpenAI integration.&lt;/p&gt;
&lt;h3&gt;2. Setting Up an OpenAI API Key&lt;/h3&gt;
&lt;p&gt;If you plan to use OpenAI models, you’ll need an API key. Follow these steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Sign up at OpenAI.&lt;/li&gt;
&lt;li&gt;Navigate to your API settings and generate an API key.&lt;/li&gt;
&lt;li&gt;Store your API key securely.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can set your API key in an environment variable:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;export OPENAI_API_KEY=&amp;quot;your_api_key_here&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or pass it directly in your code:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import os
os.environ[&amp;quot;OPENAI_API_KEY&amp;quot;] = &amp;quot;your_api_key_here&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Verifying the Installation&lt;/h3&gt;
&lt;p&gt;To test if everything is installed correctly, run the following Python script:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain_openai import OpenAI

llm = OpenAI(api_key=&amp;quot;your_api_key_here&amp;quot;)
response = llm(&amp;quot;Say hello in French.&amp;quot;)
print(response)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you see the response &amp;quot;Bonjour!&amp;quot;, then your setup is working properly.&lt;/p&gt;
&lt;h3&gt;4. Understanding the Role of Companion Libraries&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;langchain_community&lt;/strong&gt;: Contains integrations for databases, vector stores, and APIs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;langchain_openai&lt;/strong&gt;: A streamlined package for interacting with OpenAI&apos;s models.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Other integrations&lt;/strong&gt;: LangChain supports many LLM providers (Cohere, Hugging Face, etc.), which can be installed separately.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With LangChain and its dependencies installed, you&apos;re ready to start building AI-powered applications. In the next section, we&apos;ll explore how to use LangChain with OpenAI models and create structured workflows.&lt;/p&gt;
&lt;h2&gt;Setting Up and Using LangChain&lt;/h2&gt;
&lt;p&gt;Now that we have LangChain installed, let&apos;s explore how to use it for interacting with LLMs, structuring prompts, and building simple AI workflows.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;1. Connecting to an OpenAI Model&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The first step in using LangChain is to connect to an LLM. We&apos;ll start by using OpenAI&apos;s models.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Example: Basic Query to an OpenAI Model&lt;/strong&gt;&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain_openai import OpenAI

llm = OpenAI(api_key=&amp;quot;your_api_key_here&amp;quot;)

response = llm.invoke(&amp;quot;What is LangChain?&amp;quot;)
print(response)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This sends a query to OpenAI and prints the response. The invoke method is the recommended way to interact with LLMs in LangChain.&lt;/p&gt;
&lt;h3&gt;2. Working with Prompt Templates&lt;/h3&gt;
&lt;p&gt;A prompt template ensures that user input is formatted consistently before being sent to an LLM. This is useful when you need structured responses.&lt;/p&gt;
&lt;h4&gt;Example: Creating and Using a Prompt Template&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.prompts import PromptTemplate

template = PromptTemplate(
    input_variables=[&amp;quot;topic&amp;quot;],
    template=&amp;quot;Explain {topic} in simple terms.&amp;quot;
)

formatted_prompt = template.format(topic=&amp;quot;machine learning&amp;quot;)
print(formatted_prompt)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This generates a properly structured prompt:
&amp;quot;Explain machine learning in simple terms.&amp;quot;&lt;/p&gt;
&lt;p&gt;You can pass this formatted prompt to an LLM for processing.&lt;/p&gt;
&lt;h3&gt;3. Building a Basic Chain&lt;/h3&gt;
&lt;p&gt;A chain connects multiple components, such as prompts and LLMs, to automate workflows.&lt;/p&gt;
&lt;h4&gt;Example: Using a Chain to Generate Responses&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.chains import LLMChain

llm_chain = LLMChain(llm=llm, prompt=template)
response = llm_chain.run(&amp;quot;data science&amp;quot;)
print(response)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, LangChain automatically formats the prompt and sends it to the LLM, reducing manual effort.&lt;/p&gt;
&lt;h3&gt;4. Using Memory to Maintain Context&lt;/h3&gt;
&lt;p&gt;By default, LLMs don’t remember past interactions. LangChain provides memory components to store and retrieve conversation history.&lt;/p&gt;
&lt;h4&gt;Example: Storing Conversation History&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()

# Simulating a conversation
memory.save_context({&amp;quot;input&amp;quot;: &amp;quot;Hello&amp;quot;}, {&amp;quot;output&amp;quot;: &amp;quot;Hi, how can I help you?&amp;quot;})
memory.save_context({&amp;quot;input&amp;quot;: &amp;quot;What is LangChain?&amp;quot;}, {&amp;quot;output&amp;quot;: &amp;quot;LangChain is a framework for working with LLMs.&amp;quot;})

# Retrieving stored interactions
print(memory.load_memory_variables({}))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This ensures that previous interactions can be referenced in future queries.&lt;/p&gt;
&lt;h3&gt;5. Implementing an Agent with Tools&lt;/h3&gt;
&lt;p&gt;An agent allows LLMs to dynamically decide which tool to use for a given query. For example, we can create an agent that uses a calculator tool.&lt;/p&gt;
&lt;h4&gt;Example: Creating an Agent to Perform Calculations&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.agents import initialize_agent, AgentType
from langchain.tools import Tool

# Defining a simple addition function
def add_numbers(a, b):
    return a + b

tool = Tool(
    name=&amp;quot;Calculator&amp;quot;,
    func=add_numbers,
    description=&amp;quot;Adds two numbers.&amp;quot;
)

# Creating an agent with the tool
agent = initialize_agent(
    tools=[tool],
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION
)

# Running the agent
response = agent.run(&amp;quot;What is 5 + 7?&amp;quot;)
print(response)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This enables the LLM to recognize when to use the calculator tool instead of responding based purely on its pre-trained knowledge.&lt;/p&gt;
&lt;h3&gt;What’s Next?&lt;/h3&gt;
&lt;p&gt;Now that we&apos;ve covered basic LangChain functionalities, you can start experimenting with more advanced features like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Retrieval-Augmented Generation (RAG) –&lt;/strong&gt; Enhancing LLMs with external knowledge sources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vector Databases –&lt;/strong&gt; Storing and retrieving information efficiently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom Tools and APIs –&lt;/strong&gt; Expanding agents to interact with real-world data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the next section, we&apos;ll discuss best practices for using LangChain efficiently and how to scale applications for production use.&lt;/p&gt;
&lt;h2&gt;Best Practices and Next Steps&lt;/h2&gt;
&lt;p&gt;Now that you understand the basics of LangChain: connecting to LLMs, structuring prompts, using chains, memory, and agents, let’s discuss some best practices for building efficient and scalable applications.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;1. Optimize Prompt Engineering&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;clear and structured prompt templates&lt;/strong&gt; to get better responses from LLMs.&lt;/li&gt;
&lt;li&gt;Experiment with &lt;strong&gt;few-shot learning&lt;/strong&gt; by providing example inputs and outputs.&lt;/li&gt;
&lt;li&gt;Keep prompts &lt;strong&gt;concise&lt;/strong&gt; to reduce token usage and improve performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;Example: Few-Shot Prompting&lt;/strong&gt;&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.prompts import PromptTemplate

template = PromptTemplate(
    input_variables=[&amp;quot;word&amp;quot;],
    template=&amp;quot;Convert the following word into plural form: {word}\n\nExample:\n- dog -&amp;gt; dogs\n- cat -&amp;gt; cats\n- book -&amp;gt; ?&amp;quot;
)

print(template.format(word=&amp;quot;tree&amp;quot;))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Providing examples improves the model&apos;s accuracy.&lt;/p&gt;
&lt;h3&gt;2. Use Memory Efficiently&lt;/h3&gt;
&lt;p&gt;Only use conversation memory when necessary (e.g., chatbots).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Choose the right memory type:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ConversationBufferMemory –&lt;/strong&gt; Stores all conversation history.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ConversationSummaryMemory –&lt;/strong&gt; Summarizes past interactions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ConversationKGMemory –&lt;/strong&gt; Extracts key facts from a conversation.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example: Using Summary Memory&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.memory import ConversationSummaryMemory
from langchain_openai import OpenAI

llm = OpenAI(api_key=&amp;quot;your_api_key&amp;quot;)
memory = ConversationSummaryMemory(llm=llm)

memory.save_context({&amp;quot;input&amp;quot;: &amp;quot;I love pizza.&amp;quot;}, {&amp;quot;output&amp;quot;: &amp;quot;Pizza is a great choice!&amp;quot;})
summary = memory.load_memory_variables({})
print(summary)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This helps reduce storage while maintaining context.&lt;/p&gt;
&lt;h3&gt;3. Handle API Costs and Rate Limits&lt;/h3&gt;
&lt;p&gt;Use token-efficient prompts to reduce API costs.
Implement batch processing for multiple queries.
Monitor API usage with OpenAI’s rate limits in mind.&lt;/p&gt;
&lt;h4&gt;Example: Monitoring Token Usage&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain_openai import OpenAI

llm = OpenAI(api_key=&amp;quot;your_api_key&amp;quot;, model=&amp;quot;gpt-4&amp;quot;, max_tokens=100)
response = llm(&amp;quot;Summarize the history of AI in 50 words.&amp;quot;)
print(response)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Setting max_tokens prevents excessive token consumption.&lt;/p&gt;
&lt;h3&gt;4. Enhance LLMs with External Knowledge (RAG)&lt;/h3&gt;
&lt;p&gt;Retrieval-Augmented Generation (RAG) improves LLM responses by fetching external data instead of relying solely on pre-trained knowledge.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use vector databases like Pinecone or FAISS for document search.&lt;/li&gt;
&lt;li&gt;Fetch real-time data from APIs.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example: Querying an External Document&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings

# Load and embed documents
embeddings = OpenAIEmbeddings(api_key=&amp;quot;your_api_key&amp;quot;)
vectorstore = FAISS.load_local(&amp;quot;faiss_index&amp;quot;, embeddings)

# Query the knowledge base
docs = vectorstore.similarity_search(&amp;quot;What is LangChain?&amp;quot;, k=2)
print(docs)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This retrieves relevant documents to supplement the LLM’s response.&lt;/p&gt;
&lt;h3&gt;5. Scale Applications for Production&lt;/h3&gt;
&lt;p&gt;When moving from prototyping to production, consider:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Caching responses to avoid redundant API calls.&lt;/li&gt;
&lt;li&gt;Logging interactions for debugging and improvement.&lt;/li&gt;
&lt;li&gt;Implementing user authentication for secured access.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example: Implementing Response Caching&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.cache import InMemoryCache
from langchain.chains import LLMChain

llm_chain = LLMChain(llm=llm, prompt=template)
llm_chain.cache = InMemoryCache()  # Enable caching

response1 = llm_chain.run(&amp;quot;machine learning&amp;quot;)
response2 = llm_chain.run(&amp;quot;machine learning&amp;quot;)  # Cached response
print(response2)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Caching reduces API calls, improving performance and cost-efficiency.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;LangChain provides a powerful framework for building AI applications that leverage Large Language Models (LLMs). By combining &lt;strong&gt;prompt engineering, chains, memory, and agents&lt;/strong&gt;, LangChain simplifies the development process, making it easier to create &lt;strong&gt;chatbots, AI assistants, and retrieval-augmented generation (RAG) applications&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In this guide, we covered:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What LangChain is and why it’s useful.&lt;/li&gt;
&lt;li&gt;Core concepts like &lt;strong&gt;prompt templates, chains, memory, and agents&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;How to &lt;strong&gt;install and set up LangChain&lt;/strong&gt; along with &lt;code&gt;langchain_openai&lt;/code&gt; and &lt;code&gt;langchain_community&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Practical &lt;strong&gt;code examples&lt;/strong&gt; for using LangChain with OpenAI models.&lt;/li&gt;
&lt;li&gt;Best practices for &lt;strong&gt;optimizing prompts, managing memory, and reducing API costs&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;How to &lt;strong&gt;scale LangChain applications for production&lt;/strong&gt; using caching and external knowledge retrieval.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By applying these concepts, you can start building &lt;strong&gt;custom AI-powered solutions&lt;/strong&gt; with real-world impact.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Where to Go from Here?&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;If you&apos;re ready to take the next step, consider:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Building a LangChain Project&lt;/strong&gt; – Try creating a chatbot, document summarizer, or an AI-driven search engine.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Exploring Vector Databases&lt;/strong&gt; – Learn how to integrate Pinecone, FAISS, or ChromaDB for RAG applications.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Joining the Community&lt;/strong&gt; – Engage with other developers on &lt;a href=&quot;https://github.com/langchain-ai/langchain&quot;&gt;LangChain&apos;s GitHub&lt;/a&gt; or &lt;a href=&quot;https://discord.gg/langchain&quot;&gt;Discord&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;LangChain is continuously evolving, and staying updated with the latest features will help you build &lt;strong&gt;more advanced and efficient AI applications&lt;/strong&gt;. Start experimenting and bring your AI ideas to life!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_langchain&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse-benefits-solu&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The Data Lakehouse - The Benefits and Enhancing Implementation</title><link>https://iceberglakehouse.com/posts/2025-01-the-data-lakehouse-benefits-and-enhancing/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-01-the-data-lakehouse-benefits-and-enhancing/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Fri, 31 Jan 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse_benefts_solu&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse-benefits-solu&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;The &lt;strong&gt;data lakehouse&lt;/strong&gt; has been a significant topic in data architecture over the past several years. However, like any high-value trend, it’s easy to get caught up in the hype and lose sight of the &lt;strong&gt;real reasons&lt;/strong&gt; for adopting this new paradigm.&lt;/p&gt;
&lt;p&gt;In this article, I aim to &lt;strong&gt;clarify the key benefits of a lakehouse&lt;/strong&gt;, highlight the &lt;strong&gt;challenges organizations face in implementing one&lt;/strong&gt;, and explore &lt;strong&gt;practical solutions&lt;/strong&gt; to overcome those challenges.&lt;/p&gt;
&lt;h2&gt;The Problems We Are Trying to Solve For&lt;/h2&gt;
&lt;p&gt;Traditionally, running analytics directly on &lt;strong&gt;operational databases (OLTP systems)&lt;/strong&gt; is neither performant nor efficient, as it creates &lt;strong&gt;resource contention&lt;/strong&gt; with transactional workloads that power enterprise operations. The standard solution has been to &lt;strong&gt;offload this data into a data warehouse&lt;/strong&gt;, which optimizes storage for analytics, manages data efficiently, and provides a processing layer for analytical queries.&lt;/p&gt;
&lt;p&gt;However, not all data is structured or fits neatly into a data warehouse. Additionally, storing &lt;strong&gt;all structured data in a data warehouse can be cost-prohibitive&lt;/strong&gt;. As a result, an intermediate layer: a &lt;strong&gt;data lake&lt;/strong&gt;, is often introduced, where copies of data are stored for &lt;strong&gt;ad hoc analysis&lt;/strong&gt; on &lt;strong&gt;distributed storage systems&lt;/strong&gt; like &lt;strong&gt;Amazon S3, ADLS, MinIO, NetApp StorageGRID, Vast Data, Pure Storage, Nutanix, and others&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In large enterprises, different business units often choose &lt;strong&gt;different data warehouses&lt;/strong&gt;, leading to &lt;strong&gt;multiple copies&lt;/strong&gt; of the same data, inconsistently modeled across departments. This fragmentation introduces several challenges:&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;1. Consistency&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;With multiple copies, &lt;strong&gt;business metrics&lt;/strong&gt; can have &lt;strong&gt;different definitions and values&lt;/strong&gt; depending on which department’s data model you reference, leading to &lt;strong&gt;discrepancies in decision-making&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;2. Time to Insight&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;As &lt;strong&gt;data volumes grow&lt;/strong&gt; and the demand for &lt;strong&gt;real-time or near real-time insights&lt;/strong&gt; increases, excessive &lt;strong&gt;data movement&lt;/strong&gt; becomes a bottleneck. Even if individual transactions are fast, the cumulative impact of copying and processing delays data accessibility.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;3. Centralization Bottlenecks&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;To &lt;strong&gt;improve consistency&lt;/strong&gt;, some organizations centralize modeling in an &lt;strong&gt;enterprise-wide data warehouse&lt;/strong&gt; with &lt;strong&gt;department-specific data marts&lt;/strong&gt;. However, this centralization can create &lt;strong&gt;bottlenecks&lt;/strong&gt;, &lt;strong&gt;slowing down access to insights&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;4. Cost&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Every step of data movement incurs &lt;strong&gt;costs&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Compute resources&lt;/strong&gt; for processing,&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage costs&lt;/strong&gt; for redundant copies, and&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BI tool expenses&lt;/strong&gt; from multiple teams generating similar &lt;strong&gt;data extracts&lt;/strong&gt; across different tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;5. Governance&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Not all enterprise data resides in a &lt;strong&gt;data warehouse&lt;/strong&gt;. There will always be &lt;strong&gt;a long tail of data&lt;/strong&gt; in &lt;strong&gt;external systems&lt;/strong&gt;, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Partner-shared data&lt;/strong&gt;,&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data marketplaces&lt;/strong&gt;, or&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Regulatory-restricted environments&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Managing access to a &lt;strong&gt;holistic data picture&lt;/strong&gt; while maintaining &lt;strong&gt;governance and security&lt;/strong&gt; across &lt;strong&gt;distributed sources&lt;/strong&gt; is a significant challenge.&lt;/p&gt;
&lt;p&gt;This is where the &lt;strong&gt;data lakehouse&lt;/strong&gt; emerges as a &lt;strong&gt;solution&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;The Data Lakehouse Solution&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Data warehouses&lt;/strong&gt; provide essential &lt;strong&gt;data management&lt;/strong&gt; capabilities and &lt;strong&gt;ACID guarantees&lt;/strong&gt;, ensuring &lt;strong&gt;consistency and reliability&lt;/strong&gt; in analytics. However, these features have traditionally been &lt;strong&gt;absent from data lakes&lt;/strong&gt;, as data lakes are not inherently data platforms but &lt;strong&gt;repositories of raw data stored on open storage&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;If we &lt;strong&gt;bring data management and ACID transactions&lt;/strong&gt; to the &lt;strong&gt;data lake&lt;/strong&gt;, organizations can work with &lt;strong&gt;a single canonical copy&lt;/strong&gt; directly within the lake, eliminating the need to replicate data across &lt;strong&gt;multiple data warehouses&lt;/strong&gt;. This transformation turns the &lt;strong&gt;data lake into a data warehouse&lt;/strong&gt; - hence the term &lt;strong&gt;data lakehouse&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This is achieved by adopting:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Table Formats&lt;/strong&gt; like &lt;strong&gt;Apache Iceberg, Apache Hudi, Delta Lake, or Apache Paimon&lt;/strong&gt;, enabling &lt;strong&gt;Parquet files&lt;/strong&gt; to act as &lt;strong&gt;structured, ACID-compliant tables&lt;/strong&gt; optimized for analytics.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Catalogs&lt;/strong&gt; like &lt;strong&gt;Apache Polaris, Nessie, Apache Gravitino, Lakekeeper, and Unity&lt;/strong&gt;, which provide &lt;strong&gt;metadata tracking&lt;/strong&gt; for seamless data discovery and access.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Managed Catalog Services&lt;/strong&gt; (e.g., &lt;strong&gt;Dremio&lt;/strong&gt;), which &lt;strong&gt;automate data optimization and governance&lt;/strong&gt;, reducing unnecessary data movement.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;Key Benefits of a Lakehouse Approach&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;✅ &lt;strong&gt;Lower costs&lt;/strong&gt; by reducing &lt;strong&gt;data replication and processing overhead&lt;/strong&gt;.&lt;br&gt;
✅ &lt;strong&gt;Improved consistency&lt;/strong&gt; by maintaining &lt;strong&gt;a single source of truth&lt;/strong&gt;.&lt;br&gt;
✅ &lt;strong&gt;Faster time to insight&lt;/strong&gt; with &lt;strong&gt;direct access to analytics-ready data&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Challenges That Remain&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Despite its advantages, a lakehouse alone does not &lt;strong&gt;completely solve all challenges&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Migration Delays&lt;/strong&gt; – Moving existing data &lt;strong&gt;takes time&lt;/strong&gt;, delaying the &lt;strong&gt;full benefits&lt;/strong&gt; of a lakehouse.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Distributed Data Sources&lt;/strong&gt; – Not all data resides in the lakehouse; &lt;strong&gt;external data&lt;/strong&gt; remains a challenge.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BI Tool Extracts&lt;/strong&gt; – Users &lt;strong&gt;may still create&lt;/strong&gt; redundant &lt;strong&gt;isolated extracts&lt;/strong&gt;, increasing costs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is where the &lt;strong&gt;Dremio Lakehouse Platform&lt;/strong&gt; fills the gap.&lt;/p&gt;
&lt;h2&gt;The Dremio Solution&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com&quot;&gt;Dremio is a &lt;strong&gt;lakehouse platform&lt;/strong&gt;&lt;/a&gt; that integrates &lt;strong&gt;four key capabilities&lt;/strong&gt; into a &lt;strong&gt;holistic data integration solution&lt;/strong&gt;, addressing the remaining &lt;strong&gt;lakehouse challenges&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;1. High-Performance Federated Query Engine&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Best-in-class &lt;strong&gt;raw query performance&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Federates queries across &lt;strong&gt;lakehouse catalogs, data lakes, databases, and warehouses&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Provides a &lt;strong&gt;centralized experience&lt;/strong&gt; across &lt;strong&gt;disparate data sources&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;2. Semantic Layer&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Enables &lt;strong&gt;virtual data marts&lt;/strong&gt; without data duplication.&lt;/li&gt;
&lt;li&gt;Built-in &lt;strong&gt;wiki and search&lt;/strong&gt; for &lt;strong&gt;dataset documentation&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Standardizes &lt;strong&gt;business metrics and datasets&lt;/strong&gt; across all tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;3. Query Acceleration&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reflections&lt;/strong&gt; replace traditional &lt;strong&gt;materialized views and BI cubes&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Raw Reflections&lt;/strong&gt; (precomputed query results).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aggregate Reflections&lt;/strong&gt; (optimized aggregations for fast analytics).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automatic query acceleration&lt;/strong&gt;, with no effort required from analysts.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;4. Integrated Lakehouse Catalog&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Tracks and &lt;strong&gt;manages Apache Iceberg tables&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automates maintenance and cleanup&lt;/strong&gt; of data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Provides centralized, portable governance&lt;/strong&gt; across all queries.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;The Dremio Advantage&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;✅ &lt;strong&gt;Instant Lakehouse Benefits&lt;/strong&gt; – Get the advantages &lt;strong&gt;immediately&lt;/strong&gt;, even before full migration.&lt;br&gt;
✅ &lt;strong&gt;Improved Consistency&lt;/strong&gt; – Ensure &lt;strong&gt;a unified definition of business metrics&lt;/strong&gt;.&lt;br&gt;
✅ &lt;strong&gt;High-Performance Analytics&lt;/strong&gt; – Federated queries + &lt;strong&gt;Reflections&lt;/strong&gt; accelerate workloads.&lt;br&gt;
✅ &lt;strong&gt;Automated Management&lt;/strong&gt; – No &lt;strong&gt;manual cleanup&lt;/strong&gt; of lakehouse tables needed.&lt;br&gt;
✅ &lt;strong&gt;Centralized Governance&lt;/strong&gt; – Unified &lt;strong&gt;access control&lt;/strong&gt; across &lt;strong&gt;all tools and sources&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The &lt;strong&gt;data lakehouse&lt;/strong&gt; represents a transformative shift in data architecture, solving long-standing challenges around &lt;strong&gt;data consistency, cost, and accessibility&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;However, simply adopting a &lt;strong&gt;lakehouse format&lt;/strong&gt; isn’t enough. Organizations need &lt;strong&gt;a lakehouse solution that integrates data management, acceleration, and governance&lt;/strong&gt; to fully unlock the benefits.&lt;/p&gt;
&lt;p&gt;Dremio provides that &lt;strong&gt;missing piece&lt;/strong&gt; with:&lt;br&gt;
✅ &lt;strong&gt;Federated query capabilities&lt;/strong&gt;&lt;br&gt;
✅ &lt;strong&gt;A built-in semantic layer&lt;/strong&gt;&lt;br&gt;
✅ &lt;strong&gt;Automated query acceleration&lt;/strong&gt;&lt;br&gt;
✅ &lt;strong&gt;A fully managed lakehouse catalog&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;With &lt;strong&gt;Dremio&lt;/strong&gt;, organizations &lt;strong&gt;don’t just implement a lakehouse&lt;/strong&gt; - they &lt;strong&gt;enhance it&lt;/strong&gt;, unlocking its &lt;strong&gt;full potential&lt;/strong&gt; for faster insights, better decision-making, and long-term cost savings.&lt;/p&gt;
&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse_benefts_solu&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse-benefits-solu&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>2025 Comprehensive Guide to Apache Iceberg</title><link>https://iceberglakehouse.com/posts/2025-01-2025-comprehensive-apache-iceberg-guide/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-01-2025-comprehensive-apache-iceberg-guide/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2025-01-2025-guide-to-apache-iceberg/).
...</description><pubDate>Mon, 20 Jan 2025 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2025-01-2025-guide-to-apache-iceberg/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://university.dremio.com/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025-iceberg-comp-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025-iceberg-comp-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apache Iceberg had a monumental 2024, with significant announcements and advancements from major players like Dremio, Snowflake, Databricks, AWS, and other leading data platforms. The Iceberg ecosystem is evolving rapidly, making it essential for professionals to stay up-to-date with the latest innovations. To help navigate this ever-changing space, I’m introducing an annual guide dedicated to Apache Iceberg. This guide aims to provide a comprehensive overview of Iceberg, highlight key resources, and offer valuable insights for anyone looking to deepen their knowledge. Whether you’re just starting with Iceberg or are a seasoned user, this guide will serve as your go-to resource for 2025.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/migration-guide-for-apache-iceberg-lakehouses/&quot;&gt;Read this article for details on migrating to Apache Iceberg.&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;What is a Table Format?&lt;/h2&gt;
&lt;p&gt;A table format, often referred to as an “open table format” or “lakehouse table format,” is a foundational component of the data lakehouse architecture. This architecture is gaining popularity for its ability to address the complexities of modern data management. Table formats transform how data stored in collections of analytics-optimized Parquet files is accessed and managed. Instead of treating these files as standalone units to be opened and read individually, a table format enables them to function like traditional database tables, complete with ACID guarantees.&lt;/p&gt;
&lt;p&gt;With a table format, users can interact with data through SQL to create, read, update, and delete records, bringing the functionality of a data warehouse directly to the data lake. This capability allows enterprises to treat their data lake as a unified platform, supporting both data warehousing and data lake use cases. It also enables teams across an organization to work with a single copy of data in their tool of choice : whether for analytics, machine learning, or operational reporting , eliminating redundant data movements, reducing costs, and improving consistency across the enterprise.&lt;/p&gt;
&lt;p&gt;Currently, there are four primary table formats driving innovation in this space:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Apache Iceberg:&lt;/strong&gt; Originating from Netflix, this blog’s focus, Iceberg is known for its flexibility and robust support for big data operations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta Lake:&lt;/strong&gt; Developed by Databricks, it emphasizes simplicity and seamless integration with their ecosystem.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Hudi:&lt;/strong&gt; Created by Uber, Hudi focuses on real-time data ingestion and incremental processing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Paimon:&lt;/strong&gt; Emerging from the Apache Flink Project, Paimon is designed to optimize streaming and batch processing use cases.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each of these table formats plays a role in the evolving data lakehouse landscape, enabling organizations to unlock the full potential of their data lakehouse.&lt;/p&gt;
&lt;h2&gt;How Table Formats Work&lt;/h2&gt;
&lt;p&gt;At the core of every table format is a metadata layer that transforms collections of files into a table-like structure. This metadata serves as a blueprint for understanding the data, providing essential details such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Files included in the table:&lt;/strong&gt; Identifying the physical Parquet or similar files that make up the dataset.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partitioning scheme:&lt;/strong&gt; Detailing how the data is partitioned to optimize query performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema:&lt;/strong&gt; Defining the structure of the table, including column names, data types, and constraints.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snapshot history:&lt;/strong&gt; Tracking changes over time, such as additions, deletions, and updates to the table, enabling features like time travel and rollback.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This metadata acts as an entry point, allowing tools to treat the underlying files as a cohesive table. Instead of scanning all files in a directory, query engines use the metadata to understand the structure and contents of the table. Additionally, the metadata often includes statistics about partitions and individual files. These statistics enable advanced query optimization techniques, such as pruning or skipping files that are irrelevant to a specific query, significantly improving performance.&lt;/p&gt;
&lt;p&gt;While all table formats rely on metadata to bridge the gap between raw files and table functionality, each format structures and optimizes its metadata differently. These differences can influence performance, compatibility, and the features each format provides.&lt;/p&gt;
&lt;h2&gt;How Apache Iceberg’s Metadata is Structured&lt;/h2&gt;
&lt;p&gt;Apache Iceberg’s metadata structure is what enables it to transform raw data files into highly performant and queryable tables. This structure consists of several interrelated components, each designed to provide specific details about the table and optimize query performance. Here’s an overview of Iceberg’s key metadata elements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;metadata.json&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The metadata.json file is the primary entry point for understanding the table.&lt;/li&gt;
&lt;li&gt;This semi-structured JSON object contains information about the table’s schema, partitioning scheme, snapshot history, and other critical details.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Manifest List&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each snapshot in Iceberg has a corresponding Avro-based “manifest list.” This list contains rows representing each manifest (a group of files) that makes up the snapshot.&lt;/li&gt;
&lt;li&gt;Each row includes:
&lt;ul&gt;
&lt;li&gt;The file location of the manifest.&lt;/li&gt;
&lt;li&gt;Partition value information for the files in the manifest.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;This information allows query engines to prune unnecessary manifests and avoid scanning irrelevant partitions, improving query efficiency.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Manifests&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A manifest lists one or more Parquet files and includes statistics about each file, such as column summaries.&lt;/li&gt;
&lt;li&gt;These statistics allow query engines to determine whether a file contains data relevant to the query, enabling file skipping for improved performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Delete Files&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Delete files track records that have been deleted as part of “merge-on-read” updates. During queries, the engine reconciles these files with the base data, ensuring that deleted records are ignored.&lt;/li&gt;
&lt;li&gt;There is ongoing discussion about transitioning from delete files to a “deletion vector” approach, inspired by Delta Lake, where deletions are tracked using Puffin files. As of this writing, this proposal has not yet been implemented.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Puffin Files&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Puffin files are a format for tracking binary blobs and other metadata, designed to optimize queries for engines that choose to leverage them.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Partition Stats Files&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;These files summarize statistics at the partition level, enabling even greater optimization for queries that rely on partitioning.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;The Evolution of Iceberg’s Specification&lt;/h2&gt;
&lt;p&gt;Apache Iceberg’s specification is constantly evolving through community contributions and proposals. These innovations benefit the entire ecosystem, as improvements made by one platform are shared across others. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Partition Stats Files originated from work by Dremio to enhance query optimization.&lt;/li&gt;
&lt;li&gt;Puffin Files were introduced by the Trino community to improve how Iceberg tracks metadata.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This collaborative approach ensures that Apache Iceberg continues to evolve as a cutting-edge table format for modern data lakehouses.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-metadata-tables/&quot;&gt;Read this article on the Apache Iceberg Metadata tables.&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;The Role of Catalogs in Apache Iceberg&lt;/h2&gt;
&lt;p&gt;One of the key features of Apache Iceberg is its immutable file structure, which makes snapshot isolation possible. Every time the data or structure of a table changes, a new metadata.json file is generated. This immutability raises an important question: how does a tool know which metadata.json file is the latest one?&lt;/p&gt;
&lt;p&gt;This is where Lakehouse Catalogs come into play. A Lakehouse Catalog serves as an abstraction layer that tracks each table’s name and links it to the most recent metadata.json file. When a table’s data or structure is updated, the catalog is also updated to point to the new metadata.json file. This update is the final step in any transaction, ensuring that the change is completed successfully and meets the atomicity requirement of ACID compliance.&lt;/p&gt;
&lt;p&gt;Lakehouse Catalogs are distinct from Enterprise Data Catalogs or Metadata Catalogs, such as those provided by companies like Alation and Collibra. While Lakehouse Catalogs focus on managing the technical details of tables and transactions, enterprise data catalogs are designed for end-users. They act as tools to help users discover, understand, and request access to datasets across an organization, enhancing data governance and usability.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/the-evolution-of-apache-iceberg-catalogs/&quot;&gt;Read this article to learn more about Iceberg catalogs.&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;The Apache Iceberg REST Catalog Spec&lt;/h2&gt;
&lt;p&gt;As more catalog implementations emerged, each with unique features and APIs, interoperability between tools and catalogs became a significant challenge. This lack of a unified standard created a bottleneck for seamless table management and cross-platform compatibility.&lt;/p&gt;
&lt;p&gt;To address this issue and drive innovation, the REST Catalog specification was developed. Rather than requiring all catalog providers to adopt a standardized server-side implementation, the specification introduced a universal REST API interface. This approach ensures that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tools and systems can rely on a consistent, client-side library to interact with catalogs.&lt;/li&gt;
&lt;li&gt;Catalog providers maintain the flexibility to implement their server-side systems in ways that suit their needs, as long as they adhere to the standard REST endpoints outlined in the specification.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With the REST Catalog specification, interoperability and ease of integration have dramatically improved. This innovation allows developers and enterprises to adopt or build catalogs that align with their technical and business requirements while still being compatible with any tool that supports the REST API interface. This forward-thinking design has strengthened the role of catalogs in modern lakehouse architectures, ensuring that Iceberg tables remain accessible and manageable across diverse platforms.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/what-iceberg-rest-catalog-is-and-isnt-b4a6d056f493&quot;&gt;Read more about the Iceberg REST Spec in this article.&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Soft Deletes vs Hard Deleting Data&lt;/h2&gt;
&lt;p&gt;When working with table formats like Apache Iceberg, it’s important to understand how data deletion is handled. Unlike traditional databases, where deleted data is immediately removed from the storage layer, Iceberg follows a different approach to maintain snapshot isolation and enable features like time travel.&lt;/p&gt;
&lt;p&gt;When you execute a delete query, the data is not physically deleted. Instead:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A new snapshot is created where the deleted data is no longer present.&lt;/li&gt;
&lt;li&gt;The original data files remain intact because the old snapshots are still valid and accessible.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This approach allows users to query previous versions of the table using time travel, providing a powerful mechanism for auditing, debugging, and historical analysis.&lt;/p&gt;
&lt;p&gt;However, this also means that data marked for deletion continues to occupy storage until it is physically removed. To address this, snapshot expiration procedures are performed during table maintenance using tools like Spark or Dremio. These procedures:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Invalidate old snapshots that are no longer needed.&lt;/li&gt;
&lt;li&gt;Remove the associated data files from storage, freeing up space.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Regular maintenance is a critical part of managing Iceberg tables to ensure storage efficiency and maintain optimal performance while leveraging the benefits of its snapshot-based architecture.&lt;/p&gt;
&lt;h2&gt;Optimizing Iceberg Data&lt;/h2&gt;
&lt;h3&gt;Minimizing Storage&lt;/h3&gt;
&lt;p&gt;The first step in reducing storage costs is selecting the right compression algorithm for your data. Compression not only reduces the amount of space required to store data but can also improve performance by accelerating data transfer across networks. These compression settings can typically be adjusted at both the table and query engine levels to suit your specific use case.&lt;/p&gt;
&lt;h3&gt;Improving Performance&lt;/h3&gt;
&lt;p&gt;Optimizing performance largely depends on how data is distributed across files. This can be achieved through regular maintenance procedures using tools like Spark or Dremio. These optimizations result in two key outcomes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Compaction&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reduces the number of small files and consolidates delete files into fewer, larger files.&lt;/li&gt;
&lt;li&gt;Minimizes the number of I/O operations required during query execution, leading to faster reads.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Clustering/Sorting&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reorganizes data to co-locate similar records within the same files based on commonly queried fields.&lt;/li&gt;
&lt;li&gt;Allows query engines to skip more files during a query, as the data being searched for is concentrated in a smaller subset of files.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By leveraging these strategies, Iceberg users can maintain a balance between efficient storage and fast query performance, ensuring their data lakehouse operates at peak efficiency. Regular maintenance is essential for reaping the full benefits of these optimizations.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/guide-to-maintaining-an-apache-iceberg-lakehouse/&quot;&gt;Read this article for more detail on optimizing Apache Iceberg tables.&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Hands-on Tutorials&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025comp-iceberg-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Hands-on Intro with Apache iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025comp-iceberg-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Intro to Apache Iceberg, Nessie and Dremio on your Laptop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-json-csv-and-parquet-to-dashboards-with-apache-iceberg-and-dremio/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025comp-iceberg-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;JSON/CSV/Parquet to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-mongodb-to-dashboards-with-dremio-and-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025comp-iceberg-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;From MongoDB to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-sqlserver-to-dashboards-with-dremio-and-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025comp-iceberg-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;From SQLServer to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-postgres-to-dashboards-with-dremio-and-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025comp-iceberg-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;From Postgres to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/experience-the-dremio-lakehouse-hands-on-with-dremio-nessie-iceberg-data-as-code-and-dbt/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025comp-iceberg-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Mongo/Postgres to Apache Iceberg to BI Dashboard using Git for Data and DBT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-elasticsearch-to-dashboards-with-dremio-and-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025comp-iceberg-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Elasticsearch to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-mysql-to-dashboards-with-dremio-and-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025comp-iceberg-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;MySQL to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-apache-druid-to-dashboards-with-dremio-and-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025comp-iceberg-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Apache Druid to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/bi-dashboards-with-apache-iceberg-using-aws-glue-and-apache-superset/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025comp-iceberg-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;BI Dashboards with Apache Iceberg Using AWS Glue and Apache Superset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/end-to-end-basic-data-engineering-tutorial-spark-dremio-superset-c076a56eaa75&quot;&gt;End-to-End Basic Data Engineering Tutorial (Spark, Apache Iceberg Dremio, Superset)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>When to use Apache Xtable or Delta Lake Uniform for Data Lakehouse Interoperability</title><link>https://iceberglakehouse.com/posts/2025-01-xtable-or-uniform/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-01-xtable-or-uniform/</guid><description>
- [Blog: What is a Data Lakehouse and a Table Format?](https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-f...</description><pubDate>Tue, 07 Jan 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=xtable-uniform&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=xtable-uniform&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=xtable-uniform&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=xtable-uniform&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The value of the &lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;lakehouse model&lt;/a&gt;, along with the concept of &amp;quot;shifting left&amp;quot; by moving more data modeling and processing from the data warehouse to the data lake, has seen significant buy-in and adoption over the past few years. A lakehouse integrates data warehouse functionality into a data lake using open table formats, offering the best of both worlds for analytics and storage.&lt;/p&gt;
&lt;p&gt;Enabling lakehouse architecture with open table formats like Apache Iceberg, Delta Lake, Apache Hudi, and Apache Paimon has introduced the need to manage interoperability between these formats, especially at the boundaries of data systems. While many lakehouse implementations operate seamlessly with a single table format, scenarios arise where multiple formats are involved. To address these challenges, several solutions have emerged.&lt;/p&gt;
&lt;p&gt;In this blog, we will explore these solutions and discuss when it makes sense to use them.&lt;/p&gt;
&lt;h2&gt;The Solutions&lt;/h2&gt;
&lt;p&gt;There are primarily two types of interoperability solutions for working across different table formats:&lt;/p&gt;
&lt;h3&gt;1. Mirroring Metadata&lt;/h3&gt;
&lt;p&gt;These solutions focus on maintaining metadata for the same data files in multiple formats, enabling seamless interaction across systems.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Apache XTable:&lt;/strong&gt;&lt;br&gt;
An open-source project initially developed at Onehouse and now managed by the community, Apache XTable enables bi-directional metadata conversion between different table formats. It includes incremental metadata update features, ensuring efficiency and consistency. For Iceberg, XTable generates the metadata, which can then be registered with your preferred catalog.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Delta Lake Uniform:&lt;/strong&gt;&lt;br&gt;
A feature of the Delta Lake format, Delta Lake Uniform allows you to natively write to Delta Lake tables while maintaining a secondary metadata set in Iceberg or Hudi. For Iceberg, it can sync these tables to a Hive Metastore or Unity Catalog. When used with Unity Catalog, these tables can also be exposed for reading through an Iceberg REST Catalog interface, enabling greater flexibility and integration.&lt;/p&gt;
&lt;h3&gt;2. Data Unification Platforms&lt;/h3&gt;
&lt;p&gt;Unified Lakehouse Platforms like &lt;strong&gt;Dremio&lt;/strong&gt; or open-source query engines such as &lt;strong&gt;Trino&lt;/strong&gt; provide another solution by allowing queries across multiple formats without requiring metadata conversion. This approach enables various table formats to coexist while being queried seamlessly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dremio’s Advantage with Apache Arrow and Reflections:&lt;/strong&gt;&lt;br&gt;
Dremio leverages the power of Apache Arrow to enable in-memory columnar processing, delivering greater performance to Trino. Additionally, Dremio’s &lt;strong&gt;Reflections&lt;/strong&gt; feature provides pre-aggregated, incremental materializations that significantly accelerate query response times especially when paired with Apache Iceberg tables. With its built-in semantic layer, Dremio ensures uniform data models that can be consistently utilized across different teams and tools. This capability enables seamless collaboration, allowing data engineers, analysts, and BI tools to consume data efficiently without requiring duplicate efforts for model creation or maintenance.&lt;/p&gt;
&lt;h2&gt;The Use Cases and Which Solution to Use&lt;/h2&gt;
&lt;h3&gt;1. Joining Delta Lake Tables with On-Prem Data&lt;/h3&gt;
&lt;p&gt;If you&apos;re a Databricks user leveraging the Databricks ecosystem and its features but also have on-premises data you&apos;d like to incorporate into certain workflows, a hybrid tool like &lt;strong&gt;Dremio&lt;/strong&gt; can help. Dremio enables you to read Delta Lake tables directly from cloud storage and federate queries with your on-prem data. However, this approach bypasses the governance settings in Unity Catalog and doesn’t take full advantage of Dremio&apos;s powerful acceleration features, such as &lt;strong&gt;Live Reflections&lt;/strong&gt; and &lt;strong&gt;Incremental Reflections&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;A better option is to connect Dremio to Unity Catalog tables and read the Uniform Iceberg version of the metadata. This allows you to maintain Unity Catalog governance while also leveraging Dremio’s advanced acceleration capabilities for optimized query performance.&lt;/p&gt;
&lt;h3&gt;2. Streaming with Hudi and Reading as Iceberg/Delta&lt;/h3&gt;
&lt;p&gt;Apache Hudi is widely used for low-latency, high-frequency upserts in streaming use cases. However, when it comes to consuming this data, broader read support exists for Iceberg and Delta Lake. This is an ideal scenario for &lt;strong&gt;Apache XTable&lt;/strong&gt;, which can handle a continuous, one-way incremental metadata conversion. As data lands in Hudi, XTable can write new metadata in the preferred format, such as Iceberg or Delta, ensuring seamless consumption.&lt;/p&gt;
&lt;h3&gt;3. Using Snowflake and Databricks Side by Side&lt;/h3&gt;
&lt;p&gt;Snowflake in 2024 announce Polaris which has since become a community-run Incubating Apache project. Snowflake offers a managed Polaris service called &lt;strong&gt;Open Catalog&lt;/strong&gt;. Apache Polaris features the ability to connect &amp;quot;external catalogs.&amp;quot; This functionality allows Snowflake to read tables from other Iceberg REST Catalog-compliant systems, such as &lt;strong&gt;Nessie&lt;/strong&gt;, &lt;strong&gt;Gravitino&lt;/strong&gt;, &lt;strong&gt;Lake Keeper&lt;/strong&gt;, &lt;strong&gt;AWS Glue&lt;/strong&gt;, and &lt;strong&gt;Unity Catalog&lt;/strong&gt; directly from Polaris.&lt;/p&gt;
&lt;p&gt;By connecting Unity Catalog as an external catalog, you can utilize &lt;strong&gt;Uniform-enabled tables&lt;/strong&gt; from Delta Lake alongside other datasets within Snowflake, enabling seamless interoperability between Snowflake and Databricks environments.&lt;/p&gt;
&lt;h3&gt;4. Migrating Between Formats&lt;/h3&gt;
&lt;p&gt;If you&apos;re looking to migrate between table formats without rewriting all your data, &lt;strong&gt;Apache XTable&lt;/strong&gt; stands out as the optimal solution. XTable enables smooth transitions allowing you to adopt a new format with minimal disruption to your existing workflows.&lt;/p&gt;
&lt;h2&gt;Limitations to Keep in Mind&lt;/h2&gt;
&lt;p&gt;When using a mirrored metadata approach to interoperability, there are certain trade-offs to be aware of. One key limitation is the loss of write-side optimizations specific to the secondary format, such as &lt;strong&gt;hidden partitioning&lt;/strong&gt; in Iceberg or &lt;strong&gt;deletion vectors&lt;/strong&gt; in Delta Lake. Below is a list of specific limitations when using Uniform or XTable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Uniform-enabled Delta Lake tables&lt;/strong&gt; do not currently support &lt;strong&gt;Liquid Clustering&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deletion Vectors&lt;/strong&gt; cannot be utilized with Uniform-enabled Delta Lake tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;XTable&lt;/strong&gt; supports only &lt;strong&gt;Copy-on-Write&lt;/strong&gt; or &lt;strong&gt;Read-Optimized Views&lt;/strong&gt; of tables.&lt;/li&gt;
&lt;li&gt;XTable has &lt;strong&gt;limited support&lt;/strong&gt; for Delta Lake&apos;s &lt;strong&gt;Generated Columns&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;As organizations increasingly adopt lakehouse architectures, interoperability across multiple table formats has become a critical need. Solutions like &lt;strong&gt;Apache XTable&lt;/strong&gt; and &lt;strong&gt;Delta Lake Uniform&lt;/strong&gt; offer powerful ways to manage metadata and facilitate collaboration between different systems. Whether you&apos;re joining Delta Lake tables with on-premises data, leveraging Hudi for streaming, integrating Snowflake with Databricks, or migrating between formats, these tools provide flexibility and efficiency.&lt;/p&gt;
&lt;p&gt;However, it’s important to evaluate the limitations of each approach to ensure it aligns with your use case. While mirrored metadata solutions simplify interoperability, they come with trade-offs, particularly on the write-side optimizations of the secondary format. By understanding these constraints and leveraging platforms like &lt;strong&gt;Dremio&lt;/strong&gt; for advanced query acceleration and data unification, you can make informed decisions and maximize the potential of your lakehouse ecosystem.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=xtable-uniform&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=xtable-uniform&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=xtable-uniform&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=xtable-uniform&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>2025 Guide to Architecting an Iceberg Lakehouse</title><link>https://iceberglakehouse.com/posts/2024-12-2025-guide-architecting-an-iceberg-lakehouse/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-12-2025-guide-architecting-an-iceberg-lakehouse/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2024-12-2025-guide-architecting-an-icebe...</description><pubDate>Mon, 09 Dec 2024 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2024-12-2025-guide-architecting-an-iceberg-lakehouse/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-2025-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-2025-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-2025-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-2025-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Another year has passed, and 2024 has been an eventful one for the Apache Iceberg table format. Numerous announcements throughout the year have solidified Apache Iceberg&apos;s position as the industry standard for modern data lakehouse architectures.&lt;/p&gt;
&lt;p&gt;Here are some of the highlights from 2024:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio&lt;/strong&gt; announced the private preview of the &lt;strong&gt;Hybrid Iceberg Catalog&lt;/strong&gt;, extending governance and table maintenance capabilities for both on-premises and cloud environments, building on the cloud catalog&apos;s general availability from previous years.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snowflake&lt;/strong&gt; announces &lt;strong&gt;Polaris Catalog&lt;/strong&gt;, and then Partners with Dremio, AWS, Google and Microsoft to donate it to the Apache Software Foundation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Upsolver&lt;/strong&gt; introduced native Iceberg support, including table maintenance for streamed data landing in Iceberg tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Confluent&lt;/strong&gt; unveiled several features aimed at enhancing Iceberg integrations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Databricks&lt;/strong&gt; acquired &lt;strong&gt;Tabular&lt;/strong&gt;, a startup founded by Apache Iceberg creators Ryan Blue, Daniel Weeks, and Jason Reid.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AWS&lt;/strong&gt; announced specialized S3 table bucket types for native Apache Iceberg support.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BigQuery&lt;/strong&gt; added native Iceberg table support.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Microsoft Fabric&lt;/strong&gt; introduced &amp;quot;Iceberg Links,&amp;quot; enabling seamless access to Iceberg tables within its environment.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These advancements, along with many other companies and open-source technologies expanding their support for Iceberg, have made 2024 a remarkable year for the Apache Iceberg ecosystem.&lt;/p&gt;
&lt;p&gt;Looking ahead, there is much to be excited about for Iceberg in 2025, as detailed in &lt;a href=&quot;https://medium.com/data-engineering-with-dremio/10-future-apache-iceberg-developments-to-look-forward-to-in-2025-7292a2a2101d&quot;&gt;this blog&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;With these developments in mind, it&apos;s the perfect time to reflect on how to architect an Apache Iceberg lakehouse. This guide aims to help you design a lakehouse that takes full advantage of Iceberg&apos;s capabilities and the latest industry innovations.&lt;/p&gt;
&lt;h2&gt;Why an Apache Iceberg Lakehouse?&lt;/h2&gt;
&lt;p&gt;Before we dive into the &lt;em&gt;how&lt;/em&gt;, let’s take a moment to reflect on the &lt;em&gt;why&lt;/em&gt;. A lakehouse leverages open table formats like &lt;strong&gt;Iceberg&lt;/strong&gt;, &lt;strong&gt;Delta Lake&lt;/strong&gt;, &lt;strong&gt;Hudi&lt;/strong&gt;, and &lt;strong&gt;Paimon&lt;/strong&gt; to create data warehouse-like tables directly on your data lake. The key advantage of these tables is that they provide the transactional guarantees of a traditional data warehouse without requiring data duplication across platforms or teams.&lt;/p&gt;
&lt;p&gt;This value proposition is a major reason to consider Apache Iceberg in particular. In a world where different teams rely on different tools, Iceberg stands out with the largest ecosystem of tools for reading, writing, and: most importantly, managing Iceberg tables.&lt;/p&gt;
&lt;p&gt;Additionally, recent advancements in portable governance through catalog technologies amplify the benefits of adopting Iceberg. Features like &lt;strong&gt;hidden partitioning&lt;/strong&gt; and &lt;strong&gt;partition evolution&lt;/strong&gt; further enhance Iceberg’s appeal by maximizing flexibility and simplifying partition management. These qualities ensure that you can optimize your data lakehouse architecture for both performnance and cost.&lt;/p&gt;
&lt;h2&gt;Pre-Architecture Audit&lt;/h2&gt;
&lt;p&gt;Before we begin architecting your Apache Iceberg Lakehouse, it’s essential to perform a self-audit to clearly define your requirements. Document answers to the following questions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Where is my data currently?&lt;/strong&gt;&lt;br&gt;
Understanding where your data resides: whether on-premises, in the cloud, or across multiple locations, helps you plan for migration, integration, and governance challenges.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Which of my data is the most accessed by different teams?&lt;/strong&gt;&lt;br&gt;
Identifying the most frequently accessed datasets ensures you prioritize optimizing performance for these critical assets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Which of my data is the highest cost generator?&lt;/strong&gt;&lt;br&gt;
Knowing which datasets drive the highest costs allows you to focus on cost-saving strategies, such as tiered storage or optimizing query performance.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Which data platforms will I still need if I standardize on Iceberg?&lt;/strong&gt;&lt;br&gt;
This helps you assess which existing systems can coexist with Iceberg and which ones may need to be retired or reconfigured.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What are the SLAs I need to meet?&lt;/strong&gt;&lt;br&gt;
Service-level agreements (SLAs) dictate the performance, availability, and recovery time objectives your architecture must support.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What tools are accessing my data, and which of those are non-negotiables?&lt;/strong&gt;&lt;br&gt;
Understanding the tools your teams rely on: especially non-negotiable ones, ensures that the ecosystem around your Iceberg lakehouse remains compatible and functional.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What are my regulatory barriers?&lt;/strong&gt;&lt;br&gt;
Compliance with industry regulations or organizational policies must be factored into your architecture to avoid potential risks.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By answering these questions, you can determine which platforms align with your needs and identify the components required to generate, track, consume, and maintain your Apache Iceberg data effectively.&lt;/p&gt;
&lt;h2&gt;The Components of an Apache Iceberg Lakehouse&lt;/h2&gt;
&lt;p&gt;When moving to an Apache Iceberg lakehouse, certain fundamentals are a given - most notably that your data will be stored as &lt;strong&gt;Parquet files&lt;/strong&gt; with &lt;strong&gt;Iceberg metadata&lt;/strong&gt;. However, building a functional lakehouse requires several additional components to be carefully planned and implemented.&lt;/p&gt;
&lt;h3&gt;Key Components of an Apache Iceberg Lakehouse&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;br&gt;
Where will your data be stored? The choice of storage system (e.g., cloud object storage like AWS S3 or on-premises systems) impacts cost, scalability, and performance.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Catalog&lt;/strong&gt;&lt;br&gt;
How will your tables be tracked and governed? A catalog, such as &lt;strong&gt;Nessie&lt;/strong&gt;, &lt;strong&gt;Hive&lt;/strong&gt;, or &lt;strong&gt;AWS Glue&lt;/strong&gt;, is critical for managing metadata, enabling versioning, and supporting governance.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ingestion&lt;/strong&gt;&lt;br&gt;
What tools will you use to write data to your Iceberg tables? Ingestion tools (e.g., &lt;strong&gt;Apache Spark&lt;/strong&gt;, &lt;strong&gt;Flink&lt;/strong&gt;, &lt;strong&gt;Kafka Connect&lt;/strong&gt;) ensure data is efficiently loaded into Iceberg tables in the required format.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Integration&lt;/strong&gt;&lt;br&gt;
How will you work with Iceberg tables alongside other data? Integration tools (e.g., &lt;strong&gt;Dremio&lt;/strong&gt;, &lt;strong&gt;Trino&lt;/strong&gt;, or &lt;strong&gt;Presto&lt;/strong&gt;) allow you to query and combine Iceberg tables with other datasets and build a semantic layer that defines common business metrics.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Consumption&lt;/strong&gt;&lt;br&gt;
What tools will you use to extract value from the data? Whether for training machine learning models, generating BI dashboards, or conducting ad hoc analytics, consumption tools (e.g., &lt;strong&gt;Tableau&lt;/strong&gt;, &lt;strong&gt;Power BI&lt;/strong&gt;, &lt;strong&gt;dbt&lt;/strong&gt;) ensure data is accessible for end-users and teams.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Next Steps&lt;/h3&gt;
&lt;p&gt;In this guide, we’ll explore each of these components in detail and provide guidance on how to evaluate and select the best options for your specific use case.&lt;/p&gt;
&lt;h2&gt;Storage: Building the Foundation of Your Iceberg Lakehouse&lt;/h2&gt;
&lt;p&gt;Choosing the right storage solution is critical to the success of your Apache Iceberg lakehouse. Your decision will impact performance, scalability, cost, and compliance. Below, we’ll explore the considerations for selecting cloud, on-premises, or hybrid storage, compare cloud vendors, and evaluate alternative solutions.&lt;/p&gt;
&lt;h3&gt;Reasons to Choose Cloud, On-Premises, or Hybrid Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cloud Storage&lt;/strong&gt;:&lt;br&gt;
Cloud storage offers scalability, cost efficiency, and managed services. It’s ideal for businesses prioritizing flexibility, global accessibility, and reduced operational overhead. Examples include &lt;strong&gt;AWS S3&lt;/strong&gt;, &lt;strong&gt;Google Cloud Storage&lt;/strong&gt;, and &lt;strong&gt;Azure Data Lake Storage (ADLS)&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On-Premises Storage&lt;/strong&gt;:&lt;br&gt;
On-premises solutions provide greater control over data and are often preferred for compliance, security, or latency-sensitive workloads. These solutions require significant investment in hardware and maintenance.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Hybrid Storage&lt;/strong&gt;:&lt;br&gt;
Hybrid storage combines the benefits of both worlds. You can use on-premises storage for sensitive or high-frequency data while leveraging the cloud for archival, burst workloads, or global access.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Considerations When Choosing a Cloud Vendor&lt;/h3&gt;
&lt;p&gt;When selecting a cloud provider, consider the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Integration with Your Tech Stack&lt;/strong&gt;:&lt;br&gt;
Ensure the vendor works seamlessly with your compute and analytics tools (e.g., Apache Spark, Dremio).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cost Structure&lt;/strong&gt;:&lt;br&gt;
Evaluate storage costs, retrieval fees, and data transfer costs. Some providers, like AWS, offer tiered storage options to optimize costs for infrequent data access.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Global Availability and Latency&lt;/strong&gt;:&lt;br&gt;
If your organization operates globally, consider a provider with a robust network of regions to minimize latency.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ecosystem Services&lt;/strong&gt;:&lt;br&gt;
Consider additional services like data lakes, ML tools, or managed databases provided by the vendor.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Considerations for Alternative Storage Solutions&lt;/h3&gt;
&lt;p&gt;In addition to cloud and traditional on-prem options, there are specialized storage systems to consider:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;NetApp StorageGrid&lt;/strong&gt;: Optimized for object storage with S3 compatibility and strong data lifecycle management.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;VAST Data&lt;/strong&gt;: Designed for high-performance workloads, leveraging technologies like NVMe over Fabrics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MinIO&lt;/strong&gt;: An open-source, high-performance object storage system compatible with S3 APIs, ideal for hybrid environments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pure Storage&lt;/strong&gt;: Offers scalable, all-flash solutions for high-throughput workloads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dell EMC&lt;/strong&gt;: Provides a range of storage solutions for diverse enterprise needs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Nutanix&lt;/strong&gt;: Combines hyper-converged infrastructure with scalable object storage.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Questions to Ask Yourself When Deciding on Storage&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What are my performance requirements?&lt;/strong&gt;&lt;br&gt;
Determine the latency, throughput, and IOPS needs of your workloads.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What is my budget?&lt;/strong&gt;&lt;br&gt;
Consider initial costs, ongoing costs, and scalability.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What are my compliance and security needs?&lt;/strong&gt;&lt;br&gt;
Identify regulatory requirements and whether you need fine-grained access controls or encryption.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How frequently will I access my data?&lt;/strong&gt;&lt;br&gt;
Choose between high-performance tiers and cost-effective archival solutions based on access patterns.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Do I need scalability and flexibility?&lt;/strong&gt;&lt;br&gt;
Assess whether your workloads will grow significantly or require frequent adjustments.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What are my geographic and redundancy needs?&lt;/strong&gt;&lt;br&gt;
Decide if data needs to be replicated across regions or stored locally for compliance.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Selecting the right storage for your Iceberg lakehouse is a foundational step. By thoroughly evaluating your needs and the available options, you can ensure a storage solution that aligns with your performance, cost, and governance requirements.&lt;/p&gt;
&lt;h2&gt;Catalog: Managing Your Iceberg Tables&lt;/h2&gt;
&lt;p&gt;A lakehouse catalog is essential for tracking your Apache Iceberg tables and ensuring consistent access to the latest metadata across tools and teams. The catalog serves as a centralized registry, enabling seamless governance and collaboration.&lt;/p&gt;
&lt;h3&gt;Types of Iceberg Lakehouse Catalogs&lt;/h3&gt;
&lt;p&gt;Iceberg lakehouse catalogs come in two main flavors:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Self-Managed Catalogs&lt;/strong&gt;&lt;br&gt;
With a self-managed catalog, you deploy and maintain your own catalog system. Examples include &lt;strong&gt;Nessie&lt;/strong&gt;, &lt;strong&gt;Hive&lt;/strong&gt;, &lt;strong&gt;Polaris&lt;/strong&gt;, &lt;strong&gt;Lakekeeper&lt;/strong&gt;, and &lt;strong&gt;Gravitino&lt;/strong&gt;. While this approach requires operational effort to maintain the deployment, it provides portability of your tables and governance capabilities.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Managed Catalogs&lt;/strong&gt;&lt;br&gt;
Managed catalogs are provided as a service, offering the same benefits of portability and governance while eliminating the overhead of maintaining the deployment. Examples include &lt;strong&gt;Dremio Catalog&lt;/strong&gt; and &lt;strong&gt;Snowflake&apos;s Open Catalog&lt;/strong&gt;, which are managed versions of Polaris.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Importance of the Iceberg REST Catalog Specification&lt;/h3&gt;
&lt;p&gt;A key consideration when selecting a catalog is whether it supports the &lt;strong&gt;Iceberg REST Catalog Spec&lt;/strong&gt;. This specification ensures compatibility with the broader Iceberg ecosystem, providing assurance that your lakehouse can integrate seamlessly with other Iceberg tools.&lt;/p&gt;
&lt;h4&gt;Catalogs Supporting the REST Spec:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Polaris&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gravitino&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unity Catalog&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lakekeeper&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Nessie&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Catalogs Without REST Spec Support (Yet):&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Hive&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;JDBC&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AWS Glue&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BigQuery&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Choosing the Right Catalog&lt;/h3&gt;
&lt;p&gt;Here are some considerations to guide your choice:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;If you have on-prem data&lt;/strong&gt;:&lt;br&gt;
Dremio Catalog is the only managed catalog offering that allows for on-prem tables to co-exist with cloud tables.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;If you are already a Snowflake user&lt;/strong&gt;:&lt;br&gt;
Snowflake&apos;s Open Catalog offers an easy path to adopting Iceberg, allowing you to leverage Iceberg while staying within the Snowflake ecosystem.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;If you use Databricks with Delta Lake&lt;/strong&gt;:&lt;br&gt;
Unity Catalog’s &lt;strong&gt;Uniform&lt;/strong&gt; feature allows you to maintain an Iceberg copy of your Delta Lake table metadata, enabling compatibility with the Iceberg ecosystem.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;If you are heavily invested in the AWS ecosystem&lt;/strong&gt;:&lt;br&gt;
AWS Glue provides excellent interoperability within AWS. However, its lack of REST Catalog support may limit its usability outside the AWS ecosystem.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Selecting the right catalog is critical for ensuring your Iceberg lakehouse operates efficiently and integrates well with your existing tools. By understanding the differences between self-managed and managed catalogs, as well as the importance of REST Catalog support, you can make an informed decision that meets your needs for portability, governance, and compatibility.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/why-thinking-about-apache-iceberg-catalogs-like-nessie-and-apache-polaris-incubating-matters/&quot;&gt;Why Thinking about Apache Iceberg Catalogs Matters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/the-importance-of-dremios-hybrid-lakehouse-catalog-b9ee9937ab4e?source=---------3&quot;&gt;Importance of Dremio&apos;s Hybrid Lakehouse Catalog&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Ingesting Data into Iceberg: Managing the Flow of Data&lt;/h2&gt;
&lt;p&gt;Ingesting data into Apache Iceberg tables is a critical step in building a functional lakehouse. The tools and strategies you choose will depend on your infrastructure, data workflows, and resource constraints. Let’s explore the key options and considerations for data ingestion.&lt;/p&gt;
&lt;h3&gt;Managing Your Own Ingestion Clusters&lt;/h3&gt;
&lt;p&gt;For those who prefer complete control, managing your own ingestion clusters offers flexibility and customization. This approach allows you to handle both &lt;strong&gt;batch&lt;/strong&gt; and &lt;strong&gt;streaming&lt;/strong&gt; data using tools like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Apache Spark&lt;/strong&gt;: Ideal for large-scale batch processing and ETL workflows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Kafka&lt;/strong&gt; or &lt;strong&gt;Apache Flink&lt;/strong&gt;: Excellent choices for real-time streaming data ingestion.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While these tools provide robust capabilities, they require significant effort to deploy, monitor, and maintain.&lt;/p&gt;
&lt;h3&gt;Leveraging Managed Services for Ingestion&lt;/h3&gt;
&lt;p&gt;If operational overhead is a concern, managed services can streamline the ingestion process. These services handle much of the complexity, offering ease of use and scalability:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Batch Ingestion Tools&lt;/strong&gt;:&lt;br&gt;
Examples include &lt;strong&gt;Fivetran&lt;/strong&gt;, &lt;strong&gt;Airbyte&lt;/strong&gt;, &lt;strong&gt;AWS Glue&lt;/strong&gt;, and &lt;strong&gt;ETleap&lt;/strong&gt;. These tools are well-suited for scheduled ETL tasks and periodic data loads.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Streaming Ingestion Tools&lt;/strong&gt;:&lt;br&gt;
Examples include &lt;strong&gt;Upsolver&lt;/strong&gt;, &lt;strong&gt;Delta Stream&lt;/strong&gt;, &lt;strong&gt;Estuary&lt;/strong&gt;, &lt;strong&gt;Confluent&lt;/strong&gt;, and &lt;strong&gt;Decodable&lt;/strong&gt;, which are optimized for real-time data processing and ingestion.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Questions to Ask When Selecting Ingestion Tools&lt;/h3&gt;
&lt;p&gt;To narrow down your options and define your hard requirements, consider the following questions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What is the nature of your data workflow?&lt;/strong&gt;&lt;br&gt;
Determine if your use case primarily involves batch processing, streaming data, or a combination of both.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What is your tolerance for operational complexity?&lt;/strong&gt;&lt;br&gt;
Decide whether you want to manage your own clusters or prefer managed services to reduce overhead.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What are your performance and scalability requirements?&lt;/strong&gt;&lt;br&gt;
Assess whether your ingestion tool can handle the volume, velocity, and variety of your data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How critical is real-time processing?&lt;/strong&gt;&lt;br&gt;
If near-instantaneous data updates are crucial, prioritize streaming tools over batch processing solutions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What is your existing tech stack?&lt;/strong&gt;&lt;br&gt;
Consider tools that integrate well with your current infrastructure, such as cloud services, catalogs, or BI tools.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What is your budget?&lt;/strong&gt;&lt;br&gt;
Balance cost considerations between self-managed clusters (higher operational costs) and managed services (subscription-based pricing).&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Choosing the right ingestion strategy is essential for ensuring your Iceberg lakehouse runs smoothly. By weighing the trade-offs between managing your own ingestion clusters and leveraging managed services, and by asking the right questions, you can design an ingestion pipeline that aligns with your performance, cost, and operational goals.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/cdc-with-apache-iceberg/&quot;&gt;Apache Iceberg CDC Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/8-tools-for-ingesting-data-into-apache-iceberg/&quot;&gt;8 Tools for Apache Iceberg Ingestion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Data Integration: Bridging the Gap for a Unified Lakehouse Experience&lt;/h2&gt;
&lt;p&gt;Not all your data will migrate to Apache Iceberg immediately - or ever. Moving existing workloads to Iceberg requires thoughtful planning and a phased approach. However, you can still deliver the &amp;quot;Iceberg Lakehouse experience&amp;quot; to your end-users upfront, even if not all your data resides in Iceberg. This is where data integration, data virtualization, or a unified lakehouse platform like &lt;strong&gt;Dremio&lt;/strong&gt; becomes invaluable.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/why-dremio-and-apache-iceberg/&quot;&gt;How Dremio Enhances the Iceberg Journey&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/dremio-best-sql-engine-for-apache-iceberg/&quot;&gt;3 Reasons Dremio is Best Query Engine for Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/10-use-cases-for-dremio-in-your-data-architecture-64a98d2be8bc?source=---------0&quot;&gt;10 Use Cases for Dremio in your Data Architecture&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Why Dremio for Data Integration?&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Unified Access Across Data Sources&lt;/strong&gt;&lt;br&gt;
Dremio allows you to connect and query all your data sources in one place. Even if your datasets haven’t yet migrated to Iceberg, you can combine them with Iceberg tables seamlessly. Dremio’s fast query engine ensures performant analytics, regardless of where your data resides.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Built-In Semantic Layer for Consistency&lt;/strong&gt;&lt;br&gt;
Dremio includes a built-in semantic layer to define commonly used datasets across teams. This layer ensures consistent and accurate data usage for your entire organization. Since the semantic layer is based on SQL views, transitioning data from its original source to an Iceberg table is seamless - simply update the SQL definition of the views. Your end-users won’t even notice the change, yet they’ll immediately benefit from the migration.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Performance Boost with Iceberg-Based Reflections&lt;/strong&gt;&lt;br&gt;
Dremio’s &lt;strong&gt;Reflections&lt;/strong&gt; feature accelerates queries on your data. When your data is natively in Iceberg, reflections are refreshed incrementally and updated automatically when the underlying dataset changes. This results in faster query performance and reduced maintenance effort. Learn more about reflections in &lt;a href=&quot;https://www.dremio.com/blog/iceberg-lakehouses-and-dremio-reflections/&quot;&gt;this blog post&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Delivering the Lakehouse Experience&lt;/h3&gt;
&lt;p&gt;As more of your data lands in Iceberg, Dremio enables you to seamlessly integrate it into a governed semantic layer. This layer supports a wide range of data consumers, including BI tools, notebooks, and reporting platforms, ensuring all teams can access and use the data they need effectively.&lt;/p&gt;
&lt;p&gt;By leveraging Dremio, you can bridge the gap between legacy data systems and your Iceberg lakehouse, providing a consistent and performant data experience while migrating to Iceberg at a pace that works for your organization.&lt;/p&gt;
&lt;h2&gt;Consumers: Empowering Teams with Accessible Data&lt;/h2&gt;
&lt;p&gt;Once your data is stored, integrated, and organized in your Iceberg lakehouse, the final step is ensuring it can be consumed effectively by your teams. Data consumers rely on various tools for analytics, reporting, visualization, and machine learning. A robust lakehouse architecture ensures that all these tools can access the data they need, even if they don’t natively support Apache Iceberg.&lt;/p&gt;
&lt;h3&gt;Types of Data Consumers and Their Tools&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Python Notebooks&lt;/strong&gt;&lt;br&gt;
Python notebooks, such as &lt;strong&gt;Jupyter&lt;/strong&gt;, &lt;strong&gt;Google Colab&lt;/strong&gt;, or &lt;strong&gt;VS Code Notebooks&lt;/strong&gt;, are widely used by data scientists and analysts for exploratory data analysis, data visualization, and machine learning. These notebooks leverage libraries like &lt;strong&gt;Pandas&lt;/strong&gt;, &lt;strong&gt;PyArrow&lt;/strong&gt;, and &lt;strong&gt;Dask&lt;/strong&gt; to process data from Iceberg tables, often via a platform like Dremio for seamless access.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;BI Tools&lt;/strong&gt;&lt;br&gt;
Business intelligence tools like &lt;strong&gt;Tableau&lt;/strong&gt;, &lt;strong&gt;Power BI&lt;/strong&gt;, and &lt;strong&gt;Looker&lt;/strong&gt; are used to create interactive dashboards and reports. While these tools may not natively support Iceberg, Dremio acts as a bridge, providing direct access to Iceberg tables and unifying them with other datasets through its semantic layer.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reporting Tools&lt;/strong&gt;&lt;br&gt;
Tools such as &lt;strong&gt;Crystal Reports&lt;/strong&gt;, &lt;strong&gt;Microsoft Excel&lt;/strong&gt;, and &lt;strong&gt;Google Sheets&lt;/strong&gt; are commonly used for generating structured reports. Dremio&apos;s integration capabilities make it easy for reporting tools to query Iceberg tables alongside other data sources.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Machine Learning Platforms&lt;/strong&gt;&lt;br&gt;
Platforms like &lt;strong&gt;Databricks&lt;/strong&gt;, &lt;strong&gt;SageMaker&lt;/strong&gt;, or &lt;strong&gt;Azure ML&lt;/strong&gt; require efficient access to large datasets for training models. With Dremio, these platforms can query Iceberg tables directly or through unified views, simplifying data preparation workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ad Hoc Querying Tools&lt;/strong&gt;&lt;br&gt;
Tools like &lt;strong&gt;DBeaver&lt;/strong&gt;, &lt;strong&gt;SQL Workbench&lt;/strong&gt;, or even command-line utilities are popular among engineers and analysts for quick SQL-based data exploration. These tools can connect to Dremio to query Iceberg tables without additional configuration.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Dremio as the Integration Layer&lt;/h3&gt;
&lt;p&gt;Most platforms, even if they don’t have native Iceberg capabilities, can leverage Dremio to access Iceberg tables alongside other datasets. Here’s how Dremio enhances the consumer experience:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Unified Data Access&lt;/strong&gt;:&lt;br&gt;
Dremio’s ability to virtualize data from multiple sources means that end-users don’t need to know where the data resides. Whether it’s Iceberg tables or legacy systems, all datasets can be queried together.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Semantic Layer&lt;/strong&gt;:&lt;br&gt;
Dremio’s semantic layer defines business metrics and datasets, ensuring consistent definitions across all tools and teams. Users querying data via BI tools or Python notebooks can rely on the same, agreed-upon metrics.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Performance Optimization&lt;/strong&gt;:&lt;br&gt;
Dremio’s &lt;strong&gt;Reflections&lt;/strong&gt; accelerate queries, providing near-instant response times for dashboards, reports, and interactive analyses, even with large Iceberg datasets.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By enabling data consumers with tools they already know and use, your Iceberg lakehouse can become a powerful, accessible platform for delivering insights and driving decisions. Leveraging Dremio ensures that even tools without native Iceberg support can fully participate in your data ecosystem, helping you maximize the value of your Iceberg lakehouse.&lt;/p&gt;
&lt;h2&gt;Conclusion: Your Journey to a Seamless Iceberg Lakehouse&lt;/h2&gt;
&lt;p&gt;Architecting an Iceberg Lakehouse is not just about adopting a new technology; it’s about transforming how your organization stores, governs, integrates, and consumes data. This guide has walked you through the essential components: from storage and catalogs to ingestion, integration, and consumption, highlighting the importance of thoughtful planning and the tools available to support your journey.&lt;/p&gt;
&lt;p&gt;Apache Iceberg’s open table format, with its unique features like hidden partitioning, partition evolution, and broad ecosystem support, provides a solid foundation for a modern data lakehouse. By leveraging tools like &lt;strong&gt;Dremio&lt;/strong&gt; for integration and query acceleration, you can deliver the &amp;quot;Iceberg Lakehouse experience&amp;quot; to your teams immediately, even as you transition existing workloads over time.&lt;/p&gt;
&lt;p&gt;As 2025 unfolds, the Apache Iceberg ecosystem will continue to grow, bringing new innovations and opportunities to refine your architecture further. By taking a structured approach and selecting the right tools for your needs, you can build a flexible, performant, and cost-efficient lakehouse that empowers your organization to make data-driven decisions at scale.&lt;/p&gt;
&lt;p&gt;Let this guide be the starting point for your Iceberg Lakehouse journey - designed for today and ready for the future.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>10 Future Apache Iceberg Developments to Look forward to in 2025</title><link>https://iceberglakehouse.com/posts/2024-11-10-Iceberg-developments/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-11-10-Iceberg-developments/</guid><description>
&gt; **Cross-posted.** This article&apos;s canonical home is [Data Lakehouse Hub](https://datalakehousehub.com/posts/2024-11-10-Iceberg-developments/).

- [B...</description><pubDate>Mon, 25 Nov 2024 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posted.&lt;/strong&gt; This article&apos;s canonical home is &lt;a href=&quot;https://datalakehousehub.com/posts/2024-11-10-Iceberg-developments/&quot;&gt;Data Lakehouse Hub&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-developments&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-developments&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-developments&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-developments&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apache Iceberg remains at the forefront of innovation, redefining how we think about data lakehouse architectures. In 2025, the Iceberg ecosystem is poised for significant advancements that will empower organizations to handle data more efficiently, securely, and at scale. From enhanced interoperability with modern data tools to new features that simplify data management, the year ahead promises to be transformative. In this blog, we’ll explore 10 exciting developments in the Apache Iceberg ecosystem that you should keep an eye on, offering a glimpse into the future of open data lakehouse technology.&lt;/p&gt;
&lt;h2&gt;1. Scan Planning Endpoint in the Iceberg REST Catalog Specification&lt;/h2&gt;
&lt;p&gt;One of the most anticipated updates in the Iceberg ecosystem for 2025 is the addition of a &amp;quot;Scan Planning&amp;quot; endpoint to the Iceberg REST Catalog specification. This enhancement will allow query engines to delegate scan planning: the process of reading metadata to determine which files are needed for a query, to the catalog itself. This new capability opens the door to several exciting possibilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimized Scan Planning with Caching&lt;/strong&gt;: By handling scan planning at the catalog level, frequently submitted queries can benefit from cached scan plans. This optimization reduces redundant metadata reads and accelerates query execution, irrespective of the engine used to submit the query.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Enhanced Interoperability Between Table Formats&lt;/strong&gt;: With the catalog managing scan planning, the responsibility of supporting table formats shifts from the engine to the catalog. This makes it possible for Iceberg REST-compliant catalogs to facilitate querying tables in multiple formats. For example, a catalog could generate file lists for queries across various table formats, paving the way for broader interoperability.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Looking ahead, the introduction of this endpoint is not only a step toward improving query performance but also a glimpse into a future where catalogs become the central hub for table format compatibility. To fully realize this vision, a similar endpoint for handling metadata writes may be introduced in the future, further extending the catalog&apos;s capabilities.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/apache/iceberg/pull/11369&quot;&gt;Scan Planning Pull Request&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;2. Interoperable Views in Apache Iceberg&lt;/h2&gt;
&lt;p&gt;Interoperable views are another major development to watch in the Apache Iceberg ecosystem for 2025. While Iceberg already supports a view specification, the current approach has limitations: it stores the SQL used to define the view, but since SQL syntax varies across engines, resolving these views is not always feasible in a multi-engine environment.&lt;/p&gt;
&lt;p&gt;To address this challenge, two promising solutions are being explored:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;SQL Transpilation with Frameworks like SQLGlot&lt;/strong&gt;: By leveraging SQL transpilation tools such as SQLGlot, the SQL defining a view can be translated between different dialects. This approach builds on the existing view specification, which includes a &amp;quot;dialect&amp;quot; property to identify the SQL syntax used to define the view. This enables engines to resolve views by translating the SQL into a dialect they support.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Intermediate Representation for Views&lt;/strong&gt;: Another approach involves using an intermediate format to represent views, independent of SQL syntax. Two notable projects being discussed in this context are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Apache Calcite&lt;/strong&gt;: An open-source project that provides a framework for parsing, validating, and optimizing relational algebra queries. Calcite could serve as a bridge, converting SQL into a standardized logical plan that any engine can execute.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Substrait&lt;/strong&gt;: A cross-language specification for defining and exchanging query plans. Substrait focuses on representing queries in a portable, engine-agnostic format, making it a strong candidate for enabling true interoperability.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These advancements aim to make views in Iceberg truly interoperable, allowing seamless sharing and resolution of views across different engines and workflows. Whether through SQL transpilation or an intermediate format, these improvements will significantly enhance Iceberg&apos;s flexibility in heterogeneous data environments.&lt;/p&gt;
&lt;h2&gt;3. Materialized Views in Apache Iceberg&lt;/h2&gt;
&lt;p&gt;A materialized view stores a query definition as a logical table, with precomputed data that serves query results. By shifting the computational cost to precomputation, materialized views significantly improve query performance while maintaining flexibility. The Iceberg community is working towards a common metadata format for materialized views, enabling their creation, reading, and updating across different engines.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Key Features of Iceberg Materialized Views&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Metadata Structure&lt;/strong&gt;: A materialized view is realized as a combination of an Iceberg view (the &amp;quot;common view&amp;quot;) storing the query definition and a pointer to the precomputed data, and an Iceberg table (the &amp;quot;storage table&amp;quot;) holding the precomputed data. The storage table is marked with states like &amp;quot;fresh,&amp;quot; &amp;quot;stale,&amp;quot; or &amp;quot;invalid&amp;quot; based on its alignment with source table snapshots.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Storage Table State Management&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;fresh&lt;/strong&gt; state indicates the precomputed data is up-to-date.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;stale&lt;/strong&gt; state requires the query engine to decide between full or incremental refresh.&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;invalid&lt;/strong&gt; state mandates a full refresh.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Refresh Mechanisms&lt;/strong&gt;: Materialized views can be refreshed through various methods, including event-driven triggers, query-time checks, scheduled refreshes, or manual operations. These methods ensure the precomputed data remains relevant to the underlying data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Query Optimization&lt;/strong&gt;: Queries can use precomputed data directly if it meets freshness criteria (e.g., the &lt;code&gt;materialization.data.max-staleness&lt;/code&gt; property). Otherwise, the query engine determines the next steps, such as refreshing the data or falling back to the original view definition.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Interoperability and Governance&lt;/strong&gt;: The shared metadata format supports lineage tracking and consistent states, making materialized views easy to manage and audit across engines.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;Impact on the Iceberg Ecosystem&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Materialized views in Iceberg offer a way to optimize query performance while ensuring that optimizations are portable across systems. By providing a standard for metadata and refresh mechanisms, Iceberg hopes to enable organizations to harness the benefits of materialized views without being locked into specific query engines. This development will make Iceberg an even more compelling choice for building scalable, engine-agnostic data lakehouses.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/apache/iceberg/pull/11041&quot;&gt;Materilized View Pull Request&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;4. Variant Data Format in Apache Iceberg&lt;/h2&gt;
&lt;p&gt;The upcoming introduction of the &lt;strong&gt;variant data format&lt;/strong&gt; in Apache Iceberg marks a significant advancement in handling semi-structured data. While Iceberg already supports a JSON data format, the variant data type offers a more efficient and versatile approach to managing JSON-like data, aligning with the Spark variant format.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;How Variant Differs from JSON&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;The variant data format is designed to provide a structured representation of semi-structured data, improving performance and usability:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Typed Representation&lt;/strong&gt;: Unlike traditional JSON, which treats data as text, the variant format incorporates schema-aware types. This allows for faster processing and easier integration with analytical workflows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Efficient Storage&lt;/strong&gt;: By leveraging columnar storage principles, variant data optimizes storage space and access patterns for semi-structured data, reducing the overhead associated with parsing and serializing JSON.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Query Flexibility&lt;/strong&gt;: Variant enables advanced querying capabilities, such as filtering and aggregations, on semi-structured data without requiring extensive transformations or data flattening.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;Benefits of the Variant Format&lt;/strong&gt;&lt;/h4&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Improved Performance&lt;/strong&gt;: By avoiding the need to repeatedly parse JSON strings, the variant format enables faster data access and manipulation, making it ideal for high-performance analytical queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Better Interoperability&lt;/strong&gt;: With consensus on using the Spark variant format, this addition ensures compatibility across engines that support the same standard.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Simplified Workflows&lt;/strong&gt;: Variant makes it easier to work with semi-structured data within Iceberg tables, allowing for more straightforward schema evolution and query optimizations.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/apache/iceberg/pull/10831&quot;&gt;Variant Data Format Pull Request&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;5. Native Geospatial Data Type Support in Apache Iceberg&lt;/h2&gt;
&lt;p&gt;The integration of geospatial data types into Apache Iceberg is poised to open up powerful capabilities for organizations managing location-based data. While geospatial data has long been supported by big data tools like GeoParquet, Apache Sedona, and GeoMesa, Iceberg&apos;s position as a central table format makes the addition of native geospatial support a natural evolution. Leveraging prior efforts such as Geolake and Havasu, this proposal aims to bring geospatial functionality into Iceberg without the need for project forks.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Proposed Features&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;The geospatial extension for Iceberg will introduce:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Geospatial Data Types&lt;/strong&gt;: Support for types like &lt;code&gt;POINT&lt;/code&gt;, &lt;code&gt;LINESTRING&lt;/code&gt;, and &lt;code&gt;POLYGON&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Geospatial Expressions&lt;/strong&gt;: Functions such as &lt;code&gt;ST_COVERS&lt;/code&gt;, &lt;code&gt;ST_COVERED_BY&lt;/code&gt;, and &lt;code&gt;ST_INTERSECTS&lt;/code&gt; for spatial querying.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Geospatial Partition Transforms&lt;/strong&gt;: Partitioning using geospatial transforms like &lt;code&gt;XZ2&lt;/code&gt; to optimize query filtering.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Geospatial Sorting&lt;/strong&gt;: Sorting data with space-filling curves, such as the Hilbert curve, to enhance data locality and query efficiency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spark Integration&lt;/strong&gt;: Built-in support for working with geospatial data in Spark.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;Key Use Cases&lt;/strong&gt;&lt;/h4&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Table Creation with Geospatial Types&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;   CREATE TABLE geom_table (geom GEOMETRY);
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Inserting Geospatial Data&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;  INSERT INTO geom_table VALUES (&apos;POINT(1 2)&apos;, &apos;LINESTRING(1 2, 3 4)&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Querying with Geospatial Predicates&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM geom_table WHERE ST_COVERS(geom, ST_POINT(0.5, 0.5));
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;Geospatial Partitioning&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE geom_table ADD PARTITION FIELD (xz2(geom));
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;&lt;strong&gt;Optimized File Sorting for Geospatial Queries&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CALL rewrite_data_files(table =&amp;gt; `geom_table`, sort_order =&amp;gt; `hilbert(geom)`);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Benefits&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Efficient Geospatial Analysis&lt;/strong&gt;: By natively supporting geospatial data types and operations, Iceberg will enable faster and more scalable location-based queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved Query Optimization&lt;/strong&gt;: Partition transforms and spatial sorting will enhance filtering and reduce data scan overhead.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Broad Ecosystem Integration&lt;/strong&gt;: With Spark integration and compatibility with geospatial standards like GeoParquet, Iceberg becomes a powerful tool for geospatial data management.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/apache/iceberg/issues/10260&quot;&gt;GeoSpatial Proposal&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;6. Apache Polaris Federated Catalogs&lt;/h2&gt;
&lt;p&gt;Apache Polaris is expanding its capabilities with the concept of &lt;strong&gt;federated catalogs&lt;/strong&gt;, allowing seamless connectivity to external catalogs such as Nessie, Gravitino, and Unity. This feature makes the tables in these external catalogs visible and queryable from a Polaris connection, streamlining Iceberg data federation within a single interface.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Current State&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;At present, Polaris supports &lt;strong&gt;read-only external catalogs&lt;/strong&gt;, enabling users to query and analyze data from connected catalogs without duplicating data or moving it between systems. This functionality simplifies data integration and allows users to leverage the strengths of multiple catalogs from a centralized Polaris environment.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Future Vision: Read/Write Federation&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;There is active discussion and interest within the community to extend this capability to &lt;strong&gt;read/write catalog federation&lt;/strong&gt;. With this enhancement, users will be able to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Read&lt;/strong&gt; data from external catalogs as they currently do.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write&lt;/strong&gt; data directly back to external catalogs, making updates, inserts, and schema modifications possible.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;Key Benefits of Federated Catalogs&lt;/strong&gt;&lt;/h4&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Unified Data Access&lt;/strong&gt;: Query data across multiple catalogs without the need for extensive ETL processes or duplication.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved Interoperability&lt;/strong&gt;: Leverage the unique features of external catalogs like Nessie and Unity directly within Polaris.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streamlined Workflows&lt;/strong&gt;: Enable read/write operations to external catalogs, reducing friction in workflows that span multiple systems.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enhanced Governance&lt;/strong&gt;: Centralize metadata and access controls while interacting with data stored in different catalogs.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;&lt;strong&gt;The Road Ahead&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;The move toward read/write federation make it easier for organizations to manage diverse data ecosystems. By bridging the gap between disparate catalogs, Polaris continues to simplify data management and empower users to unlock the full potential of their data.&lt;/p&gt;
&lt;h2&gt;7. Table Maintenance Service in Apache Polaris&lt;/h2&gt;
&lt;p&gt;A feature beign discussed in the Apache Polaris community is the &lt;strong&gt;table maintenance service&lt;/strong&gt;, designed to streamline table optimization and maintenance workflows. This service would function as a notification system, broadcasting maintenance requests to subscribed tools, enabling automated and efficient table management.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;How It Could Works&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;The table maintenance service allows users to configure maintenance triggers based on specific conditions. For example, users could set a table to be optimized every 10 snapshots. When this condition is met, the service broadcasts a notification to subscribed tools such as Dremio, Upsolver and any other service that optimizes Iceberg tables.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Key Use Cases&lt;/strong&gt;&lt;/h4&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Automated Table Optimization&lt;/strong&gt;: Configure tables to trigger maintenance tasks, such as compaction or sorting, at predefined intervals or based on conditions like snapshot count.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cross-Tool Integration&lt;/strong&gt;: Seamlessly integrate with multiple tools in the ecosystem, enabling flexible and automated workflows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cadence Management&lt;/strong&gt;: Ensure maintenance tasks are performed on a schedule or event-driven basis, aligned with the table’s operational needs.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;&lt;strong&gt;Benefits&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reduced Operational Overhead&lt;/strong&gt;: Automate repetitive maintenance tasks, minimizing the need for manual intervention.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved Performance&lt;/strong&gt;: Regular maintenance ensures tables remain optimized for query performance and storage efficiency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ecosystem Flexibility&lt;/strong&gt;: By supporting a wide range of subscribing tools, the service adapts to diverse data pipelines and optimization strategies.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;8. Catalog Versioning in Apache Polaris&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Catalog versioning&lt;/strong&gt;, a transformative feature currently available in the &lt;a href=&quot;https://www.projectnessie.org&quot;&gt;Nessie catalog&lt;/a&gt;, is under discussion for inclusion in the Apache Polaris ecosystem. Adding catalog versioning to Polaris would unlock a range of powerful capabilities, positioning Polaris as a unifying force for the most innovative ideas in the Iceberg catalog space.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;The Power of Catalog Versioning&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Catalog versioning provides a robust foundation for advanced data management scenarios by enabling:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Multi-Table Transactions&lt;/strong&gt;: Ensure atomic operations across multiple tables for consistent updates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-Table Rollbacks&lt;/strong&gt;: Revert changes across multiple tables to a consistent state, enhancing error recovery.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Zero-Copy Environments&lt;/strong&gt;: Create lightweight, zero-copy development or testing environments without duplicating data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-Table Isolation&lt;/strong&gt;: Create a branch to isolate work on data without affecting the main branch.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tagging and Versioning&lt;/strong&gt;: Mark specific states of the catalog for easy access, auditing, or rollback.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;Proposed Integration with Polaris&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Discussions around bringing catalog versioning to Polaris also involve designing a new model that aligns with Polaris&apos; architecture. This integration could enable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Unified Catalog Management&lt;/strong&gt;: Allow users to manage table states and snapshots across all their data directly in Polaris.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enhanced Interoperability&lt;/strong&gt;: Unify Polaris&apos; capabilities with the multi-table capabilities of Nessie, creating a comprehensive solution for data management.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;Potential Impact&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Advanced Data Workflows&lt;/strong&gt;: Catalog versioning would enable Polaris users to orchestrate complex workflows with confidence and precision.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved Collaboration&lt;/strong&gt;: Teams could work in parallel using isolated views of the catalog, fostering innovation without risk to production data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ecosystem Leadership&lt;/strong&gt;: By adopting catalog versioning, Polaris would become the definitive platform for managing Iceberg catalogs, consolidating the best ideas from the community.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If implemented, catalog versioning in Polaris would elevate its capabilities, making it an indispensable tool for organizations looking to modernize their data lakehouse operations.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;Try Catalog Versioning on your Laptop&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;9. Updates to Iceberg&apos;s Delete File Specification&lt;/h2&gt;
&lt;p&gt;Apache Iceberg’s innovative delete file specification has been central to enabling efficient upserts by managing record deletions with minimal performance overhead. Currently, Iceberg supports two types of delete files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Position Deletes&lt;/strong&gt;: Track the position of a deleted record in a data file.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Equality Deletes&lt;/strong&gt;: Track the values being deleted across multiple rows.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While these mechanisms are effective, each comes with trade-offs. Position deletes can lead to high I/O costs when reconciling deletions during queries, while equality deletes, though fast to write, impose significant costs during reads and optimizations. Discussions in the Iceberg community propose enhancements to both approaches.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Proposed Changes to Position Deletes&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;The key proposal is to transition position deletes from their current file-based storage to &lt;strong&gt;deletion vectors&lt;/strong&gt; within Puffin files. Puffin, a specification for structured metadata storage, allows for compact and efficient storage of additional data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Benefits of Storing Deletion Vectors in Puffin Files&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reduced I/O Costs&lt;/strong&gt;: Instead of opening multiple delete files, engines can read a single blob within a Puffin file, significantly improving read performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streamlined Metadata Access&lt;/strong&gt;: Puffin files consolidate metadata and auxiliary information, simplifying the reconciliation process.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;Reimagining Equality Deletes for Streaming&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Another area of discussion is rethinking equality deletes to better suit streaming scenarios. The current design prioritizes fast writes but incurs steep costs for reading and optimizing. Possible enhancements include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Streaming-Optimized Delete Mechanisms&lt;/strong&gt;: Developing a model where deletes are reconciled incrementally in real-time, reducing read-time overhead.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hybrid Approaches&lt;/strong&gt;: Combining aspects of position and equality deletes to balance the cost of writes, reads, and optimizations.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;Impact of These Changes&lt;/strong&gt;&lt;/h4&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Improved Query Performance&lt;/strong&gt;: Faster reconciliation during queries, especially for workloads with high delete volumes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Better Streaming Support&lt;/strong&gt;: Lower overhead for real-time processing scenarios, making Iceberg more viable for continuous data ingestion and updates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enhanced Scalability&lt;/strong&gt;: Reduced I/O during reconciliation improves scalability for large-scale datasets.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;10. General Availability of the Dremio Hybrid Catalog&lt;/h3&gt;
&lt;p&gt;The &lt;strong&gt;Dremio Hybrid Catalog&lt;/strong&gt;, currently in private preview, is set to become generally available sometime in 2025. Built on the foundation of the Polaris catalog, this managed Iceberg catalog is tightly integrated into Dremio, offering a streamlined and feature-rich experience for managing data across cloud and on-prem environments.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Key Features of the Hybrid Catalog&lt;/strong&gt;&lt;/h4&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Integrated Table Maintenance&lt;/strong&gt;: Automate table maintenance tasks such as compaction, cleanup, and optimization, ensuring that tables remain performant with minimal user intervention.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-Location Cataloging&lt;/strong&gt;: Seamlessly manage and catalog tables across diverse storage environments, including multiple cloud providers and on-premises storage solutions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Polaris-Based Capabilities&lt;/strong&gt;: Leverage the powerful features of the Polaris catalog, including RBAC, external catalogs, and potential catalog versioning (if implemented by Polaris).&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;&lt;strong&gt;Benefits of the Dremio Hybrid Catalog&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Simplified Data Management&lt;/strong&gt;: Provides a unified interface for managing Iceberg tables across different environments, reducing complexity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enhanced Performance&lt;/strong&gt;: Automated maintenance and cleanup ensure tables are always optimized for fast and efficient queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flexibility and Scalability&lt;/strong&gt;: Supports hybrid architectures, allowing organizations to manage data wherever it resides without sacrificing control or performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;Impact on the Iceberg Ecosystem&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;The general availability of the Dremio Hybrid Catalog will mark a significant milestone for organizations adopting Iceberg. By integrating Polaris&apos; advanced capabilities into a managed catalog, Dremio is poised to deliver a seamless and efficient solution for managing data lakehouse environments. This innovation underscores Dremio&apos;s commitment to making Iceberg a cornerstone of modern data management strategies.&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;As we look ahead to 2025, the Apache Iceberg ecosystem is set to deliver groundbreaking advancements that will transform how organizations manage and analyze their data. From enhanced query optimization with scan planning endpoints and materialized views to broader support for geospatial and semi-structured data, Iceberg continues to push the boundaries of data lakehouse capabilities. Exciting developments like the Dremio Hybrid Catalog and updates to delete file specifications promise to make Iceberg even more efficient, scalable, and interoperable.&lt;/p&gt;
&lt;p&gt;These innovations highlight the vibrant community driving Apache Iceberg and the collective effort to address the evolving needs of modern data platforms. Whether you&apos;re leveraging Iceberg for its robust cataloging features, seamless multi-cloud support, or cutting-edge query capabilities, 2025 is shaping up to be a year of remarkable growth and opportunity. Stay tuned as Apache Iceberg continues to lead the way in open data lakehouse technology, empowering organizations to unlock the full potential of their data.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-developments&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-developments&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-developments&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-developments&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Deep Dive into Dremio&apos;s File-based Auto Ingestion into Apache Iceberg Tables</title><link>https://iceberglakehouse.com/posts/2024-11-deep-dive-auto-ingest-dremio-iceberg/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-11-deep-dive-auto-ingest-dremio-iceberg/</guid><description>
- [Blog: What is a Data Lakehouse and a Table Format?](https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-f...</description><pubDate>Fri, 15 Nov 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=autoingestdremio&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=autoingestdremio&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=autoingestdremio&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=autoingestdremio&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Manually orchestrating data pipelines to handle ever-increasing volumes of data can be both time-consuming and error-prone. Enter &lt;strong&gt;Dremio Auto-Ingest&lt;/strong&gt;, a game-changing feature that simplifies the process of loading data into &lt;strong&gt;Apache Iceberg&lt;/strong&gt; tables.&lt;/p&gt;
&lt;p&gt;With Auto-Ingest, you can create event-driven pipelines that automatically respond to changes in your object storage systems, such as new files being uploaded to Amazon S3. This approach eliminates the need for constant manual intervention, enabling real-time or near-real-time updates to your Iceberg tables. Whether you’re ingesting structured CSV data, semi-structured JSON files, or compact Parquet formats, Dremio Auto-Ingest ensures a seamless, reliable pipeline.&lt;/p&gt;
&lt;p&gt;But why choose Auto-Ingest over traditional methods? The answer lies in its ability to handle ingestion challenges like deduplication, error handling, and custom formatting, all while integrating smoothly with modern cloud infrastructure.&lt;/p&gt;
&lt;h2&gt;Understanding Auto-Ingest for Apache Iceberg&lt;/h2&gt;
&lt;p&gt;To fully appreciate the power of Dremio Auto-Ingest, it’s important to understand the core components and how they work together. At its heart, Auto-Ingest is designed to create a seamless pipeline that transfers files from object storage into &lt;strong&gt;Apache Iceberg tables&lt;/strong&gt; with minimal manual intervention. Let’s break it down.&lt;/p&gt;
&lt;h3&gt;What is a Pipe Object?&lt;/h3&gt;
&lt;p&gt;The &lt;strong&gt;pipe object&lt;/strong&gt; is the central feature enabling Auto-Ingest. Think of it as a pre-configured connection between your cloud storage and an Iceberg table. The pipe listens for events, such as the arrival of a new file, and automatically triggers the ingestion process. This eliminates the need for periodic manual data loads or complex batch scripts.&lt;/p&gt;
&lt;p&gt;Here’s what makes a pipe object powerful:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Notification Provider&lt;/strong&gt;: Specifies the mechanism for event detection, such as AWS SQS for Amazon S3.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Notification Queue Reference&lt;/strong&gt;: Points to the event queue where file changes are registered.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deduplication&lt;/strong&gt;: Ensures no duplicate files are ingested, even if files are re-uploaded or processed multiple times.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flexible Configuration&lt;/strong&gt;: Allows you to define file formats, custom settings, and error-handling rules.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;How Does Auto-Ingest Work?&lt;/h3&gt;
&lt;p&gt;Auto-Ingest leverages an &lt;strong&gt;event-driven model&lt;/strong&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A file is added or updated in the storage location (e.g., an S3 bucket).&lt;/li&gt;
&lt;li&gt;A notification is sent to the queue specified in the pipe configuration.&lt;/li&gt;
&lt;li&gt;The pipe detects the notification and triggers the ingestion process using the &lt;code&gt;COPY INTO&lt;/code&gt; command to move data into the Iceberg table.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This approach is both reactive and efficient, ensuring that your data remains fresh without the overhead of constant polling or manual triggers.&lt;/p&gt;
&lt;h3&gt;Benefits of Using Auto-Ingest&lt;/h3&gt;
&lt;p&gt;Why choose Auto-Ingest for your Iceberg tables? Here are some key benefits:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Updates&lt;/strong&gt;: Ensure your Iceberg tables always reflect the latest data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Simplified Pipeline Management&lt;/strong&gt;: Replace complex, custom ingestion scripts with a single declarative configuration.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Quality Assurance&lt;/strong&gt;: Built-in deduplication and error-handling mechanisms help maintain clean, accurate datasets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Auto-Ingest works seamlessly with cloud-native object storage, enabling pipelines that scale with your data.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By combining the power of Apache Iceberg with Dremio’s Auto-Ingest, you can build modern, efficient pipelines that support both analytical and operational workloads with ease.&lt;/p&gt;
&lt;h2&gt;Step-by-Step Guide: Setting Up Auto-Ingest&lt;/h2&gt;
&lt;p&gt;By following these steps, you can automate data ingestion from cloud storage and ensure seamless integration with your data lakehouse.&lt;/p&gt;
&lt;h3&gt;1. Prerequisites&lt;/h3&gt;
&lt;p&gt;Before creating an Auto-Ingest pipeline, ensure the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cloud Storage Setup&lt;/strong&gt;: Configure your storage location (e.g., Amazon S3) as a source in Dremio.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Notification Service&lt;/strong&gt;: Set up an event notification provider, such as AWS SQS, to monitor changes in the storage location.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Iceberg Table&lt;/strong&gt;: Ensure the target table exists and is properly configured in Dremio.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Supported File Formats&lt;/strong&gt;: Verify that your files are in one of the supported formats: CSV, JSON, or Parquet.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. Creating a Pipe Object&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;CREATE PIPE&lt;/code&gt; command is the foundation of the Auto-Ingest setup. It connects your storage location to an Iceberg table, specifying ingestion parameters.&lt;/p&gt;
&lt;h4&gt;Syntax&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE [ IF NOT EXISTS ] &amp;lt;pipe_name&amp;gt;
  [ DEDUPE_LOOKBACK_PERIOD &amp;lt;number_of_days&amp;gt; ]
  NOTIFICATION_PROVIDER &amp;lt;notification_provider&amp;gt;
  NOTIFICATION_QUEUE_REFERENCE &amp;lt;notification_queue_ref&amp;gt;
  AS COPY INTO &amp;lt;table_name&amp;gt;
    [ AT BRANCH &amp;lt;branch_name&amp;gt; ]
    FROM &apos;@&amp;lt;storage_location_name&amp;gt;&apos;
    FILE_FORMAT &apos;&amp;lt;format&amp;gt;&apos;
    [(&amp;lt;format_options&amp;gt;)]
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Key Parameters&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;DEDUPE_LOOKBACK_PERIOD:&lt;/code&gt;&lt;/strong&gt; Defines the time window (in days) for deduplication. Default is 14 days.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;NOTIFICATION_PROVIDER:&lt;/code&gt;&lt;/strong&gt; Specifies the event notification system, such as AWS_SQS for Amazon S3.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;NOTIFICATION_QUEUE_REFERENCE:&lt;/code&gt;&lt;/strong&gt; Points to the notification queue (e.g., the ARN of an SQS queue).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;COPY INTO:&lt;/code&gt;&lt;/strong&gt; Specifies the target Iceberg table and optional branch.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;@&amp;lt;storage_location_name&amp;gt;:&lt;/code&gt;&lt;/strong&gt; Refers to the source storage location configured in Dremio.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File Format Options:&lt;/strong&gt; Custom configurations for CSV, JSON, or Parquet files.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Examples&lt;/h4&gt;
&lt;p&gt;Basic Pipe for CSV Files&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE my_pipe
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-east-1:123456789012:my-queue&apos;
  AS COPY INTO sales_data
    FROM &apos;@s3_source/data_folder&apos;
    FILE_FORMAT &apos;csv&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pipe with Deduplication&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE deduped_pipe
  DEDUPE_LOOKBACK_PERIOD 7
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-east-1:123456789012:dedupe-queue&apos;
  AS COPY INTO analytics_table
    FROM &apos;@s3_source/analytics&apos;
    FILE_FORMAT &apos;parquet&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Customizing File Formats&lt;/h3&gt;
&lt;p&gt;Dremio allows you to tailor the ingestion process based on your file type and data requirements. Here’s how to configure each format:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;CSV Options:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Delimiters (&lt;code&gt;FIELD_DELIMITER&lt;/code&gt;)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Null handling (&lt;code&gt;EMPTY_AS_NULL&lt;/code&gt;)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Header extraction (&lt;code&gt;EXTRACT_HEADER&lt;/code&gt;)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Error handling (&lt;code&gt;ON_ERROR&lt;/code&gt;)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;JSON Options:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Date and time formatting (&lt;code&gt;DATE_FORMAT&lt;/code&gt;, &lt;code&gt;TIME_FORMAT&lt;/code&gt;)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Null replacements (&lt;code&gt;NULL_IF&lt;/code&gt;)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Parquet Options:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Simplified setup with error handling (&lt;code&gt;ON_ERROR&lt;/code&gt;)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example for CSV with custom settings:&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE custom_csv_pipe
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-east-1:123456789012:csv-queue&apos;
  AS COPY INTO transactions_table
    FROM &apos;@s3_source/csv_data&apos;
    FILE_FORMAT &apos;csv&apos;
    (FIELD_DELIMITER &apos;|&apos;, EXTRACT_HEADER &apos;true&apos;, ON_ERROR &apos;skip_file&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Error Handling&lt;/h3&gt;
&lt;p&gt;Errors during ingestion are inevitable, but Dremio’s Auto-Ingest provides robust handling options:&lt;/p&gt;
&lt;h4&gt;ON_ERROR Options:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;abort:&lt;/strong&gt; Stops the process at the first error (default for JSON and Parquet).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;continue:&lt;/strong&gt; Skips faulty rows but processes valid ones (CSV only).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;skip_file:&lt;/strong&gt; Skips the entire file if any error occurs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE error_handling_pipe
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-east-1:123456789012:error-queue&apos;
  AS COPY INTO error_log_table
    FROM &apos;@s3_source/faulty_data&apos;
    FILE_FORMAT &apos;json&apos;
    (ON_ERROR &apos;skip_file&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With your pipe configured, Dremio automatically monitors your storage for changes and ingests new files into the target Iceberg table. This setup provides a scalable, reliable pipeline for all your data ingestion needs.&lt;/p&gt;
&lt;h2&gt;Real-World Use Cases for Dremio Auto-Ingest&lt;/h2&gt;
&lt;p&gt;Dremio’s Auto-Ingest for Apache Iceberg tables offers significant advantages across a variety of data engineering scenarios. Whether you’re building real-time pipelines or automating batch data processing, Auto-Ingest provides the flexibility and automation necessary to simplify workflows. Here are some real-world use cases to illustrate its impact.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;1. &lt;strong&gt;Streaming Data Pipelines&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: A smart city project collects real-time sensor data (e.g., temperature, traffic flow, air quality) from IoT devices. This data is stored as JSON files in an S3 bucket, and analytics teams require instant updates in their data warehouse for real-time dashboards.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use Dremio Auto-Ingest with a pipe object that listens to the S3 bucket.&lt;/li&gt;
&lt;li&gt;Configure the pipe to process JSON files and load them into an Iceberg table.&lt;/li&gt;
&lt;li&gt;Leverage &lt;code&gt;ON_ERROR&lt;/code&gt; settings to gracefully handle malformed sensor data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Example Configuration&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE streaming_pipe
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-west-2:123456789012:sensor-queue&apos;
  AS COPY INTO smart_city.sensor_data
    FROM &apos;@iot_source/live_data&apos;
    FILE_FORMAT &apos;json&apos;
    (ON_ERROR &apos;skip_file&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Real-time dashboards reflect the latest sensor data without manual intervention.&lt;/li&gt;
&lt;li&gt;Faulty data is isolated for later analysis, ensuring system stability.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. Batch Data Processing&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; A retail company ingests daily sales logs in CSV format from its regional branches into a central data lake. These logs must be processed nightly and appended to a historical sales Iceberg table.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Configure an Auto-Ingest pipe to monitor the S3 bucket where sales logs are uploaded.&lt;/li&gt;
&lt;li&gt;Set a deduplication lookback period to avoid reprocessing files if logs are accidentally re-uploaded.
Example Configuration:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE daily_batch_pipe
  DEDUPE_LOOKBACK_PERIOD 7
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-east-1:123456789012:sales-queue&apos;
  AS COPY INTO retail.sales_history
    FROM &apos;@s3_source/sales_logs&apos;
    FILE_FORMAT &apos;csv&apos;
    (EXTRACT_HEADER &apos;true&apos;, EMPTY_AS_NULL &apos;true&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Daily sales logs are automatically appended to the historical table.&lt;/li&gt;
&lt;li&gt;The deduplication window ensures no duplicate records are ingested.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Data Lakehouse Modernization&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; A financial services firm is transitioning from a traditional data warehouse to a modern lakehouse architecture. The team wants to automate ingestion from various sources (e.g., transactional Parquet files and JSON logs) into Iceberg tables for unified analytics.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Use multiple Auto-Ingest pipes to handle ingestion for different file types and schemas.
Configure branch-specific ingestion for staging and production environments.
Example Configuration:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Parquet Transactions:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE transactions_pipe
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-east-2:123456789012:transactions-queue&apos;
  AS COPY INTO finance.transactions
    FROM &apos;@finance_source/transactions&apos;
    FILE_FORMAT &apos;parquet&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;JSON Application Logs:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;Copy code
CREATE PIPE logs_pipe
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-east-2:123456789012:logs-queue&apos;
  AS COPY INTO finance.app_logs
    FROM &apos;@logs_source/application&apos;
    FILE_FORMAT &apos;json&apos;
    (DATE_FORMAT &apos;YYYY-MM-DD&apos;, TIME_FORMAT &apos;HH24:MI:SS&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Unified, structured Iceberg tables ready for analytical queries.&lt;/li&gt;
&lt;li&gt;Improved agility with automated pipelines for different data sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Event-Driven Reporting&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; A marketing team tracks user engagement metrics (e.g., clicks, time on site, purchases) stored as CSV files in real-time. Reports must be updated immediately after new data arrives.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Use an Auto-Ingest pipe with an AWS_SQS notification provider to ensure new engagement files are ingested as soon as they are uploaded.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example Configuration:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE engagement_pipe
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-west-1:123456789012:engagement-queue&apos;
  AS COPY INTO marketing.user_engagement
    FROM &apos;@engagement_source/metrics&apos;
    FILE_FORMAT &apos;csv&apos;
    (FIELD_DELIMITER &apos;,&apos;, EXTRACT_HEADER &apos;true&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Marketing reports are updated in near-real-time, enabling faster decision-making.&lt;/li&gt;
&lt;li&gt;Automated ingestion removes the need for manual ETL processes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These use cases showcase how Dremio Auto-Ingest can be a versatile and powerful tool for a wide range of data engineering challenges. Whether your focus is on real-time data processing, batch workflows, or transitioning to a lakehouse architecture, Auto-Ingest simplifies and enhances your pipeline capabilities.&lt;/p&gt;
&lt;h2&gt;Best Practices and Considerations for Dremio Auto-Ingest&lt;/h2&gt;
&lt;p&gt;To get the most out of Dremio Auto-Ingest for Apache Iceberg tables, it&apos;s essential to follow best practices and understand key considerations. These guidelines will help ensure your ingestion pipelines are reliable, efficient, and optimized for performance.&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Optimize Deduplication Settings&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;What It Does&lt;/strong&gt;: The &lt;code&gt;DEDUPE_LOOKBACK_PERIOD&lt;/code&gt; parameter ensures that duplicate files (e.g., files with the same name uploaded multiple times) are not ingested repeatedly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Best Practices&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Set an appropriate lookback period based on your ingestion frequency:
&lt;ul&gt;
&lt;li&gt;For high-frequency updates (e.g., hourly ingestion), a shorter period (1–3 days) is sufficient.&lt;/li&gt;
&lt;li&gt;For batch workflows with periodic reuploads, a longer window (7–14 days) may be needed.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Avoid setting the period to &lt;code&gt;0&lt;/code&gt; unless you are certain duplicates are not an issue, as it disables deduplication.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE deduped_pipe
  DEDUPE_LOOKBACK_PERIOD 7
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-east-1:123456789012:dedupe-queue&apos;
  AS COPY INTO my_table
    FROM &apos;@s3_source/folder&apos;
    FILE_FORMAT &apos;json&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Organize Storage for Better Performance&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt; Properly structured storage locations improve ingestion speed and reduce processing overhead.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Best Practices:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use folder-based organization in your storage buckets (e.g., &lt;code&gt;/year/month/day/&lt;/code&gt;) for easier file management and regex-based ingestion.&lt;/li&gt;
&lt;li&gt;Keep related files in the same folder to avoid ingesting unrelated data by mistake.&lt;/li&gt;
&lt;li&gt;Avoid deeply nested directory structures, as they can slow down file scanning.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Choose the Right File Format&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Impact of File Format:&lt;/strong&gt; Different file formats affect storage size, query performance, and ingestion speed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Best Practices:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use Parquet for columnar storage and analytics-heavy workloads due to its efficient storage and compression.&lt;/li&gt;
&lt;li&gt;Opt for CSV or JSON for semi-structured data but ensure proper formatting (e.g., consistent delimiters, headers, and escaping).&lt;/li&gt;
&lt;li&gt;Test ingestion performance with small sample files before committing to large-scale pipelines.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Leverage Error Handling Options&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt; Errors during ingestion can interrupt pipelines or lead to data inconsistencies.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Best Practices:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;ON_ERROR &apos;skip_file&apos;&lt;/code&gt; to bypass files with errors and prevent pipeline interruptions.&lt;/li&gt;
&lt;li&gt;Regularly monitor the &lt;code&gt;sys.copy_errors_history&lt;/code&gt; table for ingestion errors and address recurring issues.&lt;/li&gt;
&lt;li&gt;For non-critical pipelines, consider &lt;code&gt;ON_ERROR &apos;continue&apos;&lt;/code&gt; (CSV only) to process valid rows even if some are faulty.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE error_handling_pipe
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-east-1:123456789012:error-queue&apos;
  AS COPY INTO my_table
    FROM &apos;@s3_source/folder&apos;
    FILE_FORMAT &apos;csv&apos;
    (ON_ERROR &apos;continue&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;5. Monitor and Troubleshoot Pipelines&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Monitoring Tools:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;System Tables:&lt;/strong&gt; Query sys.copy_errors_history to review errors during ingestion.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Job Logs:&lt;/strong&gt; Check job logs in Dremio for detailed error messages and ingestion stats.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Common Troubleshooting Tips:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Notification Issues:&lt;/strong&gt; Ensure the SQS queue ARN matches the one specified in the &lt;code&gt;NOTIFICATION_QUEUE_REFERENCE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File Format Mismatches:&lt;/strong&gt; Double-check that the specified file format aligns with the actual file type (e.g., don’t label a Parquet file as CSV).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deduplication Failures:&lt;/strong&gt; Verify that the deduplication period is set correctly and files aren’t inadvertently re-ingested due to naming conflicts.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;6. Optimize Regex and File Selection&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt; Using overly broad regex patterns or processing unnecessary files can impact pipeline performance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Best Practices:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Write regex patterns that are as specific as possible to match only the files you need.
Avoid processing large directories unless required. Use the &lt;code&gt;FILES&lt;/code&gt; clause or specific folder paths to limit scope.
Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE regex_pipe
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-west-2:123456789012:regex-queue&apos;
  AS COPY INTO my_table
    FROM &apos;@s3_source/folder&apos;
    REGEX &apos;^2024/11/.*.csv&apos;
    FILE_FORMAT &apos;csv&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;7. Plan for Schema Evolution&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt; Iceberg tables support schema evolution, but it’s crucial to manage changes thoughtfully to avoid ingestion failures.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Best Practices:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Test schema changes in a staging environment before applying them to production pipelines.&lt;/li&gt;
&lt;li&gt;Use Iceberg’s branching capabilities to isolate schema updates during development.&lt;/li&gt;
&lt;li&gt;Validate data types and formats in source files to avoid mismatches with the target table schema.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;8. Integrate with Data Lakehouse Workflows&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt; Auto-Ingest simplifies transitioning to a lakehouse architecture, but aligning with broader workflows ensures smooth integration.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Best Practices:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Combine Auto-Ingest with Dremio’s SQL-based querying to enable seamless analytics on ingested data.&lt;/li&gt;
&lt;li&gt;Use Iceberg’s time-travel feature to track historical changes and validate pipeline performance over time.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By following these best practices and considerations, you can ensure your Dremio Auto-Ingest pipelines are robust, efficient, and well-suited to your data engineering needs. These guidelines will help you avoid common pitfalls and fully leverage the power of automated ingestion for Apache Iceberg tables.&lt;/p&gt;
&lt;h2&gt;Troubleshooting and Debugging Auto-Ingest Pipelines&lt;/h2&gt;
&lt;p&gt;Even with a robust Auto-Ingest setup, you may encounter issues during the ingestion process. Dremio’s system tables, such as &lt;code&gt;SYS.COPY_ERRORS_HISTORY&lt;/code&gt;, provide detailed insights into ingestion errors, making it easier to diagnose and resolve problems. This section outlines common issues and how to effectively use the system table to debug your pipelines.&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Common Issues and Resolutions&lt;/strong&gt;&lt;/h3&gt;
&lt;h4&gt;&lt;strong&gt;Notification Configuration Problems&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Symptom&lt;/strong&gt;: The pipe does not respond to new files being uploaded.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Resolution&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Verify the &lt;code&gt;NOTIFICATION_PROVIDER&lt;/code&gt; is configured correctly (e.g., &lt;code&gt;AWS_SQS&lt;/code&gt; for S3).&lt;/li&gt;
&lt;li&gt;Ensure the &lt;code&gt;NOTIFICATION_QUEUE_REFERENCE&lt;/code&gt; points to the correct ARN of your event notification queue.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;File Format Mismatch&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Symptom&lt;/strong&gt;: The pipeline fails with file parsing errors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Resolution&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Double-check that the &lt;code&gt;FILE_FORMAT&lt;/code&gt; in your pipe configuration matches the actual format of the uploaded files.&lt;/li&gt;
&lt;li&gt;Validate format-specific options (e.g., delimiter, null handling) for correctness.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;Partial or Skipped File Loads&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Symptom&lt;/strong&gt;: Some files are partially loaded or skipped entirely.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Resolution&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Use the &lt;code&gt;SYS.COPY_ERRORS_HISTORY&lt;/code&gt; table to identify problematic files and the reasons for rejection.&lt;/li&gt;
&lt;li&gt;Adjust error-handling options (&lt;code&gt;ON_ERROR&lt;/code&gt;) in your pipe to match your tolerance for bad records.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. &lt;strong&gt;Using the &lt;code&gt;SYS.COPY_ERRORS_HISTORY&lt;/code&gt; Table&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;SYS.COPY_ERRORS_HISTORY&lt;/code&gt; table logs detailed information about &lt;code&gt;COPY INTO&lt;/code&gt; jobs where records were rejected due to parsing or schema issues. This includes jobs configured with &lt;code&gt;ON_ERROR &apos;continue&apos;&lt;/code&gt; or &lt;code&gt;ON_ERROR &apos;skip_file&apos;&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Key Columns in the Table&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;executed_at&lt;/code&gt;&lt;/strong&gt;: The timestamp when the job was executed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;job_id&lt;/code&gt;&lt;/strong&gt;: The unique identifier of the &lt;code&gt;COPY INTO&lt;/code&gt; job.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;table_name&lt;/code&gt;&lt;/strong&gt;: The target Iceberg table for the job.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;user_name&lt;/code&gt;&lt;/strong&gt;: The username of the individual who ran the job.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;file_path&lt;/code&gt;&lt;/strong&gt;: The path of the file with rejected records.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;file_state&lt;/code&gt;&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;PARTIALLY_LOADED&lt;/code&gt;: Some records were loaded, but others were rejected.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SKIPPED&lt;/code&gt;: No records were loaded due to file-level errors.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;records_loaded_count&lt;/code&gt;&lt;/strong&gt;: The number of successfully ingested records.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;records_rejected_count&lt;/code&gt;&lt;/strong&gt;: The number of records rejected due to errors.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;Example Query: Identifying Problematic Files&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;To view details about rejected files for a specific table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT executed_at, job_id, file_path, file_state, records_rejected_count
FROM SYS.COPY_ERRORS_HISTORY
WHERE table_name = &apos;my_table&apos;
ORDER BY executed_at DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;This query highlights:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When ingestion errors occurred.&lt;/li&gt;
&lt;li&gt;Which files were affected.&lt;/li&gt;
&lt;li&gt;Whether files were partially loaded or skipped.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Drilling into Error Details&lt;/h3&gt;
&lt;p&gt;Once you identify a problematic job using the job_id, you can use the &lt;code&gt;copy_errors()&lt;/code&gt; function to extract detailed error information.&lt;/p&gt;
&lt;p&gt;Example: Retrieving Error Details&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT *
FROM copy_errors(&apos;1aacb195-ca94-ec4c-2b01-ecddac81a900&apos;, &apos;my_table&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query provides granular information about errors encountered during the ingestion process for the specified job.&lt;/p&gt;
&lt;h3&gt;4. Best Practices for Debugging&lt;/h3&gt;
&lt;h4&gt;Proactive Monitoring&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Regularly query the &lt;code&gt;SYS.COPY_ERRORS_HISTORY&lt;/code&gt; table to track ingestion health.&lt;/li&gt;
&lt;li&gt;Set alerts for high records_rejected_count values to identify recurring issues.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Validate Source Data&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Audit source files for schema inconsistencies or formatting errors.&lt;/li&gt;
&lt;li&gt;Ensure files match the expected format (e.g., proper delimiters for CSV).&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Tuning Error Handling&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;ON_ERROR &apos;skip_file&apos;&lt;/code&gt; for critical pipelines where partial loads are unacceptable.&lt;/li&gt;
&lt;li&gt;Opt for &lt;code&gt;ON_ERROR &apos;continue&apos;&lt;/code&gt; in cases where maximum data recovery is desired, especially for CSV files.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Housekeeping the System Table&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;SYS.COPY_ERRORS_HISTORY&lt;/code&gt; table can grow significantly over time. Manage its size using these configuration keys:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;dremio.system_iceberg_tables.record_lifespan_in_millis&lt;/code&gt;: Retains history for a specified number of days (default is 7).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dremio.system_iceberg_tables.housekeeping_thread_frequency_in_millis&lt;/code&gt;: Controls how frequently old records are removed (default is daily).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. Common Query Patterns for Debugging&lt;/h3&gt;
&lt;p&gt;Find Recently Skipped Files&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT file_path, file_state, records_rejected_count
FROM SYS.COPY_ERRORS_HISTORY
WHERE file_state = &apos;SKIPPED&apos;
ORDER BY executed_at DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Analyze Partially Loaded Files&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT file_path, records_loaded_count, records_rejected_count
FROM SYS.COPY_ERRORS_HISTORY
WHERE file_state = &apos;PARTIALLY_LOADED&apos;
ORDER BY executed_at DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By leveraging the &lt;code&gt;SYS.COPY_ERRORS_HISTORY&lt;/code&gt; table and related debugging tools, you can effectively monitor and resolve issues in your Auto-Ingest pipelines. These capabilities ensure your pipelines are resilient and capable of handling a wide variety of data ingestion scenarios with minimal disruption.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Dremio Auto-Ingest for Apache Iceberg tables brings a new level of automation and simplicity to data ingestion workflows. By leveraging event-driven pipelines, you can reduce manual intervention, ensure data freshness, and streamline the integration of your object storage systems with Iceberg tables.&lt;/p&gt;
&lt;p&gt;From real-time updates to batch processing, Auto-Ingest handles diverse use cases with ease, offering powerful features like deduplication, error handling, and format-specific customization. By following best practices, monitoring your pipelines, and troubleshooting effectively, you can create reliable and efficient data ingestion workflows that scale with your business needs.&lt;/p&gt;
&lt;p&gt;Whether you&apos;re modernizing your data lakehouse architecture or building advanced analytics pipelines, Dremio Auto-Ingest is a must-have tool to unlock the full potential of your data.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=autoingestdremio&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=autoingestdremio&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=autoingestdremio&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=autoingestdremio&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Intro to SQL using Apache Iceberg and Dremio</title><link>https://iceberglakehouse.com/posts/2024-11-intro-to-sql-with-dremio-and-apache-iceberg/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-11-intro-to-sql-with-dremio-and-apache-iceberg/</guid><description>
- [Blog: What is a Data Lakehouse and a Table Format?](https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-f...</description><pubDate>Fri, 08 Nov 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=introtosql&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=introtosql&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=introtosql&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=introtosql&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;SQL (Structured Query Language) has long been the standard for interacting with data, providing a powerful and accessible language for data querying and manipulation. However, traditional data warehouses and databases often fall short when dealing with the scale and flexibility demanded by modern data workloads.&lt;/p&gt;
&lt;p&gt;This is where Apache Iceberg and Dremio come in. Apache Iceberg is an open table format designed for large-scale data lakes, enabling reliable data management with features like ACID transactions, schema evolution, and time-travel. Iceberg brings structure and governance to data lakes, making them more capable of handling enterprise data needs. Dremio, on the other hand, is a data lakehouse platform that brings SQL querying capabilities to data lakes, providing a unified interface to query and analyze data across various sources.&lt;/p&gt;
&lt;p&gt;By the end of this tutorial, you&apos;ll understand the basics of SQL in Dremio and how to perform essential data operations with Apache Iceberg tables.&lt;/p&gt;
&lt;h2&gt;What is SQL, Apache Iceberg, and Dremio, and Why They Matter&lt;/h2&gt;
&lt;h3&gt;What is SQL?&lt;/h3&gt;
&lt;p&gt;SQL, or Structured Query Language, is a language specifically designed for managing and querying data in relational databases. Its versatility and power make it ideal for a wide range of data operations, including data extraction, aggregation, and transformation. SQL&apos;s widespread use in data analysis and reporting has made it a cornerstone in the world of data management.&lt;/p&gt;
&lt;h3&gt;What is Apache Iceberg?&lt;/h3&gt;
&lt;p&gt;Apache Iceberg is an open-source table format that brings structure and governance to data lakes. Designed with scalability in mind, Iceberg offers features such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ACID Transactions&lt;/strong&gt;: Ensuring data consistency across large datasets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time-Travel&lt;/strong&gt;: Querying historical versions of data, which is essential for audits and analysis.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema Evolution&lt;/strong&gt;: Modifying table schemas without disrupting ongoing operations.
Iceberg’s approach to data management provides a reliable foundation for large-scale analytics and data processing, making it a valuable component in any data lakehouse architecture.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;What is Dremio?&lt;/h3&gt;
&lt;p&gt;Dremio is a data lakehouse platform that unifies data access, enabling users to perform SQL queries across data lakes, warehouses, and other data sources through a single, user-friendly interface. Dremio simplifies data analytics by providing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Unified Semantic Layer&lt;/strong&gt;: Organizes and documents datasets for easier discovery and analysis.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Support for Apache Iceberg&lt;/strong&gt;: Seamless integration with Iceberg tables, allowing users to query and manipulate large datasets with SQL.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Versioning and Governance&lt;/strong&gt;: Through integrations with Nessie, Dremio supports versioned, Git-like data management, making it ideal for maintaining data accuracy and history.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Why They Matter Together&lt;/h3&gt;
&lt;p&gt;When combined, SQL, Apache Iceberg, and Dremio offer a powerful solution for data management and analysis. SQL provides the querying foundation, Apache Iceberg delivers the scalability and governance, and Dremio brings everything together in a streamlined, accessible environment. For businesses looking to harness the full potential of their data lakes, this stack delivers efficient querying, advanced data governance, and high performance.&lt;/p&gt;
&lt;p&gt;Let&apos;s set up an environment to work with these tools and walk through practical examples of using SQL with Apache Iceberg tables in Dremio.&lt;/p&gt;
&lt;h2&gt;Setting Up an Environment with Dremio, Nessie, and MinIO with Docker Compose&lt;/h2&gt;
&lt;p&gt;To start working with Apache Iceberg and Dremio, we&apos;ll set up a local environment using Docker Compose, a tool that allows us to configure and manage multiple containers with a single file. In this setup, we&apos;ll use:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio&lt;/strong&gt; as the query engine for our data lakehouse.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Nessie&lt;/strong&gt; as the catalog for versioned data management.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MinIO&lt;/strong&gt; as S3-compatible storage to hold our data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This environment will give us a powerful foundation to perform SQL operations on Apache Iceberg tables with Dremio.&lt;/p&gt;
&lt;h3&gt;Prerequisites&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Docker&lt;/strong&gt;: Ensure Docker is installed on your machine. You can download it from &lt;a href=&quot;https://www.docker.com/&quot;&gt;Docker&apos;s official website&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Docker Compose&lt;/strong&gt;: Typically included with Docker Desktop on Windows and macOS; on Linux, it may require separate installation.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 1: Create a Docker Compose File&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Open a text editor of your choice (such as VS Code, Notepad, or Sublime Text).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create a new file named &lt;code&gt;docker-compose.yml&lt;/code&gt; in a new, empty folder. This file will define the services and configurations needed for our environment.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Copy and paste the following configuration into &lt;code&gt;docker-compose.yml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;version: &amp;quot;3&amp;quot;

services:
  # Nessie Catalog Server Using In-Memory Store
  nessie:
    image: projectnessie/nessie:latest
    container_name: nessie
    networks:
      - iceberg
    ports:
      - 19120:19120

  # MinIO Storage Server
  ## Creates two buckets named lakehouse and lake
  minio:
    image: minio/minio:latest
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
    networks:
      - iceberg
    ports:
      - 9001:9001
      - 9000:9000
    command: [&amp;quot;server&amp;quot;, &amp;quot;/data&amp;quot;, &amp;quot;--console-address&amp;quot;, &amp;quot;:9001&amp;quot;]
    entrypoint: &amp;gt;
      /bin/sh -c &amp;quot;
      minio server /data --console-address &apos;:9001&apos; &amp;amp;
      sleep 5 &amp;amp;&amp;amp;
      mc alias set myminio http://localhost:9000 admin password &amp;amp;&amp;amp;
      mc mb myminio/lakehouse &amp;amp;&amp;amp;
      mc mb myminio/lake &amp;amp;&amp;amp;
      tail -f /dev/null
      &amp;quot;

  # Dremio
  dremio:
    platform: linux/x86_64
    image: dremio/dremio-oss:latest
    ports:
      - 9047:9047
      - 31010:31010
      - 32010:32010
    container_name: dremio
    environment:
      - DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dpaths.dist=file:///opt/dremio/data/dist
    networks:
      - iceberg

networks:
  iceberg:
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Explanation of the Services&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Nessie&lt;/strong&gt;: Acts as the catalog for Iceberg tables, providing version control for data through branching and merging.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MinIO&lt;/strong&gt;: Stores data in buckets, simulating an S3-compatible environment. We configure two buckets, &lt;code&gt;lakehouse&lt;/code&gt; and &lt;code&gt;lake&lt;/code&gt;, to separate structured Iceberg data from raw data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio&lt;/strong&gt;: The engine for querying data stored in Iceberg tables on MinIO. Dremio will allow us to use SQL for managing and analyzing our data.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 2: Start the Environment&lt;/h3&gt;
&lt;p&gt;With the &lt;code&gt;docker-compose.yml&lt;/code&gt; file ready, follow these steps to launch the environment:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Open a terminal (Command Prompt, PowerShell, or terminal app) and navigate to the folder where you saved &lt;code&gt;docker-compose.yml&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Run the following command to start all services in detached mode:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose up -d
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Wait a few moments for the services to initialize. You can check if the services are running by using:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker ps
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command should list &lt;code&gt;nessie&lt;/code&gt;, &lt;code&gt;minio&lt;/code&gt;, and &lt;code&gt;dremio&lt;/code&gt; as running containers.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 3: Verify Each Service&lt;/h3&gt;
&lt;p&gt;After starting the containers, verify that each service is accessible:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio&lt;/strong&gt;: Open a browser and go to &lt;code&gt;http://localhost:9047&lt;/code&gt;. You should see the Dremio login screen.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MinIO&lt;/strong&gt;: In a new browser tab, go to &lt;code&gt;http://localhost:9001&lt;/code&gt;. Log in with the username &lt;code&gt;admin&lt;/code&gt; and password &lt;code&gt;password&lt;/code&gt; to access the MinIO console.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Nessie&lt;/strong&gt;: Nessie doesn’t have a direct UI in this setup, but you can interact with it through Dremio, as we’ll cover in later sections.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 4: Optional - Shutting Down the Environment&lt;/h3&gt;
&lt;p&gt;To stop the environment when you&apos;re done, run the following command in the same folder as your &lt;code&gt;docker-compose.yml&lt;/code&gt; file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose down -v
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command stops and removes all containers and associated volumes, allowing you to start fresh next time.&lt;/p&gt;
&lt;p&gt;With our environment up and running, we’re ready to start using Dremio to create and manage Apache Iceberg tables. In the next section, we’ll explore how to connect Nessie to Dremio and begin querying our data.&lt;/p&gt;
&lt;h2&gt;Accessing Dremio and Connecting Nessie&lt;/h2&gt;
&lt;p&gt;Now that our environment is up and running, let’s connect to Dremio, which will act as our query engine, and configure Nessie as a source catalog. This setup will allow us to take advantage of Apache Iceberg’s versioned data management and perform SQL operations in a streamlined, unified environment.&lt;/p&gt;
&lt;h3&gt;Step 1: Accessing Dremio&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Open Dremio in Your Browser&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Go to &lt;code&gt;http://localhost:9047&lt;/code&gt; in your browser. You should see the Dremio login screen.&lt;/li&gt;
&lt;li&gt;If this is your first time setting up Dremio, you may need to create an admin user. Follow the on-screen instructions to set up your login credentials.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Familiarize Yourself with Dremio’s Interface&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;After logging in, explore Dremio’s main interface. Key areas include:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SQL Runner&lt;/strong&gt;: Where you can run SQL queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Datasets&lt;/strong&gt;: A section for browsing and managing tables, views, and sources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Jobs&lt;/strong&gt;: A log of executed queries and their performance metrics.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The SQL Runner will be our primary workspace for running queries and interacting with Apache Iceberg tables.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 2: Connecting Nessie as a Catalog in Dremio&lt;/h3&gt;
&lt;p&gt;Nessie acts as the catalog for our Iceberg tables, enabling us to manage data with version control features such as branching and merging. Let’s add Nessie as a source in Dremio.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Add a New Source&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In Dremio, click on the &lt;strong&gt;Add Source&lt;/strong&gt; button in the lower left corner of the interface.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configure the Nessie Source&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Select &lt;strong&gt;Nessie&lt;/strong&gt; from the list of source types.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Enter Nessie Connection Settings&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;General Settings&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name&lt;/strong&gt;: Enter a name for the source, such as &lt;code&gt;lakehouse&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Endpoint URL&lt;/strong&gt;: Enter the endpoint for the Nessie API:&lt;pre&gt;&lt;code&gt;http://nessie:19120/api/v2
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authentication&lt;/strong&gt;: Choose &lt;strong&gt;None&lt;/strong&gt; (since Nessie is running locally and does not require additional credentials in this setup).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage Settings&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Access Key&lt;/strong&gt;: Enter &lt;code&gt;admin&lt;/code&gt; (the MinIO username).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secret Key&lt;/strong&gt;: Enter &lt;code&gt;password&lt;/code&gt; (the MinIO password).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Root Path&lt;/strong&gt;: Enter &lt;code&gt;lakehouse&lt;/code&gt; (this is the bucket where our Iceberg tables will be stored).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;fs.s3a.path.style.access&lt;/strong&gt;: Set this to &lt;code&gt;true&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;fs.s3a.endpoint&lt;/strong&gt;: Set to &lt;code&gt;minio:9000&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dremio.s3.compat&lt;/strong&gt;: Set to &lt;code&gt;true&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Encrypt Connection&lt;/strong&gt;: Uncheck this option, as we’re running Nessie locally over HTTP.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Save the Source&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;After filling out all the fields, click &lt;strong&gt;Save&lt;/strong&gt;. Dremio will now connect to the Nessie catalog, and you’ll see &lt;code&gt;lakehouse&lt;/code&gt; (or the name you assigned) listed in the &lt;strong&gt;Datasets&lt;/strong&gt; section of Dremio’s interface.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 3: Adding MinIO as an S3 Source in Dremio&lt;/h3&gt;
&lt;p&gt;In addition to Nessie, we can add MinIO as a general S3-compatible source in Dremio. This source allows us to access raw data files stored in the MinIO &lt;code&gt;lake&lt;/code&gt; bucket, enabling direct SQL queries on various file types (e.g., JSON, CSV, Parquet) without the need to define tables.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Add a New Source&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Click the &lt;strong&gt;Add Source&lt;/strong&gt; button in Dremio again, then select &lt;strong&gt;S3&lt;/strong&gt; as the source type.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configure the MinIO Connection&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;General Settings&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name&lt;/strong&gt;: Enter a name like &lt;code&gt;lake&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Credentials&lt;/strong&gt;: Choose &lt;strong&gt;AWS access key&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Access Key&lt;/strong&gt;: Enter &lt;code&gt;admin&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secret Key&lt;/strong&gt;: Enter &lt;code&gt;password&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Encrypt Connection&lt;/strong&gt;: Uncheck this option, as we’re running locally.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Advanced Options&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Enable Compatibility Mode&lt;/strong&gt;: Set this to &lt;code&gt;true&lt;/code&gt; to ensure compatibility with MinIO.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Root Path&lt;/strong&gt;: Set to &lt;code&gt;/lake&lt;/code&gt; (the bucket name for general storage).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;fs.s3a.path.style.access&lt;/strong&gt;: Set this to &lt;code&gt;true&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;fs.s3a.endpoint&lt;/strong&gt;: Set to &lt;code&gt;minio:9000&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Save the Source&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;After configuring these settings, click &lt;strong&gt;Save&lt;/strong&gt;. Dremio will connect to MinIO, and the &lt;code&gt;lake&lt;/code&gt; source will appear in the &lt;strong&gt;Datasets&lt;/strong&gt; section.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Verifying the Connections&lt;/h3&gt;
&lt;p&gt;With both sources connected, you should see &lt;code&gt;lakehouse&lt;/code&gt; and &lt;code&gt;lake&lt;/code&gt; listed under &lt;strong&gt;Datasets&lt;/strong&gt; in Dremio. These sources provide access to structured, versioned data in the &lt;code&gt;lakehouse&lt;/code&gt; bucket and general-purpose data in the &lt;code&gt;lake&lt;/code&gt; bucket.&lt;/p&gt;
&lt;p&gt;Let&apos;s explore how to use SQL within Dremio to create tables, insert data, and perform various data operations on our Iceberg tables.&lt;/p&gt;
&lt;h2&gt;How to Create Tables with SQL&lt;/h2&gt;
&lt;p&gt;Now that our environment is configured and connected, let&apos;s dive into creating tables using SQL in Dremio. Apache Iceberg tables in Dremio allow us to take advantage of Iceberg’s powerful features, such as schema evolution and advanced partitioning.&lt;/p&gt;
&lt;h3&gt;Creating Tables with &lt;code&gt;CREATE TABLE&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;CREATE TABLE&lt;/code&gt; command in Dremio allows us to define a new Iceberg table with specific columns, data types, and optional partitioning. Below, we’ll cover the syntax and provide examples for creating tables.&lt;/p&gt;
&lt;h3&gt;Basic Syntax for &lt;code&gt;CREATE TABLE&lt;/code&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE [IF NOT EXISTS] &amp;lt;table_name&amp;gt; (
  &amp;lt;column_name1&amp;gt; &amp;lt;data_type&amp;gt;,
  &amp;lt;column_name2&amp;gt; &amp;lt;data_type&amp;gt;,
  ...
)
[ PARTITION BY (&amp;lt;partition_transform&amp;gt;) ];
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;IF NOT EXISTS&lt;/code&gt;&lt;/strong&gt;: Optionally add this clause to create the table only if it does not already exist.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;table_name&lt;/strong&gt;: The name of the table to be created. In our setup, you can use lakehouse.&lt;code&gt;&amp;lt;table_name&amp;gt;&lt;/code&gt; to specify the location in the Nessie catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;column_name / data_type&lt;/strong&gt;: Define each column with a name and a data type (e.g., &lt;code&gt;VARCHAR&lt;/code&gt;, &lt;code&gt;INT&lt;/code&gt;, &lt;code&gt;TIMESTAMP&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;PARTITION BY&lt;/code&gt;&lt;/strong&gt;: Specify a partitioning strategy, which is especially useful for Iceberg tables. Iceberg supports several partition transforms, such as year, month, day, bucket, and truncate.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 1: Creating a Basic Table&lt;/h4&gt;
&lt;p&gt;Let’s create a simple table to store customer data.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE lakehouse.customers (
  id INT,
  first_name VARCHAR,
  last_name VARCHAR,
  age INT
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;p&gt;We define a customers table within the lakehouse source, where each row represents a customer with an &lt;code&gt;ID&lt;/code&gt;, &lt;code&gt;first name&lt;/code&gt;, &lt;code&gt;last name&lt;/code&gt;, and &lt;code&gt;age&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;Example 2: Creating a Partitioned Table&lt;/h4&gt;
&lt;p&gt;To optimize queries, we can partition the customers table by the first letter of the last_name column using the truncate transform.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE lakehouse.customers_partitioned (
  id INT,
  first_name VARCHAR,
  last_name VARCHAR,
  age INT
) PARTITION BY (truncate(1, last_name));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, we use the &lt;code&gt;PARTITION BY&lt;/code&gt; clause with &lt;code&gt;truncate(1, last_name)&lt;/code&gt;, which will partition the data by the first character of the &lt;code&gt;last_name&lt;/code&gt; column. Partitioning helps to improve query performance by allowing Dremio to read only the relevant data based on query filters.&lt;/p&gt;
&lt;h4&gt;Example 3: Creating a Date-Partitioned Table&lt;/h4&gt;
&lt;p&gt;If we have a table to store order data, we may want to partition it by the date the order was placed.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE lakehouse.orders (
  order_id INT,
  customer_id INT,
  order_date DATE,
  total_amount DOUBLE
) PARTITION BY (month(order_date));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this case, &lt;code&gt;month(order_date)&lt;/code&gt; partitions the table by the month of the &lt;code&gt;order_date&lt;/code&gt; field, making it easier to run queries filtered by month, as Iceberg will only read the relevant partitions.&lt;/p&gt;
&lt;h3&gt;Viewing Tables in Dremio&lt;/h3&gt;
&lt;p&gt;Once the tables are created, you can view them in Dremio’s Datasets section:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Navigate to the lakehouse source in the Dremio interface.&lt;/li&gt;
&lt;li&gt;You should see the &lt;code&gt;customers&lt;/code&gt;, &lt;code&gt;customers_partitioned&lt;/code&gt;,&lt;code&gt;and&lt;/code&gt;orders` tables listed.&lt;/li&gt;
&lt;li&gt;Clicking on a table name will show you the table, and in the metadata bar on the left show the schema, documentation and other information.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Now let&apos;s look at how to insert data into these tables using SQL.&lt;/p&gt;
&lt;h2&gt;How to Insert into Tables with SQL&lt;/h2&gt;
&lt;p&gt;With our tables created, the next step is to populate them with data. Dremio’s &lt;code&gt;INSERT INTO&lt;/code&gt; command allows us to add data to Apache Iceberg tables, whether inserting individual rows or multiple records at once.&lt;/p&gt;
&lt;h3&gt;Basic Syntax for &lt;code&gt;INSERT INTO&lt;/code&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO &amp;lt;table_name&amp;gt; [(&amp;lt;column1&amp;gt;, &amp;lt;column2&amp;gt;, ...)]
VALUES (value1, value2, ...), (value1, value2, ...), ...;
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;table_name:&lt;/strong&gt; The name of the table to insert data into, such as lakehouse.customers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;column1, column2, ...:&lt;/strong&gt; Optional column names if you’re inserting values into specific columns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;VALUES&lt;/code&gt;:&lt;/strong&gt; A list of values to insert. You can insert one or more rows by adding sets of values separated by commas.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 1: Inserting a Single Row&lt;/h4&gt;
&lt;p&gt;Let’s add a single row to the customers table.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO lakehouse.customers (id, first_name, last_name, age)
VALUES (1, &apos;John&apos;, &apos;Doe&apos;, 28);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;p&gt;We specify values for each column in the customers table: &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;first_name&lt;/code&gt;, &lt;code&gt;last_name&lt;/code&gt;, and &lt;code&gt;age&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This inserts a single record for a customer named John Doe, age 28.&lt;/p&gt;
&lt;h4&gt;Example 2: Inserting Multiple Rows&lt;/h4&gt;
&lt;p&gt;To add multiple rows to a table in one command, list each row in the VALUES clause.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO lakehouse.customers (id, first_name, last_name, age)
VALUES
  (2, &apos;Jane&apos;, &apos;Smith&apos;, 34),
  (3, &apos;Alice&apos;, &apos;Johnson&apos;, 22),
  (4, &apos;Bob&apos;, &apos;Williams&apos;, 45),
  (5, &apos;Charlie&apos;, &apos;Brown&apos;, 30);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;We insert multiple records into the customers table in a single command.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Each set of values corresponds to a different customer, making it easy to populate the table quickly.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 3: Inserting Data into a Partitioned Table&lt;/h4&gt;
&lt;p&gt;For partitioned tables, Dremio and Iceberg automatically manage the partitioning based on the table’s partitioning rules. Let’s add some data to the customers_partitioned table, which is partitioned by the first letter of last_name.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO lakehouse.customers_partitioned (id, first_name, last_name, age)
VALUES
  (6, &apos;Emma&apos;, &apos;Anderson&apos;, 29),
  (7, &apos;Frank&apos;, &apos;Baker&apos;, 35),
  (8, &apos;Grace&apos;, &apos;Clark&apos;, 41);
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;This inserts three records into the customers_partitioned table, and Dremio will handle partitioning based on the first letter of each last_name (e.g., &amp;quot;A&amp;quot; for Anderson, &amp;quot;B&amp;quot; for Baker, and &amp;quot;C&amp;quot; for Clark).&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 4: Inserting Data with a Select Query&lt;/h4&gt;
&lt;p&gt;You can also insert data into a table by selecting data from another table. This is particularly useful if you need to copy data or load data from a staging table.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;
INSERT INTO lakehouse.customers_partitioned (id, first_name, last_name, age)
SELECT id, first_name, last_name, age
FROM lakehouse.customers
WHERE age &amp;gt; 30;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;p&gt;We insert rows into &lt;code&gt;customers_partitioned&lt;/code&gt; by selecting records from the &lt;code&gt;customers&lt;/code&gt; table.
Only customers older than 30 are inserted into &lt;code&gt;customers_partitioned&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Verifying Inserted Data&lt;/h3&gt;
&lt;p&gt;To confirm that data was successfully inserted, you can use a SELECT query to retrieve and view the data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;Copy code
SELECT * FROM lakehouse.customers;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command will display all rows in the customers table, allowing you to verify that your insertions were successful.&lt;/p&gt;
&lt;p&gt;With INSERT INTO, you can populate your Iceberg tables with data, either by inserting individual rows, multiple records at once, or copying data from other tables. Next, let&apos;s explore how to query this data with SQL.&lt;/p&gt;
&lt;h2&gt;How to Query Tables with SQL&lt;/h2&gt;
&lt;p&gt;With data inserted into our tables, we can now use SQL to query and analyze it. Dremio supports various SQL features, including filtering, grouping, ordering, and even Iceberg’s unique time-travel capabilities.&lt;/p&gt;
&lt;h3&gt;Basic &lt;code&gt;SELECT&lt;/code&gt; Query Syntax&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;SELECT&lt;/code&gt; command allows you to retrieve data from a table. Here’s the basic syntax:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT [ALL | DISTINCT] &amp;lt;columns&amp;gt;
FROM &amp;lt;table_name&amp;gt;
[WHERE &amp;lt;condition&amp;gt;]
[GROUP BY &amp;lt;expression&amp;gt;]
[ORDER BY &amp;lt;column&amp;gt; [DESC]]
[LIMIT &amp;lt;count&amp;gt;];
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;ALL&lt;/code&gt; | &lt;code&gt;DISTINCT&lt;/code&gt;:&lt;/strong&gt; ALL returns all values, while DISTINCT eliminates duplicates. If omitted, ALL is used by default.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;columns:&lt;/strong&gt; Specify the columns you want to retrieve (e.g., id, first_name) or use * to retrieve all columns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt;:&lt;/strong&gt; Filters records based on a condition.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;GROUP BY&lt;/code&gt;:&lt;/strong&gt; Groups records with similar values, allowing aggregate functions like &lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;SUM&lt;/code&gt;, and &lt;code&gt;AVG&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;ORDER BY&lt;/code&gt;:&lt;/strong&gt; Sorts results by one or more columns; add &lt;code&gt;DESC&lt;/code&gt; for descending order.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;LIMIT&lt;/code&gt;:&lt;/strong&gt; Restricts the number of rows returned.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 1: Selecting All Columns&lt;/h4&gt;
&lt;p&gt;To view all data in the customers table, use SELECT *:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM lakehouse.customers;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query retrieves every row and column in the customers table.&lt;/p&gt;
&lt;h4&gt;Example 2: Filtering Results with WHERE&lt;/h4&gt;
&lt;p&gt;Use the &lt;code&gt;WHERE&lt;/code&gt; clause to filter records based on a condition. For instance, let’s retrieve all customers over the age of 30:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM lakehouse.customers
WHERE age &amp;gt; 30;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query returns only the rows where age is greater than 30.&lt;/p&gt;
&lt;h3&gt;Example 3: Grouping Results with GROUP BY&lt;/h3&gt;
&lt;p&gt;The GROUP BY clause groups records based on a specified column, allowing you to calculate aggregates. For example, let’s count the number of customers by age:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT age, COUNT(*) AS customer_count
FROM lakehouse.customers
GROUP BY age;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We group customers by age and count the number of customers in each age group.&lt;/li&gt;
&lt;li&gt;The result shows unique ages and the number of customers for each age.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 4: Ordering Results with ORDER BY&lt;/h4&gt;
&lt;p&gt;You can sort query results by one or more columns. To get a list of customers ordered by age in descending order:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM lakehouse.customers
ORDER BY age DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will display customers from the oldest to the youngest.&lt;/p&gt;
&lt;h4&gt;Example 5: Limiting the Number of Rows with LIMIT&lt;/h4&gt;
&lt;p&gt;Use LIMIT to restrict the number of rows returned. This is useful for viewing a sample of your data.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM lakehouse.customers
LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query will return only the first five rows in the customers table.&lt;/p&gt;
&lt;h4&gt;Example 6: Using Iceberg’s Time-Travel with Snapshots&lt;/h4&gt;
&lt;p&gt;One of Iceberg’s powerful features is time-travel, which allows you to query historical versions of a table. You can specify a particular snapshot ID or timestamp to view data as it was at that moment.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Query by Snapshot ID:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM lakehouse.customers AT SNAPSHOT &apos;1234567890123456789&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Replace &apos;1234567890123456789&apos; with the actual snapshot ID.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Query by Timestamp:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM lakehouse.customers AT TIMESTAMP &apos;2024-01-01 00:00:00.000&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Replace &apos;2024-01-01 00:00:00.000&apos; with the desired timestamp. This lets you view the table as it existed at that specific time.&lt;/p&gt;
&lt;h4&gt;Example 7: Aggregating with Window Functions&lt;/h4&gt;
&lt;p&gt;Window functions allow you to perform calculations across rows related to the current row within a specified window. For example, if we want to rank customers by age within groups, we can use RANK():&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT id, first_name, last_name, age,
  RANK() OVER (ORDER BY age DESC) AS age_rank
FROM lakehouse.customers;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query assigns a rank based on age, with the oldest customers ranked first.&lt;/p&gt;
&lt;h3&gt;Verifying Query Results&lt;/h3&gt;
&lt;p&gt;To ensure your queries are correct, you can run them in Dremio’s SQL Runner and examine the results in the output pane. Dremio provides performance insights and query details, making it easy to optimize and validate your SQL queries.&lt;/p&gt;
&lt;p&gt;With SELECT statements, you can retrieve, filter, group, and order data in Dremio, as well as take advantage of Iceberg’s time-travel capabilities. Next, we’ll look at how to update records in your tables using SQL.&lt;/p&gt;
&lt;h2&gt;How to Update Records with SQL&lt;/h2&gt;
&lt;p&gt;In Dremio, you can use SQL to update existing records in Apache Iceberg tables, making it easy to modify data without rewriting entire datasets. The &lt;code&gt;UPDATE&lt;/code&gt; command lets you change specific columns for rows that meet certain conditions.&lt;/p&gt;
&lt;h3&gt;Basic Syntax for &lt;code&gt;UPDATE&lt;/code&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;UPDATE &amp;lt;table_name&amp;gt;
SET &amp;lt;column1&amp;gt; = &amp;lt;value1&amp;gt;, &amp;lt;column2&amp;gt; = &amp;lt;value2&amp;gt;, ...
[WHERE &amp;lt;condition&amp;gt;];
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;table_name:&lt;/strong&gt; The name of the table you want to update, such as lakehouse.customers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;SET&lt;/code&gt;:&lt;/strong&gt; Specifies the columns and new values to assign.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt;:&lt;/strong&gt; An optional clause to filter the rows that should be updated. Without &lt;code&gt;WHERE&lt;/code&gt;, all rows in the table will be updated.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 1: Updating a Single Column&lt;/h4&gt;
&lt;p&gt;Suppose we want to update the age of a specific customer. We can use the WHERE clause to target the correct row:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;UPDATE lakehouse.customers
SET age = 29
WHERE id = 1;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;We update the age of the &lt;code&gt;customer&lt;/code&gt; with &lt;code&gt;id&lt;/code&gt; = 1 to 29.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Only rows that match the condition &lt;code&gt;id&lt;/code&gt; = 1 are affected.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 2: Updating Multiple Columns&lt;/h4&gt;
&lt;p&gt;You can update multiple columns in a single UPDATE command. Let’s change both the first_name and last_name of a customer:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;UPDATE lakehouse.customers
SET first_name = &apos;Jonathan&apos;, last_name = &apos;Doe-Smith&apos;
WHERE id = 1;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We update both &lt;code&gt;first_name&lt;/code&gt; and &lt;code&gt;last_name&lt;/code&gt; for the customer with &lt;code&gt;id&lt;/code&gt; = 1.&lt;/li&gt;
&lt;li&gt;This operation only affects rows that meet the &lt;code&gt;WHERE&lt;/code&gt; condition.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 3: Conditional Updates with WHERE&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;WHERE&lt;/code&gt; clause allows you to apply updates based on specific conditions. For instance, let’s increase the age of all customers under 25 by 1 year:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;UPDATE lakehouse.customers
SET age = age + 1
WHERE age &amp;lt; 25;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We increase the age by 1 for all customers where age is less than 25.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This approach is useful for performing bulk updates based on a condition.&lt;/p&gt;
&lt;h4&gt;Example 4: Updating Records in a Specific Branch&lt;/h4&gt;
&lt;p&gt;If you’re using Nessie to manage versions, you can update records within a specific branch. This allows you to make updates in an isolated environment, which you can later merge into the main branch.&lt;/p&gt;
&lt;p&gt;First you&apos;d need to create a new branch&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE BRANCH development IN lakehouse;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then you can update records in the branch&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;UPDATE lakehouse.customers
AT BRANCH &apos;development&apos;
SET age = 30
WHERE id = 3;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this case:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We update the age of the customer with id = 3 to 30 on the development branch.&lt;/li&gt;
&lt;li&gt;This change will only affect the specified branch until it is merged back into main.&lt;/li&gt;
&lt;li&gt;This only works for Nessie sources&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;MERGE BRANCH development INTO main IN lakehouse;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verifying Updates&lt;/h3&gt;
&lt;p&gt;To confirm your updates, you can query the table to view the modified records:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM lakehouse.customers WHERE id = 1;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query will display the updated row, allowing you to verify that the changes were applied successfully.&lt;/p&gt;
&lt;h3&gt;Important Notes on Updates&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Transactional Safety:&lt;/strong&gt; With Apache Iceberg, updates are transactional, so they ensure data consistency and reliability.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Using Branches:&lt;/strong&gt; When working with branches in Nessie, remember to specify the branch in your &lt;code&gt;UPDATE&lt;/code&gt; command if you want to limit changes to a specific branch.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Using the &lt;code&gt;UPDATE&lt;/code&gt; command, you can easily modify data in your Apache Iceberg tables in Dremio. Whether updating single rows or multiple records based on conditions, Dremio’s SQL capabilities make data management flexible and efficient. In the next section, we’ll explore how to alter a table’s structure using SQL.&lt;/p&gt;
&lt;h2&gt;How to Alter a Table with SQL&lt;/h2&gt;
&lt;p&gt;As your data needs evolve, you may need to modify the structure of an Apache Iceberg table. Dremio’s &lt;code&gt;ALTER TABLE&lt;/code&gt; command provides flexibility to add, drop, or modify columns in existing tables, allowing your schema to evolve without significant disruptions.&lt;/p&gt;
&lt;h3&gt;Basic Syntax for &lt;code&gt;ALTER TABLE&lt;/code&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE &amp;lt;table_name&amp;gt;
[ ADD COLUMNS ( &amp;lt;column_name&amp;gt; &amp;lt;data_type&amp;gt; [, ...] ) ]
[ DROP COLUMN &amp;lt;column_name&amp;gt; ]
[ ALTER COLUMN &amp;lt;column_name&amp;gt; SET MASKING POLICY &amp;lt;policy_name&amp;gt; ]
[ MODIFY COLUMN &amp;lt;column_name&amp;gt; &amp;lt;new_data_type&amp;gt; ];
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;table_name:&lt;/strong&gt; The name of the table you want to alter, such as lakehouse.customers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;ADD COLUMNS&lt;/code&gt;:&lt;/strong&gt; Adds new columns to the table.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;DROP COLUMN&lt;/code&gt;:&lt;/strong&gt; Removes a specified column from the table.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;ALTER COLUMN&lt;/code&gt;:&lt;/strong&gt; Allows you to set a masking policy for data security.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;MODIFY COLUMN&lt;/code&gt;:&lt;/strong&gt; Changes the data type of an existing column.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 1: Adding a New Column&lt;/h4&gt;
&lt;p&gt;To add a new column to an existing table, use the &lt;code&gt;ADD COLUMNS&lt;/code&gt; clause. Let’s add an email column to the customers table.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE lakehouse.customers
ADD COLUMNS (email VARCHAR);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;We add a new column email with the data type &lt;code&gt;VARCHAR&lt;/code&gt; to store customer email addresses.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;All existing rows will have &lt;code&gt;NULL&lt;/code&gt; as the default value in the new email column until data is populated.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 2: Dropping a Column&lt;/h4&gt;
&lt;p&gt;If a column is no longer needed, you can remove it using &lt;code&gt;DROP COLUMN&lt;/code&gt;. Let’s remove the age column from the customers table.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE lakehouse.customers
DROP COLUMN age;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The age column is removed from the customers table.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Once a column is dropped, the action cannot be undone, so use this command carefully.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 3: Modifying a Column’s Data Type&lt;/h4&gt;
&lt;p&gt;To change the data type of an existing column, use &lt;code&gt;MODIFY COLUMN&lt;/code&gt;. For example, let’s change the id column from &lt;code&gt;INT&lt;/code&gt; to &lt;code&gt;BIGINT&lt;/code&gt; to allow larger values.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE lakehouse.customers
MODIFY COLUMN id BIGINT;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We modify the id column to have a data type of &lt;code&gt;BIGINT&lt;/code&gt;, which can store larger values than &lt;code&gt;INT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Changing data types is restricted to compatible types (e.g., &lt;code&gt;INT&lt;/code&gt; to &lt;code&gt;BIGINT&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 4: Setting a Masking Policy on a Column&lt;/h4&gt;
&lt;p&gt;Data masking can enhance data security by obscuring sensitive information. In Dremio, you can apply a masking policy to a column, making sensitive data less accessible to unauthorized users.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE lakehouse.customers
ALTER COLUMN email
SET MASKING POLICY mask_email (email);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this case:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;We set a masking policy called mask_email on the email column. (these policies are UDF&apos;s you must create before hand)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The masking policy defines how the data in this column is obscured when queried by users who do not have permission to view the raw data.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 5: Adding a Partition Field&lt;/h4&gt;
&lt;p&gt;For Iceberg tables, you can adjust partitioning without rewriting the table. Let’s add a partition field to the customers table to partition data by the first letter of &lt;code&gt;last_name&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE lakehouse.customers
ADD PARTITION FIELD truncate(1, last_name);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;We partition the customers table by the first letter of &lt;code&gt;last_name&lt;/code&gt;, making queries more efficient when filtering by &lt;code&gt;last_name&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Iceberg’s partition evolution feature enables you to add or change partition fields without rewriting the existing data.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Verifying Alterations&lt;/h3&gt;
&lt;p&gt;After altering a table, you can verify the changes by checking the schema in Dremio’s Datasets section or by running a &lt;code&gt;SELECT&lt;/code&gt; query to observe the modified structure:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM lakehouse.customers;
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Important Notes on Table Alterations&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Schema Evolution:&lt;/strong&gt; Apache Iceberg supports schema evolution, allowing you to make changes to table structure with minimal disruption.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition Evolution:&lt;/strong&gt; Changes to partitioning do not require data rewriting, making it easy to adapt your partition strategy over time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Masking:&lt;/strong&gt; Applying masking policies ensures sensitive information is protected while maintaining accessibility for authorized users.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Using the ALTER TABLE command in Dremio, you can evolve the structure of your Apache Iceberg tables by adding, modifying, or removing columns, as well as updating partitioning strategies. In the next section, we’ll look at how to delete records from tables using SQL.&lt;/p&gt;
&lt;h2&gt;How to Delete Records with SQL&lt;/h2&gt;
&lt;p&gt;Deleting specific records from an Apache Iceberg table in Dremio can be done using the &lt;code&gt;DELETE&lt;/code&gt; command. This allows you to remove rows based on conditions, keeping your data relevant and up-to-date without needing to rewrite the entire dataset.&lt;/p&gt;
&lt;h3&gt;Basic Syntax for &lt;code&gt;DELETE&lt;/code&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;DELETE FROM &amp;lt;table_name&amp;gt;
[WHERE &amp;lt;condition&amp;gt;];
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;table_name:&lt;/strong&gt; The name of the table from which you want to delete records, such as lakehouse.customers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;WHERE:&lt;/strong&gt; An optional clause that filters rows based on a condition. Without WHERE, all rows in the table will be deleted.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 1: Deleting Specific Records&lt;/h4&gt;
&lt;p&gt;Suppose we want to delete records of customers under the age of 18. We can use the WHERE clause to filter these rows and remove them from the customers table.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;DELETE FROM lakehouse.customers
WHERE age &amp;lt; 18;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Only rows where age is less than 18 are deleted from the customers table.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;code&gt;WHERE&lt;/code&gt; clause ensures that only specific records are affected by the deletion.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 2: Deleting All Records&lt;/h4&gt;
&lt;p&gt;If you need to clear all data from a table but keep the table structure intact, simply omit the &lt;code&gt;WHERE&lt;/code&gt; clause.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;DELETE FROM lakehouse.customers;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Removes all rows from the customers table without deleting the table itself.&lt;/li&gt;
&lt;li&gt;The table schema remains intact, allowing new data to be inserted into the table later.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 3: Deleting Records in a Specific Branch&lt;/h4&gt;
&lt;p&gt;When using Nessie for versioned data management, you can delete records in an isolated branch. This allows for safe experimentation without affecting the main data.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;DELETE FROM lakehouse.customers
AT BRANCH development
WHERE age &amp;gt; 60;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We delete records where age is greater than 60 on the development branch.&lt;/li&gt;
&lt;li&gt;The main branch remains unaffected by this operation until the changes are merged back.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Verifying Deletions&lt;/h3&gt;
&lt;p&gt;To confirm that records were successfully deleted, run a &lt;code&gt;SELECT&lt;/code&gt; query on the table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM lakehouse.customers;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command will display the remaining records, allowing you to verify that the desired rows were removed.&lt;/p&gt;
&lt;h4&gt;Important Notes on Deleting Records&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Transactional Deletions:&lt;/strong&gt; With Iceberg’s support for ACID compliance, deletions are transactional, ensuring consistency and reliability.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Version Control with Branches:&lt;/strong&gt; Using Nessie’s branching capabilities, you can isolate deletions in specific branches, allowing safe experimentation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;code&gt;DELETE&lt;/code&gt; command in Dremio provides a straightforward way to remove unwanted data from your Apache Iceberg tables. This completes the basics of SQL operations with Apache Iceberg and Dremio, empowering you to handle data from creation to deletion with ease.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;We explored the essentials of SQL operations using Apache Iceberg and Dremio. By combining Dremio’s powerful query engine with Apache Iceberg’s robust data management capabilities, you can efficiently handle large datasets, support schema evolution, and take advantage of advanced features like time-travel and branching. Here’s a quick recap of what we covered:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What is SQL, Apache Iceberg, and Dremio&lt;/strong&gt;: We introduced the importance of SQL, Apache Iceberg as a data lakehouse table format, and Dremio as a platform that enhances querying capabilities in a data lakehouse environment.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Setting Up an Environment with Dremio, Nessie, and MinIO&lt;/strong&gt;: We configured a local environment using Docker Compose, allowing us to work with Dremio, Nessie for version control, and MinIO for S3-compatible storage.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Accessing Dremio and Connecting Nessie&lt;/strong&gt;: We connected Dremio to Nessie and MinIO, providing a foundation for managing and querying data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How to Create Tables with SQL&lt;/strong&gt;: Using the &lt;code&gt;CREATE TABLE&lt;/code&gt; command, we created Apache Iceberg tables, including partitioned tables for optimized performance.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How to Insert into Tables with SQL&lt;/strong&gt;: We populated our tables using the &lt;code&gt;INSERT INTO&lt;/code&gt; command, demonstrating single and batch inserts.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How to Query Tables with SQL&lt;/strong&gt;: With &lt;code&gt;SELECT&lt;/code&gt; queries, we retrieved data, applied filters, grouped results, and explored Iceberg’s time-travel capabilities.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How to Update Records with SQL&lt;/strong&gt;: We used the &lt;code&gt;UPDATE&lt;/code&gt; command to modify specific records based on conditions, showing how to evolve data as needs change.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How to Alter a Table with SQL&lt;/strong&gt;: Using &lt;code&gt;ALTER TABLE&lt;/code&gt;, we modified the structure of our tables, adding, dropping, and modifying columns as our data needs evolved.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How to Delete Records with SQL&lt;/strong&gt;: Finally, we covered the &lt;code&gt;DELETE&lt;/code&gt; command, enabling record removal based on conditions and managing data cleanly.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Next Steps&lt;/h3&gt;
&lt;p&gt;With these SQL basics under your belt, here are a few ways to continue expanding your skills with Apache Iceberg and Dremio:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Explore More SQL Functions&lt;/strong&gt;: Dive deeper into SQL functions supported by Dremio to handle more complex analytical tasks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Experiment with Data Branching and Merging&lt;/strong&gt;: Use Nessie’s branching and merging capabilities for safe experimentation, making it easier to test changes without affecting production data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Leverage Dremio Reflections&lt;/strong&gt;: Learn about Dremio’s Reflections feature to accelerate queries and enhance performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale to the Cloud&lt;/strong&gt;: Consider deploying Dremio and Iceberg in a cloud environment for greater scalability and to integrate with larger data sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By mastering these core SQL operations, you’re well-prepared to build, maintain, and analyze data in a modern data lakehouse architecture. Whether you’re managing structured or unstructured data, Dremio and Apache Iceberg offer the tools you need for efficient, flexible, and high-performance data workflows.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=introtosql&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=introtosql&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=introtosql&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=introtosql&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Dremio, Apache Iceberg and their role in AI-Ready Data</title><link>https://iceberglakehouse.com/posts/2024-11-Dremio-and-AI-Ready-Data/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-11-Dremio-and-AI-Ready-Data/</guid><description>
- [Blog: What is a Data Lakehouse and a Table Format?](https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-f...</description><pubDate>Tue, 05 Nov 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=dremioaireadydata&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=dremioaireadydata&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=dremioaireadydata&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=dremioaireadydata&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;AI models, whether for machine learning or deep learning, require vast amounts of data to train, validate, and test. But not just any data will do: this data must be accessible, scalable, and optimized for efficient processing. This is where the concept of &amp;quot;AI-ready data&amp;quot; comes into play.&lt;/p&gt;
&lt;p&gt;&amp;quot;AI-ready data&amp;quot; refers to data that meets specific criteria to support the demands of AI development: it must be accessible for easy access, scalable for large volumes, and governed to ensure compliance. Ensuring data meets these criteria can be challenging, especially with the complexity of modern data landscapes that include data lakes, databases, warehouses, and more.&lt;/p&gt;
&lt;p&gt;Let&apos;s explore the critical roles Dremio and Apache Iceberg play in making data AI-ready. By leveraging these tools, data teams can prepare, manage, and optimize structured data to meet the demands of AI workloads, helping organizations scale their AI development efficiently.&lt;/p&gt;
&lt;h2&gt;What is AI-Ready Data?&lt;/h2&gt;
&lt;p&gt;For data to be truly AI-ready, it must meet several key requirements. Here’s a look at the core attributes of AI-ready data and why each is essential in AI development:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Accessibility&lt;/strong&gt;: Data should be accessible from various environments and applications. AI models often rely on multiple data sources, and having data that’s readily accessible without extensive ETL (Extract, Transform, Load) processes saves time and resources.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: AI workloads are typically data-intensive. To scale, data must be stored in formats that allow for efficient retrieval and processing at scale, without performance bottlenecks.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Transformability&lt;/strong&gt;: AI models often require data in a particular structure or with certain attributes. AI-ready data should support complex transformations to fit the needs of different models, whether it’s feature engineering, data normalization, or other preprocessing steps.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Governance&lt;/strong&gt;: Ensuring compliance is crucial, especially when working with sensitive data. Governance controls, such as access rules and audit trails, ensure that data usage aligns with privacy policies and regulatory requirements. Governance is also important for model accuracy, making sure the model isn’t trained on irrelevant or unauthorized data.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Preparing data that meets these criteria can be difficult, particularly when handling vast amounts of structured and unstructured data across multiple systems. However, with tools like Apache Iceberg and Dremio, data teams can address these challenges and streamline structured data preparation for AI workloads.&lt;/p&gt;
&lt;h2&gt;How Apache Iceberg Enables AI-Ready Structured Data&lt;/h2&gt;
&lt;p&gt;Apache Iceberg is a powerful open table format designed for large-scale, structured data in data lakes. Its unique capabilities help make data AI-ready by ensuring accessibility, scalability, and flexibility in data management. Here’s how Iceberg supports the requirements of AI-ready data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Accessible, Transformable Data at Scale&lt;/strong&gt;: Apache Iceberg enables large-scale structured data to be easily accessed and transformed within data lakes, ensuring that data can be queried and modified without the complexities typically associated with data lake storage. Iceberg’s robust schema evolution and versioning features allow data to stay accessible and flexible, accommodating changing requirements for AI models.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Historical Data Benchmarking with Time Travel&lt;/strong&gt;: Iceberg’s time-travel functionality allows data teams to query historical versions of data, making it possible to benchmark models against different points in time. This is invaluable for training models on data snapshots from various periods, allowing comparison and validation with past data states.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Partition Evolution for Data Optimization&lt;/strong&gt;: Iceberg’s partition evolution feature enables experimentation with partitioning strategies, helping data teams optimize how data is organized and retrieved. Optimized partitioning allows for faster data access and retrieval, which can reduce model training time and improve overall efficiency.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;With these features, Apache Iceberg helps maintain structured data that’s accessible, transformable, and optimized, creating a robust foundation for AI workloads in data lakes.&lt;/p&gt;
&lt;h2&gt;How Dremio Empowers AI-Ready Data Management&lt;/h2&gt;
&lt;p&gt;Dremio provides a unified data platform that enhances the management and accessibility of data, making it an ideal tool for preparing AI-ready data. Here are some of the ways Dremio’s features support AI development:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;First-Class Support for Apache Iceberg&lt;/strong&gt;: Dremio integrates seamlessly with Apache Iceberg, allowing users to manage and query Iceberg tables without complex configurations. This makes it easier for data teams to leverage Iceberg’s capabilities directly within Dremio.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Federation Across Multiple Sources&lt;/strong&gt;: Dremio enables federated queries across databases, data warehouses, data lakes, and lakehouse catalogs, providing a unified view of disparate data sources. This removes data silos and allows AI models to access and utilize data from a variety of sources without moving or duplicating data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Curated Views for Simplified Data Wrangling&lt;/strong&gt;: Dremio allows users to create curated views on top of multiple data sources, simplifying data wrangling and transformation. These views provide a streamlined view of the data, making it easier to prepare data for AI without extensive data processing.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Integrated Catalog with Versioning&lt;/strong&gt;: Dremio’s integrated catalog supports versioning with multi-table branching, merging, and tagging. This allows data teams to create replicable data snapshots and zero-copy experimental environments, making it easy to experiment, tag datasets, and manage different versions of data used for AI development.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Apache Arrow Flight for Fast Data Access&lt;/strong&gt;: Dremio supports Apache Arrow Flight, a high-performance protocol that allows data to be pulled from Dremio at speeds much faster than traditional JDBC. This significantly accelerates data retrieval for model training, reducing overall model development time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Comprehensive SQL Functions for Data Wrangling&lt;/strong&gt;: Dremio provides a rich set of SQL functions that help data teams perform complex transformations and data wrangling tasks, making it efficient to prepare data for AI workloads.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Granular Access Controls&lt;/strong&gt;: Dremio offers role-based, row-based, and column-based access controls, ensuring that only authorized data is used for model training. This helps maintain compliance and prevents models from training on sensitive or unauthorized data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Query Acceleration with Data Reflections&lt;/strong&gt;: Dremio’s data reflections feature enables efficient query acceleration by creating optimized representations of datasets, tailored for specific types of queries. Data reflections reduce the need to repeatedly process raw data, instead offering pre-aggregated or pre-sorted versions that speed up query performance. For AI workloads, this translates to faster data retrieval, especially when models require frequent access to large or complex datasets, significantly reducing wait times during model training and experimentation.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By combining data federation, powerful data wrangling tools, integrated catalog management, and high-performance data access, Dremio empowers teams to manage data effectively for AI, supporting a seamless flow from raw data to AI-ready datasets.&lt;/p&gt;
&lt;h2&gt;Use Cases: Dremio and Apache Iceberg for AI Workloads&lt;/h2&gt;
&lt;p&gt;Let’s look at some practical scenarios where Dremio and Apache Iceberg streamline data preparation for AI workloads, showcasing how they help overcome common challenges in AI development:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Training Models on Historical Data Snapshots&lt;/strong&gt;: With Iceberg’s time-travel capabilities, data teams can train models on historical snapshots, enabling AI models to learn from data as it existed in different periods. This is particularly useful for time-sensitive applications, such as financial forecasting or customer behavior analysis, where benchmarking against historical trends is essential.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Experimenting with Data Optimization for Faster Model Training&lt;/strong&gt;: Iceberg’s partition evolution and Dremio’s curated views allow data teams to experiment with data layouts and transformations. By optimizing data partitioning, models can retrieve data faster, resulting in more efficient model training and faster experimentation cycles.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Creating Zero-Copy Experimental Environments&lt;/strong&gt;: With Dremio’s integrated catalog versioning, data teams can create isolated, zero-copy environments to test AI models on different datasets or data versions without affecting the original data. This enables rapid prototyping and experimentation, allowing data scientists to try different approaches and configurations safely and efficiently.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Unified Access to Diverse Data Sources for AI Development&lt;/strong&gt;: Dremio’s federated query capabilities enable AI models to access data across multiple sources, such as relational databases, data warehouses, and data lakes. This allows data scientists to bring together diverse datasets without moving or duplicating data, providing a more comprehensive training set for their models.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ensuring Compliance with Fine-Grained Access Controls&lt;/strong&gt;: Dremio’s role-based, row-based, and column-based access controls ensure that AI models train only on permissible data. This level of data governance is crucial for models that must meet regulatory standards, such as those in healthcare, finance, or other highly regulated industries.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Having access to &amp;quot;AI-ready data&amp;quot; is paramount for developing models that are accurate, efficient, and compliant. Dremio and Apache Iceberg are instrumental in creating a robust foundation for AI workloads, making it easy to access, transform, and manage large-scale structured data.&lt;/p&gt;
&lt;p&gt;With Iceberg, data teams gain control over data management at scale, leveraging features like time travel and partition evolution to keep data organized and optimized. Dremio complements this with seamless Iceberg integration, federated data access, and powerful data wrangling capabilities, enabling a smooth path from raw data to AI-ready datasets.&lt;/p&gt;
&lt;p&gt;Together, Dremio and Apache Iceberg provide an end-to-end solution that empowers data teams to meet the demands of modern AI. Whether you’re building models on historical data, experimenting with data partitions, or ensuring compliance with strict governance rules, Dremio and Iceberg offer the tools you need to manage and optimize data, setting the stage for successful AI development.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=dremioaireadydata&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=dremioaireadydata&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=dremioaireadydata&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=dremioaireadydata&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Cargo and cargo.toml</title><link>https://iceberglakehouse.com/posts/2024-11-rust-cargo-toml/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-11-rust-cargo-toml/</guid><description>
When working with Rust, Cargo is your go-to tool for managing dependencies, building, and running your projects. Acting as Rust&apos;s package manager and...</description><pubDate>Tue, 05 Nov 2024 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;When working with Rust, Cargo is your go-to tool for managing dependencies, building, and running your projects. Acting as Rust&apos;s package manager and build system, Cargo simplifies a lot of the heavy lifting in a project’s lifecycle. Central to this is the &lt;code&gt;cargo.toml&lt;/code&gt; file, which is at the heart of every Cargo-managed Rust project.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;cargo.toml&lt;/code&gt; file serves as the project&apos;s configuration file, defining essential details like metadata, dependencies, and optional features. This file not only controls which libraries your project depends on but also provides configurations for different build profiles, conditional compilation features, and workspace settings.&lt;/p&gt;
&lt;p&gt;Understanding &lt;code&gt;cargo.toml&lt;/code&gt; is crucial for managing dependencies efficiently, setting up multiple crates within a workspace, and optimizing your project&apos;s build performance. In this guide, we’ll explore how &lt;code&gt;cargo.toml&lt;/code&gt; is structured, how to add dependencies, define build configurations, and make the most of this file to manage your Rust projects effectively.&lt;/p&gt;
&lt;h2&gt;Structure of the &lt;code&gt;cargo.toml&lt;/code&gt; File&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;cargo.toml&lt;/code&gt; file is organized into multiple sections, each serving a specific purpose in configuring various aspects of a Rust project. Let’s break down the key sections you’ll encounter:&lt;/p&gt;
&lt;h3&gt;&lt;code&gt;[package]&lt;/code&gt;: General Project Metadata&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;[package]&lt;/code&gt; section contains metadata about your Rust project. It includes fields like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;name&lt;/code&gt;: The name of your package, which should be unique if you’re publishing to crates.io.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;version&lt;/code&gt;: The version of your project, following Semantic Versioning (e.g., &lt;code&gt;1.0.0&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;authors&lt;/code&gt;: Your name or the names of the contributors (optional).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;edition&lt;/code&gt;: Specifies the Rust edition you’re using, such as &lt;code&gt;2018&lt;/code&gt; or &lt;code&gt;2021&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[package]
name = &amp;quot;my_project&amp;quot;
version = &amp;quot;0.1.0&amp;quot;
authors = [&amp;quot;Alex Merced &amp;lt;alex@example.com&amp;gt;&amp;quot;]
edition = &amp;quot;2021&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;&lt;code&gt;[dependencies]&lt;/code&gt;: Managing Project Dependencies&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;[dependencies]&lt;/code&gt; section lists the external libraries your project relies on. For each dependency, you specify the name and version, and Cargo will automatically download and manage these dependencies.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dependencies]
serde = &amp;quot;1.0&amp;quot;
reqwest = { version = &amp;quot;0.11&amp;quot;, features = [&amp;quot;json&amp;quot;] }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This example includes serde with a version constraint and reqwest with specific features enabled.&lt;/p&gt;
&lt;h3&gt;&lt;code&gt;[dev-dependencies]&lt;/code&gt;: Development-Only Dependencies&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;[dev-dependencies]&lt;/code&gt; works like &lt;code&gt;[dependencies]&lt;/code&gt; but is only used for development or testing. For example, if you need a library solely for testing, you can add it here, and it won’t be included in the final build.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dev-dependencies]
rand = &amp;quot;0.8&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;&lt;code&gt;[features]&lt;/code&gt;: Defining Optional Features&lt;/h3&gt;
&lt;p&gt;Features allow you to conditionally include dependencies or enable specific parts of your project. They’re useful for creating optional functionality and reducing bloat in builds.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[features]
default = [&amp;quot;json_support&amp;quot;]
json_support = [&amp;quot;serde&amp;quot;, &amp;quot;serde_json&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, the &lt;code&gt;json_support&lt;/code&gt; feature adds &lt;code&gt;serde&lt;/code&gt; and &lt;code&gt;serde_json&lt;/code&gt; libraries, and it’s included by default.&lt;/p&gt;
&lt;h3&gt;&lt;code&gt;[profile]&lt;/code&gt;: Configurations for Build Profiles&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;[profile]&lt;/code&gt; section allows customization of build settings for different profiles, such as dev for development and release for optimized production builds. Adjusting these settings helps optimize for speed, size, or other factors based on your environment.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[profile.release]
opt-level = 3
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, the opt-level for release builds is set to 3, the highest optimization level.&lt;/p&gt;
&lt;p&gt;These sections provide a foundational understanding of cargo.toml. In the following sections, we’ll dive into more details on each and show how to use them effectively.&lt;/p&gt;
&lt;h2&gt;Configuring Project Metadata&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;[package]&lt;/code&gt; section of &lt;code&gt;cargo.toml&lt;/code&gt; provides essential metadata about your project, which can be useful for project organization, publishing, and versioning. Let’s explore the common fields used within this section and their purposes:&lt;/p&gt;
&lt;h3&gt;Key Fields in &lt;code&gt;[package]&lt;/code&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;name&lt;/code&gt;&lt;/strong&gt;: The name of your project, which should be unique if you plan to publish to crates.io. This name is how users will identify and include your crate as a dependency.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;name = &amp;quot;my_project&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;version&lt;/code&gt;&lt;/strong&gt;: Specifies the current version of your project. Cargo follows Semantic Versioning, so use a version format like 0.1.0 or 1.0.0. This field is especially important for tracking releases.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;  version = &amp;quot;0.1.0&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;authors&lt;/code&gt;&lt;/strong&gt;: An optional list of contributors’ names or emails. Although it’s not mandatory, adding authors can help document who has worked on the project.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;authors = [&amp;quot;Alex Merced &amp;lt;alex@example.com&amp;gt;&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;edition&lt;/code&gt;&lt;/strong&gt;: Specifies the Rust edition your project is based on. The most common editions are 2018 and 2021. This setting ensures compatibility with language features specific to each edition.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;edition = &amp;quot;2021&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;description&lt;/code&gt;&lt;/strong&gt;: A short description of your project, which is optional but useful if you plan to publish your crate. It gives users a quick idea of what your project does.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;description = &amp;quot;A simple Rust project demonstrating cargo.toml usage&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;license:&lt;/code&gt;&lt;/strong&gt; Defines the license under which your project is distributed. Common choices include MIT, Apache-2.0, or GPL-3.0. Licensing helps clarify legal use for other developers and users.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;license = &amp;quot;MIT&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;repository:&lt;/code&gt;&lt;/strong&gt; A link to the project’s repository (e.g., GitHub). Providing this link is helpful for users who want to see the source code or contribute.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;repository = &amp;quot;https://github.com/alexmerced/my_project&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;documentation:&lt;/code&gt;&lt;/strong&gt; A URL linking to the project’s documentation. This is especially useful if you’ve hosted API docs, like those generated by cargo doc, on platforms such as docs.rs.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;documentation = &amp;quot;https://docs.rs/my_project&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Example &lt;code&gt;[package]&lt;/code&gt; Section&lt;/h3&gt;
&lt;p&gt;Here’s an example that combines these fields to form a complete &lt;code&gt;[package]&lt;/code&gt; configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[package]
name = &amp;quot;my_project&amp;quot;
version = &amp;quot;0.1.0&amp;quot;
authors = [&amp;quot;Alex Merced &amp;lt;alex@example.com&amp;gt;&amp;quot;]
edition = &amp;quot;2021&amp;quot;
description = &amp;quot;A simple Rust project demonstrating cargo.toml usage&amp;quot;
license = &amp;quot;MIT&amp;quot;
repository = &amp;quot;https://github.com/alexmerced/my_project&amp;quot;
documentation = &amp;quot;https://docs.rs/my_project&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This setup makes your project easier to understand, document, and share. With a well-configured &lt;code&gt;[package]&lt;/code&gt; section, your project gains a professional touch, preparing it for development, collaboration, or even public release on crates.io.&lt;/p&gt;
&lt;h2&gt;Adding and Managing Dependencies&lt;/h2&gt;
&lt;p&gt;Dependencies are a core aspect of any Rust project, enabling you to reuse code and leverage external libraries. The &lt;code&gt;[dependencies]&lt;/code&gt; section of &lt;code&gt;cargo.toml&lt;/code&gt; lets you specify which libraries (or &amp;quot;crates&amp;quot;) your project requires and manages them efficiently.&lt;/p&gt;
&lt;h3&gt;Basic Dependency Syntax&lt;/h3&gt;
&lt;p&gt;To add a dependency, simply specify the crate name and version in the &lt;code&gt;[dependencies]&lt;/code&gt; section. Cargo will automatically fetch and compile it for you.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dependencies]
serde = &amp;quot;1.0&amp;quot;  # Add Serde library for serialization/deserialization
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, the serde crate will be added at the latest compatible version within the &lt;code&gt;1.0.x&lt;/code&gt; series. Cargo&apos;s versioning follows Semantic Versioning, meaning &lt;code&gt;1.0&lt;/code&gt; covers any version from &lt;code&gt;1.0.0&lt;/code&gt; to &lt;code&gt;&amp;lt;2.0.0&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Specifying Dependency Versions&lt;/h3&gt;
&lt;p&gt;You can control the version of each dependency by using different version specifiers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Exact Version&lt;/strong&gt;: Only uses this exact version.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;serde = &amp;quot;=1.0.104&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Caret (&lt;code&gt;^&lt;/code&gt;)&lt;/strong&gt;: Allows updates within the same major version (default behavior).&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;serde = &amp;quot;^1.0&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tilde (&lt;code&gt;~&lt;/code&gt;)&lt;/strong&gt;: Allows updates within the same minor version.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;serde = &amp;quot;~1.0.104&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Wildcard (&lt;code&gt;*&lt;/code&gt;)&lt;/strong&gt;: Accepts any version, which can lead to unpredictable changes in your project.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;serde = &amp;quot;*&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Using Features with Dependencies&lt;/h3&gt;
&lt;p&gt;Some crates offer optional features that you can enable in cargo.toml. For instance, the reqwest crate has features for JSON support. You can enable these by specifying them within the dependency configuration.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dependencies]
reqwest = { version = &amp;quot;0.11&amp;quot;, features = [&amp;quot;json&amp;quot;] }
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Adding Git Dependencies&lt;/h3&gt;
&lt;p&gt;Cargo supports dependencies directly from Git repositories, allowing you to include unreleased versions or custom forks. You can also specify a branch, tag, or commit to ensure consistency.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dependencies]
my_crate = { git = &amp;quot;https://github.com/user/my_crate.git&amp;quot;, branch = &amp;quot;main&amp;quot; }
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Path Dependencies for Local Crates&lt;/h3&gt;
&lt;p&gt;If you have a local crate you want to use as a dependency, specify its path. This is useful for working on related crates without publishing them.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dependencies]
my_local_crate = { path = &amp;quot;../my_local_crate&amp;quot; }
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Dev-Only Dependencies&lt;/h3&gt;
&lt;p&gt;Dependencies in the &lt;code&gt;[dev-dependencies]&lt;/code&gt; section are only used for development (e.g., testing frameworks) and will not be included in the final build. This helps keep production builds smaller and faster.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dev-dependencies]
rand = &amp;quot;0.8&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Optional Dependencies&lt;/h3&gt;
&lt;p&gt;Optional dependencies can be enabled as needed by configuring them in &lt;code&gt;[features]&lt;/code&gt; and adding them to cargo.toml. This allows you to activate these dependencies on demand, reducing bloat.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dependencies]
serde_json = { version = &amp;quot;1.0&amp;quot;, optional = true }

[features]
default = []
json_support = [&amp;quot;serde_json&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, you can enable &lt;code&gt;json_support&lt;/code&gt; by using &lt;code&gt;cargo build --features &amp;quot;json_support&amp;quot;&lt;/code&gt;, adding the functionality only when needed.&lt;/p&gt;
&lt;p&gt;Example of a Complete &lt;code&gt;[dependencies]&lt;/code&gt; Section
Here’s a &lt;code&gt;[dependencies]&lt;/code&gt; section showcasing different types of dependencies:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dependencies]
serde = &amp;quot;1.0&amp;quot;  # Standard dependency
rand = { version = &amp;quot;0.8&amp;quot;, features = [&amp;quot;small_rng&amp;quot;] }  # Dependency with features
my_crate = { git = &amp;quot;https://github.com/user/my_crate.git&amp;quot;, branch = &amp;quot;main&amp;quot; }  # Git dependency
serde_json = { version = &amp;quot;1.0&amp;quot;, optional = true }  # Optional dependency

[dev-dependencies]
mockito = &amp;quot;0.29&amp;quot;  # Dev-only dependency

[features]
default = []
json_support = [&amp;quot;serde_json&amp;quot;]  # Feature for optional dependency
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This setup provides flexibility for managing dependencies based on your project’s requirements. By organizing dependencies in this way, you gain control over your project’s footprint, allowing for efficient, maintainable, and optimized builds.&lt;/p&gt;
&lt;h2&gt;Using Features for Conditional Compilation&lt;/h2&gt;
&lt;p&gt;Features in &lt;code&gt;cargo.toml&lt;/code&gt; allow you to enable or disable certain functionalities within your project based on conditional dependencies. This is particularly useful when you want to offer optional components or modularize your code for different use cases. By using feature flags, you can control which parts of your codebase get compiled, helping to keep the build lightweight and efficient.&lt;/p&gt;
&lt;h3&gt;Defining Features in &lt;code&gt;cargo.toml&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;To define features, add them under the &lt;code&gt;[features]&lt;/code&gt; section in &lt;code&gt;cargo.toml&lt;/code&gt;. Each feature is a list of dependencies or other features that should be enabled when the feature itself is activated.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[features]
default = [&amp;quot;json_support&amp;quot;]  # Sets `json_support` as the default feature
json_support = [&amp;quot;serde&amp;quot;, &amp;quot;serde_json&amp;quot;]  # Enables Serde and Serde JSON support
async = [&amp;quot;tokio&amp;quot;]  # Adds async functionality with Tokio
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The default feature includes &lt;code&gt;json_support&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;json_support&lt;/code&gt; feature enables both &lt;code&gt;serde&lt;/code&gt; and &lt;code&gt;serde_json&lt;/code&gt; libraries.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;async&lt;/code&gt; feature brings in tokio for asynchronous programming.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Enabling Features at Build Time&lt;/h3&gt;
&lt;p&gt;To compile with a specific feature, use the &lt;code&gt;--features&lt;/code&gt; flag when running Cargo commands, like &lt;code&gt;cargo build&lt;/code&gt;. For example, to enable the &lt;code&gt;async&lt;/code&gt; feature, run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cargo build --features &amp;quot;async&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If your default feature is defined, it will be activated by default unless you specify &lt;code&gt;--no-default-features&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cargo build --no-default-features --features &amp;quot;async&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Using Feature Flags in Code&lt;/h3&gt;
&lt;p&gt;In your Rust code, you can use the cfg attribute to conditionally include code based on active features. This keeps the codebase modular and allows you to add/remove functionality based on build requirements.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-rust&quot;&gt;#[cfg(feature = &amp;quot;async&amp;quot;)]
async fn async_function() {
    // Async function logic
}

#[cfg(not(feature = &amp;quot;async&amp;quot;))]
fn async_function() {
    // Non-async fallback logic
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, the async_function function behaves differently depending on whether the async feature is enabled.&lt;/p&gt;
&lt;h3&gt;Combining Multiple Features&lt;/h3&gt;
&lt;p&gt;Sometimes, you might want a feature that only enables certain functionality if multiple other features are active. You can achieve this by combining features in the &lt;code&gt;[features]&lt;/code&gt; section.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[features]
default = []
full = [&amp;quot;json_support&amp;quot;, &amp;quot;async&amp;quot;]  # Combines `json_support` and `async`
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this configuration, enabling the &lt;code&gt;full feature&lt;/code&gt; will activate both &lt;code&gt;json_support&lt;/code&gt; and &lt;code&gt;async&lt;/code&gt; simultaneously.&lt;/p&gt;
&lt;h3&gt;Practical Example of Feature Flags&lt;/h3&gt;
&lt;p&gt;Suppose you’re building a library that has JSON support and async capabilities as optional features. Here’s how your cargo.toml might look:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dependencies]
serde = { version = &amp;quot;1.0&amp;quot;, optional = true }
serde_json = { version = &amp;quot;1.0&amp;quot;, optional = true }
tokio = { version = &amp;quot;1.0&amp;quot;, optional = true }

[features]
default = []
json_support = [&amp;quot;serde&amp;quot;, &amp;quot;serde_json&amp;quot;]
async = [&amp;quot;tokio&amp;quot;]
full = [&amp;quot;json_support&amp;quot;, &amp;quot;async&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this setup:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;json_support&lt;/code&gt; feature enables &lt;code&gt;serde&lt;/code&gt; and &lt;code&gt;serde_json&lt;/code&gt; for JSON handling.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;async&lt;/code&gt; feature enables &lt;code&gt;tokio&lt;/code&gt; for asynchronous programming.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;full feature&lt;/code&gt; enables both &lt;code&gt;json_support&lt;/code&gt; and &lt;code&gt;async&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To use only JSON support, run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cargo build --features &amp;quot;json_support&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or to use everything with the full feature:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cargo build --features &amp;quot;full&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Benefits of Using Features&lt;/h3&gt;
&lt;p&gt;Using feature flags in cargo.toml can make your project more flexible and modular:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reduce Bloat&lt;/strong&gt;: Only compile what’s necessary for each use case.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved Compile Times&lt;/strong&gt;: Faster compilation when unused features are disabled.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Targeted Functionality&lt;/strong&gt;: Offer a single codebase with multiple configurations, making your library or application more adaptable.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With feature flags, cargo.toml enables conditional compilation that fits various project requirements and user preferences, optimizing both development and runtime performance.&lt;/p&gt;
&lt;h2&gt;Configuring Build Profiles&lt;/h2&gt;
&lt;p&gt;Cargo provides different build profiles to optimize your project based on specific needs, such as development or production. These profiles let you adjust settings like optimization levels, debug symbols, and other compiler flags. The main profiles in &lt;code&gt;cargo.toml&lt;/code&gt; are &lt;code&gt;dev&lt;/code&gt;, &lt;code&gt;release&lt;/code&gt;, and custom profiles you can define as needed.&lt;/p&gt;
&lt;h3&gt;Common Build Profiles&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;dev&lt;/code&gt;&lt;/strong&gt;: This is the default profile for development builds, which prioritizes compile speed over runtime performance. It includes debug information but does not heavily optimize the code.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;release&lt;/code&gt;&lt;/strong&gt;: The release profile is optimized for performance and typically used for production builds. It enables higher levels of optimization but takes longer to compile.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Configuring Profiles in &lt;code&gt;cargo.toml&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;You can customize each profile by defining them in the &lt;code&gt;[profile.*]&lt;/code&gt; sections of &lt;code&gt;cargo.toml&lt;/code&gt;. Each profile has various settings that control the build process:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;opt-level&lt;/code&gt;&lt;/strong&gt;: Controls the optimization level, with values from 0 (no optimization) to 3 (maximum optimization).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;debug&lt;/code&gt;&lt;/strong&gt;: Controls the inclusion of debug symbols, helpful for debugging.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;lto&lt;/code&gt;&lt;/strong&gt;: Enables Link-Time Optimization, which can reduce binary size.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;panic&lt;/code&gt;&lt;/strong&gt;: Determines how panics are handled (&lt;code&gt;unwind&lt;/code&gt; or &lt;code&gt;abort&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Customizing the &lt;code&gt;dev&lt;/code&gt; Profile&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;dev&lt;/code&gt; profile is ideal for development, focusing on quick compile times and ease of debugging. You might want to add minimal optimization for better performance while testing.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[profile.dev]
opt-level = 0  # No optimization for fast compile times
debug = true   # Include debug symbols
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, no optimization is applied to keep build times short, and debug symbols are included to aid debugging.&lt;/p&gt;
&lt;h3&gt;Customizing the release Profile&lt;/h3&gt;
&lt;p&gt;The release profile is typically used for production builds, prioritizing runtime performance through higher optimization levels. This can make your application faster and reduce binary size, but it comes with longer compile times.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[profile.release]
opt-level = 3    # Maximum optimization for performance
debug = false    # Exclude debug symbols for smaller binary size
lto = true       # Link-Time Optimization for further size reduction
panic = &amp;quot;abort&amp;quot;  # Use `abort` to reduce binary size further
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this setup:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The opt-level of 3 maximizes performance.&lt;/li&gt;
&lt;li&gt;debug is set to false to exclude debug symbols, keeping the binary smaller.&lt;/li&gt;
&lt;li&gt;lto enables Link-Time Optimization to further reduce the binary size.&lt;/li&gt;
&lt;li&gt;panic = &amp;quot;abort&amp;quot; changes the panic strategy to abort, which can further reduce binary size.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Defining Custom Profiles&lt;/h3&gt;
&lt;p&gt;You can create custom profiles if you need specific settings for different environments, such as testing or benchmarking. For instance, a bench profile could be created to optimize for performance testing.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[profile.bench]
opt-level = 3
debug = false
overflow-checks = false  # Disable overflow checks for benchmarking
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This bench profile maximizes performance by disabling overflow checks and excluding debug symbols, making it suitable for benchmarking.&lt;/p&gt;
&lt;p&gt;Example of a Complete Profile Configuration
Here’s an example configuration that customizes both dev and release profiles while adding a custom bench profile:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[profile.dev]
opt-level = 1       # Low-level optimization for faster dev builds
debug = true        # Include debug symbols
overflow-checks = true

[profile.release]
opt-level = 3       # Max optimization for production
debug = false       # Exclude debug symbols
lto = &amp;quot;fat&amp;quot;         # Enable Link-Time Optimization
panic = &amp;quot;abort&amp;quot;     # Use abort for panics

[profile.bench]
opt-level = 3       # High optimization for benchmarks
debug = false       # Exclude debug symbols for smaller binary
overflow-checks = false  # Disable overflow checks to reduce overhead
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Choosing the Right Profile&lt;/h3&gt;
&lt;p&gt;When building, Cargo automatically selects the dev profile for cargo build and the release profile for &lt;code&gt;cargo build --release&lt;/code&gt;. You can also specify custom profiles when running cargo commands by using the &lt;code&gt;--profile&lt;/code&gt; flag:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cargo build --profile bench
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Benefits of Profile Customization&lt;/h3&gt;
&lt;p&gt;Customizing profiles in cargo.toml helps you optimize your project based on your current needs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Development Efficiency&lt;/strong&gt;: Faster builds with the dev profile keep your development loop quick.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Production Performance&lt;/strong&gt;: release profile optimizations ensure your app runs efficiently in production.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Targeted Tuning&lt;/strong&gt;: Custom profiles allow you to fine-tune settings for testing, benchmarking, or any other specialized needs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Configuring build profiles is a powerful way to control the balance between performance, debugging, and compile time, giving you a flexible workflow from development to production&lt;/p&gt;
&lt;h2&gt;Workspace and Sub-Crate Configurations&lt;/h2&gt;
&lt;p&gt;In Rust, a workspace allows you to manage multiple related packages (or &amp;quot;crates&amp;quot;) within a single project directory, sharing common dependencies and build output. Workspaces are helpful when you want to organize large projects into smaller, modular crates that can be built, tested, and developed together. This setup is especially valuable for monorepo-style projects, where all related crates live in a single repository.&lt;/p&gt;
&lt;h3&gt;Setting Up a Workspace&lt;/h3&gt;
&lt;p&gt;To create a workspace, start by defining a &lt;code&gt;[workspace]&lt;/code&gt; section in the root &lt;code&gt;cargo.toml&lt;/code&gt; file. In this section, you’ll specify which directories contain the member crates of the workspace.&lt;/p&gt;
&lt;p&gt;For example, in the root &lt;code&gt;cargo.toml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[workspace]
members = [&amp;quot;crate_a&amp;quot;, &amp;quot;crate_b&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This setup indicates that there are two crates within the workspace: &lt;code&gt;crate_a&lt;/code&gt; and &lt;code&gt;crate_b&lt;/code&gt;, located in directories named &lt;code&gt;crate_a&lt;/code&gt; and &lt;code&gt;crate_b&lt;/code&gt; within the project root.&lt;/p&gt;
&lt;h3&gt;Creating Sub-Crates&lt;/h3&gt;
&lt;p&gt;Each member of the workspace (sub-crate) needs its own &lt;code&gt;cargo.toml&lt;/code&gt; file, where you define the specific dependencies and settings for that crate. Each crate in a workspace functions as an independent Rust package but shares common build output and dependencies with the other workspace members.&lt;/p&gt;
&lt;p&gt;For example, the &lt;code&gt;cargo.toml&lt;/code&gt; for &lt;code&gt;crate_a&lt;/code&gt; might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[package]
name = &amp;quot;crate_a&amp;quot;
version = &amp;quot;0.1.0&amp;quot;
edition = &amp;quot;2021&amp;quot;

[dependencies]
serde = &amp;quot;1.0&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And &lt;code&gt;crate_b&lt;/code&gt;’s &lt;code&gt;cargo.toml&lt;/code&gt; could be:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[package]
name = &amp;quot;crate_b&amp;quot;
version = &amp;quot;0.1.0&amp;quot;
edition = &amp;quot;2021&amp;quot;

[dependencies]
rand = &amp;quot;0.8&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Sharing Dependencies Across Crates&lt;/h3&gt;
&lt;p&gt;One of the advantages of a workspace is that it allows crates to share dependencies, reducing duplication and ensuring version consistency. You can specify dependencies in the root cargo.toml so that all workspace members have access to them without redefining the dependencies in each sub-crate.&lt;/p&gt;
&lt;p&gt;For example, you can add a shared dependency like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[workspace.dependencies]
serde = &amp;quot;1.0&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, all workspace members can use serde without adding it to their individual cargo.toml files.&lt;/p&gt;
&lt;h3&gt;Inter-Crate Dependencies&lt;/h3&gt;
&lt;p&gt;In many cases, one crate in a workspace will depend on another crate in the same workspace. To specify such a dependency, reference the other crate by name in the cargo.toml file, and Cargo will understand that it refers to a member of the workspace.&lt;/p&gt;
&lt;p&gt;For example, if &lt;code&gt;crate_b&lt;/code&gt; depends on &lt;code&gt;crate_a&lt;/code&gt;, you would add this to &lt;code&gt;crate_b&lt;/code&gt;&apos;s &lt;code&gt;cargo.toml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dependencies]
crate_a = { path = &amp;quot;../crate_a&amp;quot; }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Cargo will recognize &lt;code&gt;crate_a&lt;/code&gt; as part of the workspace and handle the dependency locally.&lt;/p&gt;
&lt;h3&gt;Managing Workspace Configuration&lt;/h3&gt;
&lt;p&gt;You can also set configurations specific to the workspace, such as build profiles or custom features, within the &lt;code&gt;[workspace]&lt;/code&gt; section of the root cargo.toml. This allows you to configure build settings and features that apply across all workspace members.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[workspace]
members = [&amp;quot;crate_a&amp;quot;, &amp;quot;crate_b&amp;quot;]

[profile.release]
opt-level = 3
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, all crates in the workspace will use an optimization level of 3 for release builds, reducing binary size and improving runtime performance.&lt;/p&gt;
&lt;h3&gt;Example Project Structure&lt;/h3&gt;
&lt;p&gt;Here’s how a workspace project might look in your file system:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;my_workspace/
├── Cargo.toml           # Root workspace configuration
├── crate_a/
│   └── Cargo.toml       # crate_a configuration
├── crate_b/
│   └── Cargo.toml       # crate_b configuration
└── target/              # Shared build output directory
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this structure, all build output will be stored in a single target/ directory, reducing redundancy and speeding up compilation when multiple crates share dependencies.&lt;/p&gt;
&lt;h3&gt;Benefits of Using Workspaces&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dependency Management&lt;/strong&gt;: Avoid duplicating dependencies by sharing them across crates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Build Efficiency&lt;/strong&gt;: Workspace members share a single target/ directory, reducing compilation time and storage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Modularity&lt;/strong&gt;: Break down complex projects into modular crates that can be developed and tested independently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Version Control&lt;/strong&gt;: Simplifies managing versioning within related packages, especially useful for large projects.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By setting up a workspace, you can streamline your project structure, reduce duplication, and make your Rust project more modular and scalable, all while keeping related packages tightly integrated.&lt;/p&gt;
&lt;h2&gt;Advanced Configuration Options&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;cargo.toml&lt;/code&gt; file provides several advanced options that allow you to further customize and fine-tune your Rust project. These configurations are useful for handling edge cases, managing dependencies in complex projects, and adding metadata to your package. Let’s explore some of these advanced options.&lt;/p&gt;
&lt;h3&gt;&lt;code&gt;[patch]&lt;/code&gt;: Overriding Dependencies&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;[patch]&lt;/code&gt; section allows you to override dependencies across your project. This is helpful if you need to fix a bug in an external crate or use a custom version of a dependency without waiting for an official release. By specifying &lt;code&gt;[patch]&lt;/code&gt;, you can tell Cargo to use a different source for a specific dependency across the entire workspace.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[patch.crates-io]
serde = { git = &amp;quot;https://github.com/your-fork/serde.git&amp;quot;, branch = &amp;quot;fix-branch&amp;quot; }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, all references to serde in the project will use the specified Git repository instead of crates.io.&lt;/p&gt;
&lt;h3&gt;&lt;code&gt;[replace]&lt;/code&gt;: Replacing Dependencies&lt;/h3&gt;
&lt;p&gt;Similar to &lt;code&gt;[patch]&lt;/code&gt;, the &lt;code&gt;[replace]&lt;/code&gt; section lets you swap out a specific version of a dependency. However, it’s more restrictive and generally used in very specific cases, like managing local dependencies. &lt;code&gt;[replace]&lt;/code&gt; should be used cautiously because it can lead to version conflicts.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[replace]
&amp;quot;rand:0.8.3&amp;quot; = { path = &amp;quot;local_path_to_rand&amp;quot; }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, the rand version 0.8.3 dependency is replaced by a local path, allowing you to work with a local copy.&lt;/p&gt;
&lt;h3&gt;&lt;code&gt;[build-dependencies]&lt;/code&gt;: Dependencies for Build Scripts&lt;/h3&gt;
&lt;p&gt;Sometimes, a Rust project needs a custom build script (e.g., build.rs) to generate or process files before compilation. The &lt;code&gt;[build-dependencies]&lt;/code&gt; section is used to specify dependencies required only by the build script, avoiding unnecessary dependencies in the final build.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[build-dependencies]
cc = &amp;quot;1.0&amp;quot;  # Compiler tool for building C dependencies
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, the &lt;code&gt;cc&lt;/code&gt; crate is available only to the &lt;code&gt;build.rs&lt;/code&gt; script, allowing you to compile native code or other build-specific tasks.&lt;/p&gt;
&lt;h3&gt;&lt;code&gt;[badges]&lt;/code&gt;: Adding Metadata for Continuous Integration (CI)&lt;/h3&gt;
&lt;p&gt;Badges provide a way to display status information, such as build status, on your project’s page on crates.io or GitHub. The &lt;code&gt;[badges]&lt;/code&gt; section allows you to define these directly in &lt;code&gt;cargo.toml&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[badges]
travis-ci = { repository = &amp;quot;user/my_project&amp;quot; }
github-actions = { repository = &amp;quot;user/my_project&amp;quot;, branch = &amp;quot;main&amp;quot;, workflow = &amp;quot;CI&amp;quot; }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, badges for Travis CI and GitHub Actions are configured, displaying their status on platforms that support badges.&lt;/p&gt;
&lt;h3&gt;&lt;code&gt;[package.metadata]&lt;/code&gt;: Custom Metadata&lt;/h3&gt;
&lt;p&gt;The [package.metadata] section allows you to add custom fields that are not processed by Cargo itself but can be used by external tools. This is useful for plugins or scripts that require specific information beyond the default Cargo configuration.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[package.metadata]
documentation_url = &amp;quot;https://docs.rs/my_project&amp;quot;
custom_key = &amp;quot;custom_value&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;External tools can read these values to provide custom functionality for your project.&lt;/p&gt;
&lt;h3&gt;Defining &lt;code&gt;build.rs&lt;/code&gt; Scripts&lt;/h3&gt;
&lt;p&gt;If your project requires dynamic configuration, you can create a &lt;code&gt;build.rs&lt;/code&gt; file, which Cargo automatically runs before compiling your project. The &lt;code&gt;build.rs&lt;/code&gt; file can generate code, compile additional resources, or link native libraries. In cargo.toml, dependencies for this script should be listed under &lt;code&gt;[build-dependencies]&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Example &lt;code&gt;build.rs&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-rust&quot;&gt;fn main() {
    println!(&amp;quot;cargo:rustc-link-lib=static=foo&amp;quot;);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This example tells Cargo to link a static library named foo to your project. You can control these instructions via environment variables, allowing your build process to adapt to different platforms.&lt;/p&gt;
&lt;h3&gt;Using &lt;code&gt;[workspace.dependencies]&lt;/code&gt; for Shared Dependencies&lt;/h3&gt;
&lt;p&gt;In a workspace, you may want all crates to use the same version of a shared dependency. You can specify such dependencies in the &lt;code&gt;[workspace.dependencies]&lt;/code&gt; section, making them available to all workspace members.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[workspace.dependencies]
serde = &amp;quot;1.0&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This setting simplifies dependency management across a workspace and ensures that each crate is using the same version of serde, helping to avoid conflicts and maintain consistency.&lt;/p&gt;
&lt;h3&gt;Example of Advanced cargo.toml Configuration&lt;/h3&gt;
&lt;p&gt;Here’s an example that brings together some of these advanced options:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[package]
name = &amp;quot;my_project&amp;quot;
version = &amp;quot;0.1.0&amp;quot;
edition = &amp;quot;2021&amp;quot;

[dependencies]
serde = &amp;quot;1.0&amp;quot;

[build-dependencies]
cc = &amp;quot;1.0&amp;quot;

[patch.crates-io]
serde = { git = &amp;quot;https://github.com/your-fork/serde.git&amp;quot;, branch = &amp;quot;fix-branch&amp;quot; }

[badges]
github-actions = { repository = &amp;quot;user/my_project&amp;quot;, branch = &amp;quot;main&amp;quot;, workflow = &amp;quot;CI&amp;quot; }

[package.metadata]
custom_field = &amp;quot;This is a custom metadata field&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Benefits of Using Advanced Configurations
These advanced configuration options provide you with a wide range of tools to tailor cargo.toml to your project’s specific requirements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dependency Control&lt;/strong&gt;: Patch or replace dependencies to use the exact version or source you need.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Build Flexibility&lt;/strong&gt;: Add custom scripts or compile native dependencies with [build-dependencies].&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enhanced Documentation&lt;/strong&gt;: Use badges to make the project status visible on supported platforms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom Metadata&lt;/strong&gt;: Store additional project-specific information for tools or scripts.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With these configurations, cargo.toml becomes a powerful and flexible tool for managing Rust projects, accommodating both simple setups and complex requirements.&lt;/p&gt;
&lt;h2&gt;Troubleshooting and Best Practices&lt;/h2&gt;
&lt;p&gt;Working with &lt;code&gt;cargo.toml&lt;/code&gt; can be straightforward, but as your project grows, you might encounter common issues or challenges. Here are some troubleshooting tips and best practices to help you manage your &lt;code&gt;cargo.toml&lt;/code&gt; effectively.&lt;/p&gt;
&lt;h3&gt;Common Errors and Solutions&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Dependency Version Conflicts&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When multiple crates depend on different versions of the same dependency, Cargo may not be able to resolve the conflict, leading to a build failure.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Consider using &lt;code&gt;[patch]&lt;/code&gt; to enforce a specific version across your project, or review and align the dependency versions if possible.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[patch.crates-io]
serde = &amp;quot;1.0.104&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Missing or Unsupported Features&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you attempt to enable a feature that doesn’t exist or isn’t compatible with a dependency, Cargo will return an error.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Double-check the available features for each dependency in the documentation. Ensure that you’re spelling the feature name correctly and that it’s supported in the specified version.&lt;/p&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Invalid cargo.toml Syntax&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Sometimes, simple syntax errors in cargo.toml, like missing brackets or commas, can cause parsing issues.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Carefully check your syntax, especially after making edits. Tools like cargo fmt can help with formatting, but a manual review can also catch issues.&lt;/p&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;Feature Flag Conflicts&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Occasionally, enabling multiple features that depend on conflicting dependencies or configurations can lead to errors.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Use Cargo’s conditional compilation to define feature flags carefully. Make sure dependencies don’t conflict, and test combinations of features if your project has multiple optional features.&lt;/p&gt;
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;&lt;strong&gt;Circular Dependencies&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Circular dependencies can happen if crates in a workspace depend on each other in a loop.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Reevaluate the dependency structure of your crates. Consider refactoring shared code into a separate crate that both depend on, rather than forming a circular chain.&lt;/p&gt;
&lt;h3&gt;Best Practices for Managing cargo.toml&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Use Semantic Versioning Thoughtfully&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;When specifying dependency versions, follow semantic versioning principles. For production code, prefer specifying minor and patch versions (e.g., &lt;code&gt;&amp;quot;^1.2.3&amp;quot;&lt;/code&gt; or &lt;code&gt;&amp;quot;~1.2.3&amp;quot;&lt;/code&gt;) to avoid unexpected updates that could introduce breaking changes.&lt;/p&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Leverage Workspaces for Large Projects&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you have a large project with multiple related components, consider organizing it into a workspace. This allows you to manage dependencies centrally, share a build directory, and simplify testing across modules.&lt;/p&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Define Meaningful Features&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Use features to modularize your project and enable or disable components based on project needs. Avoid adding too many features that create complex interdependencies, as this can complicate both code and dependency management.&lt;/p&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;Group Dependencies by Purpose&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Organize dependencies based on their purpose, such as &lt;code&gt;[dependencies]&lt;/code&gt; for core libraries, &lt;code&gt;[dev-dependencies]&lt;/code&gt; for testing tools, and &lt;code&gt;[build-dependencies]&lt;/code&gt; for build scripts. This structure helps keep your project organized and reduces unnecessary bloat in production builds.&lt;/p&gt;
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;&lt;strong&gt;Keep &lt;code&gt;cargo.toml&lt;/code&gt; Clean and Well-Documented&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Use comments to explain any non-standard configurations or complex dependency requirements. This makes it easier for other contributors to understand your &lt;code&gt;cargo.toml&lt;/code&gt; file and for you to maintain it over time.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;# This dependency is only needed for JSON support
serde_json = { version = &amp;quot;1.0&amp;quot;, optional = true }
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;6&quot;&gt;
&lt;li&gt;&lt;strong&gt;Use &lt;code&gt;[workspace.dependencies]&lt;/code&gt; for Consistency&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In workspaces, declare shared dependencies in &lt;code&gt;[workspace.dependencies]&lt;/code&gt; to ensure all crates use the same version. This reduces version conflicts and keeps dependency management consistent across crates.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[workspace.dependencies]
serde = &amp;quot;1.0&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;7&quot;&gt;
&lt;li&gt;&lt;strong&gt;Regularly Update Dependencies&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Rust’s ecosystem evolves quickly, and keeping dependencies up-to-date ensures you benefit from the latest features, bug fixes, and performance improvements. Use cargo update to update your Cargo.lock file and check for the latest versions.&lt;/p&gt;
&lt;ol start=&quot;8&quot;&gt;
&lt;li&gt;&lt;strong&gt;Automate Testing Across Configurations&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If your project uses multiple features, test all feature combinations to ensure compatibility. You can set up &lt;code&gt;CI&lt;/code&gt; (Continuous Integration) workflows to automate this process, making sure your code works across all enabled configurations.&lt;/p&gt;
&lt;h2&gt;Final Thoughts&lt;/h2&gt;
&lt;p&gt;Managing dependencies and configurations with cargo.toml is a powerful way to structure your Rust projects. By following best practices and knowing how to troubleshoot common issues, you can maintain a clean, efficient, and resilient setup. Taking time to organize your cargo.toml file thoughtfully will pay off as your project grows, making it easier to manage and scale in the long run.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Leveraging Python&apos;s Pattern Matching and Comprehensions for Data Analytics</title><link>https://iceberglakehouse.com/posts/2024-11-Python-Analytics-Pattern-Matching/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-11-Python-Analytics-Pattern-Matching/</guid><description>
- [Blog: What is a Data Lakehouse and a Table Format?](https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-f...</description><pubDate>Fri, 01 Nov 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pymatching&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pymatching&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pymatching&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pymatching&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Python stands out as a powerful and versatile tool. Known for its simplicity and readability, Python provides an array of built-in features that make it an ideal language for data manipulation, analysis, and visualization. Among these features, two capabilities: pattern matching and comprehensions, offer significant advantages for transforming and structuring data efficiently.&lt;/p&gt;
&lt;p&gt;Pattern matching, introduced in Python 3.10, allows for more intuitive and readable conditional logic by enabling the matching of complex data structures with minimal code. This feature is particularly useful in data analytics when dealing with diverse data formats, nested structures, or when applying multiple conditional transformations. On the other hand, comprehensions (list, set, and dictionary comprehensions) allow for concise, readable expressions that can filter, transform, and aggregate data on the fly, making repetitive data tasks faster and less error-prone.&lt;/p&gt;
&lt;p&gt;Let&apos;s explore how these two features can help data analysts and engineers write cleaner, faster, and more readable code. We’ll dive into practical examples of how pattern matching and comprehensions can be applied to streamline data processing, showing how they simplify complex tasks and optimize data workflows. By the end, you&apos;ll have a clearer understanding of how these Python features can enhance your data analytics toolkit.&lt;/p&gt;
&lt;h2&gt;Understanding Pattern Matching in Python&lt;/h2&gt;
&lt;p&gt;Pattern matching, introduced with the &lt;code&gt;match&lt;/code&gt; and &lt;code&gt;case&lt;/code&gt; syntax in Python 3.10 (PEP 634), enables cleaner and more readable conditional logic, particularly when handling complex data structures. Unlike traditional &lt;code&gt;if-else&lt;/code&gt; chains, pattern matching lets you define specific patterns that Python will match against, simplifying code that deals with various data formats and nested structures.&lt;/p&gt;
&lt;p&gt;With pattern matching, data analysts can write expressive code to handle different data transformations and formats with minimal boilerplate. For instance, when working with datasets that contain multiple types of values: like dictionaries, nested lists, or JSON objects, pattern matching can help categorize, transform, or validate data based on structure and content.&lt;/p&gt;
&lt;h3&gt;Pattern Matching Use Cases in Data Analytics&lt;/h3&gt;
&lt;p&gt;Here are a few ways pattern matching can benefit data analytics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data Transformation&lt;/strong&gt;: In data workflows, datasets often contain mixed or nested data types. Pattern matching can identify specific structures within a dataset and apply transformations based on those structures, simplifying tasks like type conversions or string manipulations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Handling Nested Data&lt;/strong&gt;: JSON files and nested dictionaries are common in data analytics. Pattern matching enables intuitive unpacking and restructuring of these nested formats, making it easier to extract insights from deeply nested data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Type Checking and Filtering&lt;/strong&gt;: When cleaning data, it’s essential to handle various data types accurately. Pattern matching can be used to check for certain types (e.g., &lt;code&gt;str&lt;/code&gt;, &lt;code&gt;int&lt;/code&gt;, &lt;code&gt;list&lt;/code&gt;) within a dataset, making it easy to filter out unwanted types or process each type differently for validation and transformation.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Practical Applications of Pattern Matching&lt;/h2&gt;
&lt;p&gt;Pattern matching is not only a powerful concept but also extremely practical in real-world data analytics workflows. By matching specific data structures and patterns, it allows analysts to write concise code for tasks like cleaning, categorizing, and transforming data. Let’s explore a few common applications where pattern matching can simplify data processing.&lt;/p&gt;
&lt;h3&gt;Example 1: Data Cleaning with Pattern Matching&lt;/h3&gt;
&lt;p&gt;One of the first steps in any data analytics project is data cleaning. This often involves handling missing values, type mismatches, and incorrect formats. Using pattern matching, you can match specific patterns in your dataset to clean or transform the data accordingly.&lt;/p&gt;
&lt;p&gt;For example, let’s say you have a dataset where certain entries may contain &lt;code&gt;None&lt;/code&gt; values, incorrect date formats, or unexpected data types. Pattern matching enables you to handle each case concisely:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def clean_entry(entry):
    match entry:
        case None:
            return &amp;quot;Missing&amp;quot;
        case str(date) if date.isdigit():
            return f&amp;quot;2023-{date[:2]}-{date[2:]}&amp;quot;  # Convert YYMMDD to YYYY-MM-DD
        case int(value):
            return float(value)  # Convert integers to floats
        case _:
            return entry  # Keep other cases as-is
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, pattern matching simplifies handling different data cases in a single function, reducing the need for multiple if-elif checks.&lt;/p&gt;
&lt;h3&gt;Example 2: Categorizing Data&lt;/h3&gt;
&lt;p&gt;Another useful application of pattern matching is in data categorization. Suppose you have a dataset where each record has a set of attributes that can help classify the data into categories, such as product type, risk level, or customer segment. Pattern matching allows you to classify records based on attribute patterns easily.&lt;/p&gt;
&lt;p&gt;For instance, if you want to categorize customer data based on their spending patterns, you could use pattern matching to define these categories:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def categorize_customer(spending):
    match spending:
        case {&amp;quot;amount&amp;quot;: amount} if amount &amp;gt; 1000:
            return &amp;quot;High spender&amp;quot;
        case {&amp;quot;amount&amp;quot;: amount} if 500 &amp;lt; amount &amp;lt;= 1000:
            return &amp;quot;Medium spender&amp;quot;
        case {&amp;quot;amount&amp;quot;: amount} if amount &amp;lt;= 500:
            return &amp;quot;Low spender&amp;quot;
        case _:
            return &amp;quot;Unknown category&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This approach lets you apply rules-based categorization quickly, making your code more modular and readable.&lt;/p&gt;
&lt;h3&gt;Example 3: Mapping JSON to DataFrames&lt;/h3&gt;
&lt;p&gt;JSON data, often nested and hierarchical, can be challenging to work with directly. Pattern matching makes it easy to traverse and reshape JSON structures, allowing for direct mapping of data into pandas DataFrames. Consider the following example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pandas as pd

def json_to_dataframe(json_data):
    rows = []
    for entry in json_data:
        match entry:
            case {&amp;quot;id&amp;quot;: id, &amp;quot;attributes&amp;quot;: {&amp;quot;name&amp;quot;: name, &amp;quot;value&amp;quot;: value}}:
                rows.append({&amp;quot;ID&amp;quot;: id, &amp;quot;Name&amp;quot;: name, &amp;quot;Value&amp;quot;: value})
            case {&amp;quot;id&amp;quot;: id, &amp;quot;name&amp;quot;: name}:
                rows.append({&amp;quot;ID&amp;quot;: id, &amp;quot;Name&amp;quot;: name, &amp;quot;Value&amp;quot;: None})
            case _:
                pass  # Ignore entries that don&apos;t match any pattern
    return pd.DataFrame(rows)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This function processes JSON entries according to specific patterns and then converts them into a structured DataFrame. Pattern matching ensures only relevant data is extracted, saving time on manual transformations.&lt;/p&gt;
&lt;p&gt;In these examples, pattern matching streamlines data cleaning, categorization, and transformation tasks, making it a valuable tool for any data analyst or engineer. In the next section, we’ll explore comprehensions and how they can further simplify data manipulation tasks.&lt;/p&gt;
&lt;h2&gt;Using List, Set, and Dictionary Comprehensions&lt;/h2&gt;
&lt;p&gt;Comprehensions are one of Python’s most powerful features, allowing for concise, readable expressions that streamline data processing tasks. List, set, and dictionary comprehensions enable analysts to quickly filter, transform, and aggregate data, all within a single line of code. When dealing with large datasets or repetitive transformations, comprehensions can significantly reduce the amount of code you write, making it easier to read and maintain.&lt;/p&gt;
&lt;h3&gt;Use Cases of Comprehensions in Data Analytics&lt;/h3&gt;
&lt;p&gt;Below are some common applications of comprehensions that can greatly enhance your data manipulation workflows.&lt;/p&gt;
&lt;h3&gt;Data Filtering&lt;/h3&gt;
&lt;p&gt;Data filtering is a common task in analytics, especially when removing outliers or isolating records that meet specific criteria. List comprehensions offer a simple way to filter data efficiently. Suppose you have a list of transaction amounts and want to isolate transactions over $500:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;transactions = [100, 250, 600, 1200, 300]
high_value_transactions = [t for t in transactions if t &amp;gt; 500]
# Output: [600, 1200]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This one-liner achieves in a single step what would require several lines of code with a traditional loop. Comprehensions make it easy to quickly filter data without adding much complexity.&lt;/p&gt;
&lt;h3&gt;Data Transformation&lt;/h3&gt;
&lt;p&gt;Transforming data, such as changing formats or applying functions to each element, is another common need. Let’s say you have a list of prices in USD and want to convert them to euros at a rate of 1 USD = 0.85 EUR. List comprehensions allow you to apply the conversion effortlessly:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;prices_usd = [100, 200, 300]
prices_eur = [price * 0.85 for price in prices_usd]
# Output: [85.0, 170.0, 255.0]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This method is not only concise but also efficient, making it ideal for quick transformations across entire datasets.&lt;/p&gt;
&lt;h3&gt;Dictionary Aggregations&lt;/h3&gt;
&lt;p&gt;Comprehensions are also highly effective for aggregating data into dictionaries, which can be helpful for categorizing data or creating quick summaries. For instance, suppose you have a list of tuples containing product names and their sales. You could use a dictionary comprehension to aggregate these into a dictionary format:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;sales_data = [(&amp;quot;Product A&amp;quot;, 30), (&amp;quot;Product B&amp;quot;, 45), (&amp;quot;Product A&amp;quot;, 25)]
sales_summary = {product: sum(sale for p, sale in sales_data if p == product) for product, _ in sales_data}
# Output: {&apos;Product A&apos;: 55, &apos;Product B&apos;: 45}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This comprehension aggregates sales by product, providing a summary of total sales for each product without the need for multiple loops or intermediate data structures.&lt;/p&gt;
&lt;h3&gt;Set Comprehensions for Unique Values&lt;/h3&gt;
&lt;p&gt;If you need to extract unique values from a dataset, set comprehensions provide a quick and clean solution. Imagine you have a dataset with duplicate entries and want a list of unique customer IDs:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;customer_ids = [101, 102, 103, 101, 104, 102]
unique_ids = {id for id in customer_ids}
# Output: {101, 102, 103, 104}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This set comprehension removes duplicates automatically, ensuring that each ID appears only once in the output.&lt;/p&gt;
&lt;h3&gt;Nested Comprehensions for Complex Transformations&lt;/h3&gt;
&lt;p&gt;In some cases, datasets may contain nested structures that require multiple levels of transformation. Nested comprehensions enable you to flatten these structures or apply transformations at each level. For instance, if you have a list of lists representing survey responses and want to normalize the data, you could use nested comprehensions:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;responses = [[5, 4, 3], [3, 5, 4], [4, 4, 5]]
normalized_responses = [[score / 5 for score in response] for response in responses]
# Output: [[1.0, 0.8, 0.6], [0.6, 1.0, 0.8], [0.8, 0.8, 1.0]]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This example applies a transformation to each individual score within the nested lists, enabling a consistent normalization across all responses.&lt;/p&gt;
&lt;p&gt;Comprehensions are powerful tools in any data analyst&apos;s toolkit, providing a quick way to handle repetitive data transformations, filter data, and create summary statistics. In the next section, we’ll explore how to combine pattern matching and comprehensions for even more effective data manipulation workflows.&lt;/p&gt;
&lt;h1&gt;Advanced Examples Combining Pattern Matching and Comprehensions&lt;/h1&gt;
&lt;p&gt;When used together, pattern matching and comprehensions enable even more powerful data manipulation workflows, allowing you to handle complex transformations, analyze nested data structures, and apply conditional logic in a concise, readable way. In this section, we’ll explore some advanced examples that showcase the synergy between these two features.&lt;/p&gt;
&lt;h3&gt;Complex Data Transformations&lt;/h3&gt;
&lt;p&gt;Suppose you have a dataset with different types of records, and you want to perform different transformations based on each record type. By combining pattern matching and comprehensions, you can efficiently categorize and transform each entry in one step.&lt;/p&gt;
&lt;p&gt;For instance, imagine a dataset of mixed records where each entry can be either a number, a list of numbers, or a dictionary with numerical values. Using pattern matching and comprehensions together, you can process this dataset in a single line:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;data = [5, [2, 3, 4], {&amp;quot;value&amp;quot;: 10}, 8, {&amp;quot;value&amp;quot;: 7}, [6, 9]]
transformed_data = [
    value * 2 if isinstance(value, int) else
    [x * 2 for x in value] if isinstance(value, list) else
    value[&amp;quot;value&amp;quot;] * 2 if isinstance(value, dict)
    else value
    for value in data
]
# Output: [10, [4, 6, 8], 20, 16, 14, [12, 18]]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, each type of entry is handled differently using conditional expressions and comprehensions, allowing you to transform mixed data types cleanly.&lt;/p&gt;
&lt;h3&gt;Nested Data Manipulation&lt;/h3&gt;
&lt;p&gt;When dealing with deeply nested data structures like JSON files, combining pattern matching and nested comprehensions can simplify data extraction and transformation. Imagine a dataset where each entry is a nested dictionary containing information about users, including their hobbies. You want to extract and flatten these hobbies for analysis.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;users = [
    {&amp;quot;id&amp;quot;: 1, &amp;quot;info&amp;quot;: {&amp;quot;name&amp;quot;: &amp;quot;Alice&amp;quot;, &amp;quot;hobbies&amp;quot;: [&amp;quot;reading&amp;quot;, &amp;quot;hiking&amp;quot;]}},
    {&amp;quot;id&amp;quot;: 2, &amp;quot;info&amp;quot;: {&amp;quot;name&amp;quot;: &amp;quot;Bob&amp;quot;, &amp;quot;hobbies&amp;quot;: [&amp;quot;cycling&amp;quot;]}},
    {&amp;quot;id&amp;quot;: 3, &amp;quot;info&amp;quot;: {&amp;quot;name&amp;quot;: &amp;quot;Charlie&amp;quot;, &amp;quot;hobbies&amp;quot;: [&amp;quot;music&amp;quot;, &amp;quot;swimming&amp;quot;]}}
]
hobbies_list = [hobby for user in users for hobby in user[&amp;quot;info&amp;quot;][&amp;quot;hobbies&amp;quot;]]
# Output: [&apos;reading&apos;, &apos;hiking&apos;, &apos;cycling&apos;, &apos;music&apos;, &apos;swimming&apos;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we use nested comprehensions to access each user’s hobbies directly, extracting and flattening them into a single list. Combining comprehensions with structured data extraction saves time and simplifies code readability.&lt;/p&gt;
&lt;h3&gt;Applying Conditional Transformations with Minimal Code&lt;/h3&gt;
&lt;p&gt;Sometimes, you may want to apply transformations conditionally, based on data patterns. Let’s say you have a dataset of transactions where each transaction has an amount and a type. Using pattern matching with comprehensions, you can easily apply different transformations based on transaction type.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;transactions = [
    {&amp;quot;type&amp;quot;: &amp;quot;credit&amp;quot;, &amp;quot;amount&amp;quot;: 100},
    {&amp;quot;type&amp;quot;: &amp;quot;debit&amp;quot;, &amp;quot;amount&amp;quot;: 50},
    {&amp;quot;type&amp;quot;: &amp;quot;credit&amp;quot;, &amp;quot;amount&amp;quot;: 200},
    {&amp;quot;type&amp;quot;: &amp;quot;debit&amp;quot;, &amp;quot;amount&amp;quot;: 75}
]
processed_transactions = [
    transaction[&amp;quot;amount&amp;quot;] * 1.05 if transaction[&amp;quot;type&amp;quot;] == &amp;quot;credit&amp;quot; else
    transaction[&amp;quot;amount&amp;quot;] * 0.95
    for transaction in transactions
]
# Output: [105.0, 47.5, 210.0, 71.25]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, credits are increased by 5%, while debits are reduced by 5%. By combining pattern matching logic with comprehensions, you can apply these conditional transformations in a single step, creating a clean, readable transformation pipeline.&lt;/p&gt;
&lt;h3&gt;Summary Statistics Based on Pattern Matches&lt;/h3&gt;
&lt;p&gt;In certain scenarios, you may need to compute statistics based on patterns within your data. Suppose you have a log of events, each with a different status, and you want to calculate the count of each status type. Using pattern matching along with dictionary comprehensions, you can efficiently create a summary of each event type.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;events = [
    {&amp;quot;status&amp;quot;: &amp;quot;success&amp;quot;},
    {&amp;quot;status&amp;quot;: &amp;quot;failure&amp;quot;},
    {&amp;quot;status&amp;quot;: &amp;quot;success&amp;quot;},
    {&amp;quot;status&amp;quot;: &amp;quot;pending&amp;quot;},
    {&amp;quot;status&amp;quot;: &amp;quot;success&amp;quot;},
    {&amp;quot;status&amp;quot;: &amp;quot;failure&amp;quot;}
]

status_counts = {
    status: sum(1 for event in events if event[&amp;quot;status&amp;quot;] == status)
    for status in {event[&amp;quot;status&amp;quot;] for event in events}
}
# Output: {&apos;success&apos;: 3, &apos;failure&apos;: 2, &apos;pending&apos;: 1}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we use a set comprehension to collect unique statuses from the event log. Then, with a dictionary comprehension, we count occurrences of each status type by matching patterns within the dataset. This approach is concise and leverages both comprehensions and pattern-based logic to produce a summary efficiently.&lt;/p&gt;
&lt;h2&gt;Performance Considerations&lt;/h2&gt;
&lt;p&gt;While pattern matching and comprehensions bring efficiency and readability to data processing tasks, it’s essential to consider their performance impact, especially when working with large datasets. Understanding when and how to use these features can help you write optimal code that balances readability with speed.&lt;/p&gt;
&lt;h3&gt;Efficiency of Comprehensions&lt;/h3&gt;
&lt;p&gt;List, set, and dictionary comprehensions are generally faster than traditional loops, as they are optimized at the Python interpreter level. However, when working with very large datasets, you may encounter memory limitations since comprehensions create an entire data structure in memory. In such cases, generator expressions (using parentheses instead of square brackets) can be a memory-efficient alternative, especially when iterating over large data without needing to store all elements at once.&lt;/p&gt;
&lt;p&gt;Example with generator expression:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;large_dataset = range(1_000_000)
# Only processes items one by one, conserving memory
squared_data = (x**2 for x in large_dataset if x % 2 == 0)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using a generator here allows you to process each element on-the-fly without creating a large list in memory, making it ideal for massive datasets.&lt;/p&gt;
&lt;h3&gt;Pattern Matching in Large Datasets&lt;/h3&gt;
&lt;p&gt;Pattern matching is efficient for conditional branching and handling different data structures, but with complex nested data or highly conditional patterns, performance can be impacted. In these cases, try to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Simplify Patterns&lt;/strong&gt;: Use minimal and specific patterns for matches rather than broad cases, as fewer branches improve matching speed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Avoid Deep Nesting&lt;/strong&gt;: Deeply nested patterns can increase matching complexity. When dealing with deeply structured data, consider preprocessing it into a flatter structure if possible.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Batch Processing&lt;/strong&gt;: If you need to match patterns across a large dataset, consider processing data in batches. This approach can prevent excessive memory usage and improve cache efficiency.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Pattern matching is a valuable tool when handling diverse data structures or multiple conditional cases. However, for simpler conditional logic, traditional &lt;code&gt;if-elif&lt;/code&gt; statements may offer better performance. By keeping patterns straightforward and using batch processing when necessary, you can leverage pattern matching effectively even in large datasets.&lt;/p&gt;
&lt;h3&gt;Choosing Between Pattern Matching and Traditional Methods&lt;/h3&gt;
&lt;p&gt;Pattern matching is powerful, but it’s not always the most efficient choice. In scenarios where simple conditionals (&lt;code&gt;if-elif&lt;/code&gt; statements) suffice, traditional methods may be faster due to less overhead. Use pattern matching when you need to handle multiple cases or work with nested structures, but keep simpler constructs for straightforward conditions to maintain speed.&lt;/p&gt;
&lt;h3&gt;Combining Features for Optimal Performance&lt;/h3&gt;
&lt;p&gt;When combining comprehensions and pattern matching, remember:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Limit Data Structure Size&lt;/strong&gt;: Avoid creating large intermediate data structures with comprehensions if they’re not necessary.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Leverage Generators for Streaming Data&lt;/strong&gt;: When processing large datasets with pattern matching, use generators within comprehensions or directly in your pattern-matching logic for memory-efficient processing.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;Pattern matching and comprehensions are powerful features for writing clear and efficient code, but they require mindful usage in performance-critical applications. By understanding how to use these features effectively, data analysts and engineers can maximize their utility while keeping code performance optimal.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Python’s pattern matching and comprehension features provide an efficient way to handle complex data transformations, conditional logic, and data filtering. By leveraging these tools, data analysts and engineers can write cleaner, more concise code that is not only easier to read but also faster to execute in many cases. Pattern matching simplifies handling diverse data structures and nested formats, making it ideal for working with JSON files, dictionaries, and mixed-type records. Meanwhile, comprehensions streamline filtering, transformation, and aggregation tasks, all within single-line expressions.&lt;/p&gt;
&lt;p&gt;When used together, these features enable powerful data manipulation workflows, allowing you to handle large datasets with complex structures or conditional needs effectively. However, as with any tool, it’s essential to consider performance and memory implications, especially when working with very large datasets. By incorporating strategies like generator expressions and batch processing, you can make your pattern matching and comp&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pymatching&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pymatching&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pymatching&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pymatching&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Hands-on with Apache Iceberg &amp; Dremio on Your Laptop within 10 Minutes</title><link>https://iceberglakehouse.com/posts/2024-10-hands-on-with-iceberg-dremio-laptop/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-hands-on-with-iceberg-dremio-laptop/</guid><description>
- [Free Copy of Apache Iceberg the Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_...</description><pubDate>Thu, 31 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberggov&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberggov&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Efficiently managing and analyzing data is essential for business success, and the data lakehouse architecture is leading the way in making this easier and more cost-effective. By combining the flexibility of data lakes with the structured performance of data warehouses, lakehouses offer a powerful solution for data storage, querying, and governance.&lt;/p&gt;
&lt;p&gt;For this hands-on guide, we’ll dive into setting up a data lakehouse on your own laptop in just ten minutes using &lt;strong&gt;Dremio&lt;/strong&gt;, &lt;strong&gt;Nessie&lt;/strong&gt;, and &lt;strong&gt;Apache Iceberg&lt;/strong&gt;. This setup will enable you to perform analytics on your data seamlessly and leverage a versioned, Git-like approach to data management with pre-configured storage buckets for simplicity.&lt;/p&gt;
&lt;h3&gt;Tools We’ll Use:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio&lt;/strong&gt;: A lakehouse platform that organizes, documents, and queries data from databases, data warehouses, data lakes and lakehouse catalogs in a unified semantic layer, providing seamless access to data for analytics and reporting.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Nessie&lt;/strong&gt;: A transactional catalog that enables Git-like branching and merging capabilities for data, allowing for easier experimentation and version control.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Iceberg&lt;/strong&gt;: A data lakehouse table format that turns your data lake into an ACID-compliant structure, supporting operations like time travel, schema evolution, and advanced partitioning.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By the end of this tutorial, you’ll be ready to set up a local lakehouse environment quickly, complete with sample data to explore. Let’s get started and see how easy it can be to work with Dremio and Apache Iceberg on your laptop!&lt;/p&gt;
&lt;h2&gt;Environment Setup&lt;/h2&gt;
&lt;p&gt;Before diving into the data lakehouse setup, let’s ensure your environment is ready. We’ll use &lt;strong&gt;Docker&lt;/strong&gt;, a tool that allows you to run applications in isolated environments called &amp;quot;containers.&amp;quot; If you’re new to Docker, don’t worry: this guide will walk you through each step!&lt;/p&gt;
&lt;h3&gt;Step 1: Install Docker&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Download Docker&lt;/strong&gt;: Go to &lt;a href=&quot;https://www.docker.com/products/docker-desktop/&quot;&gt;docker.com&lt;/a&gt; and download Docker Desktop for your operating system (Windows, macOS, or Linux).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install Docker&lt;/strong&gt;: Follow the installation instructions for your operating system. This will include some on-screen prompts to complete the installation process.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verify Installation&lt;/strong&gt;: After installing Docker, open a terminal (Command Prompt, PowerShell, or a terminal app on Linux/macOS) and type:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;   docker --version
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command should display the version number if Docker is successfully installed.&lt;/p&gt;
&lt;p&gt;Once Docker is installed and running, you’ll have the core tool needed to set up our data lakehouse.&lt;/p&gt;
&lt;h3&gt;Step 2: Create a Docker Compose File&lt;/h3&gt;
&lt;p&gt;With Docker installed, let’s move on to Docker Compose, a tool that helps you define and manage multiple containers with a single configuration file. We’ll use it to set up and start Dremio, Nessie, and MinIO (an S3-compatible storage solution). Docker Compose will also automatically create the storage &amp;quot;buckets&amp;quot; needed in MinIO, so you won’t need to configure them manually.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Open a Text Editor:&lt;/strong&gt; Open any text editor (like VS Code, Notepad, or Sublime Text) and create a new file called docker-compose.yml in a new, empty folder. This file will contain all the configuration needed to launch our environment.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Add the Docker Compose Configuration:&lt;/strong&gt; Copy the following code and paste it into the docker-compose.yml file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;version: &amp;quot;3&amp;quot;

services:
  # Nessie Catalog Server Using In-Memory Store
  nessie:
    image: projectnessie/nessie:latest
    container_name: nessie
    networks:
      - iceberg
    ports:
      - 19120:19120
  # MinIO Storage Server
  ## Creates two buckets named lakehouse and lake
  ## tail -f /dev/null is to keep the container running
  minio:
    image: minio/minio:latest
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
    networks:
      - iceberg
    ports:
      - 9001:9001
      - 9000:9000
    command: [&amp;quot;server&amp;quot;, &amp;quot;/data&amp;quot;, &amp;quot;--console-address&amp;quot;, &amp;quot;:9001&amp;quot;]
    entrypoint: &amp;gt;
      /bin/sh -c &amp;quot;
      minio server /data --console-address &apos;:9001&apos; &amp;amp;
      sleep 5 &amp;amp;&amp;amp;
      mc alias set myminio http://localhost:9000 admin password &amp;amp;&amp;amp;
      mc mb myminio/lakehouse &amp;amp;&amp;amp;
      mc mb myminio/lake &amp;amp;&amp;amp;
      tail -f /dev/null
      &amp;quot;
  # Dremio
  dremio:
    platform: linux/x86_64
    image: dremio/dremio-oss:latest
    ports:
      - 9047:9047
      - 31010:31010
      - 32010:32010
    container_name: dremio
    environment:
      - DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dpaths.dist=file:///opt/dremio/data/dist
    networks:
      - iceberg

networks:
  iceberg:
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Explanation of the Code:&lt;/h3&gt;
&lt;p&gt;This file defines three services:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;nessie (the catalog)&lt;/li&gt;
&lt;li&gt;minio (the storage server)&lt;/li&gt;
&lt;li&gt;dremio (the query engine).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each service has specific network settings, ports, and configurations to allow them to communicate with each other.&lt;/p&gt;
&lt;h3&gt;Step 3: Start Your Environment&lt;/h3&gt;
&lt;p&gt;With your docker-compose.yml file saved, it’s time to start your data lakehouse environment!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Open a Terminal:&lt;/strong&gt; Navigate to the folder where you saved the docker-compose.yml file.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Run Docker Compose:&lt;/strong&gt; In your terminal, type:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose up -d
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command tells Docker to start each of the services specified in docker-compose.yml and run them in the background (the -d flag).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Wait for Setup to Complete:&lt;/strong&gt; It may take a few minutes for all services to start. You’ll see a lot of text in your terminal as each service starts up. When you see lines indicating that each service is &amp;quot;running,&amp;quot; the setup is complete.&lt;/p&gt;
&lt;h3&gt;Step 4: Verify Each Service is Running&lt;/h3&gt;
&lt;p&gt;Now that the environment is up, let’s verify that each service is accessible:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio:&lt;/strong&gt; Open a web browser and go to http://localhost:9047. You should see a Dremio login screen.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MinIO:&lt;/strong&gt; In a new browser tab, go to http://localhost:9001. Log in with the username admin and password password. You should see the MinIO console, where you can view storage &amp;quot;buckets.&amp;quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 5: Optional - Shutting Down the Environment&lt;/h3&gt;
&lt;p&gt;When you’re done with the setup and want to stop the services, simply open a terminal in the same folder where you created the &lt;code&gt;docker-compose.yml&lt;/code&gt; file and run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose down -v
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command will stop and remove all containers, so you can start fresh next time.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;-v&lt;/code&gt; flag removes any volumes associated with the containers, which is important if you want to start fresh next time.&lt;/p&gt;
&lt;p&gt;Congratulations! You now have a fully functional data lakehouse environment running on your laptop. In the next section, we’ll connect Dremio to Nessie and MinIO and start creating and querying tables.&lt;/p&gt;
&lt;h2&gt;Getting Started with Dremio: Connecting the Nessie and MinIO Sources&lt;/h2&gt;
&lt;p&gt;Now that Dremio is up and running, let&apos;s connect it to our MinIO buckets, &lt;code&gt;lakehouse&lt;/code&gt; and &lt;code&gt;lake&lt;/code&gt;, which will act as the main data sources in our local lakehouse environment. This section will guide you through connecting both the Nessie catalog (using the &lt;code&gt;lakehouse&lt;/code&gt; bucket) and a general S3-like data lake connection (using the &lt;code&gt;lake&lt;/code&gt; bucket) in Dremio.&lt;/p&gt;
&lt;h3&gt;Step 1: Adding the Nessie Source in Dremio&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Open Dremio&lt;/strong&gt;: In your web browser, navigate to &lt;a href=&quot;http://localhost:9047&quot;&gt;http://localhost:9047&lt;/a&gt; to access the Dremio UI.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Add the Nessie Source&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Click on the &lt;strong&gt;&amp;quot;Add Source&amp;quot;&lt;/strong&gt; button in the bottom left corner of the Dremio interface.&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;Nessie&lt;/strong&gt; from the list of available sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configure the Nessie Source&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You’ll need to fill out both the &lt;strong&gt;General&lt;/strong&gt; and &lt;strong&gt;Storage&lt;/strong&gt; settings as follows:&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;General Settings&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name&lt;/strong&gt;: Set the source name to &lt;code&gt;lakehouse&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Endpoint URL&lt;/strong&gt;: Enter the Nessie API endpoint URL:&lt;pre&gt;&lt;code&gt;http://nessie:19120/api/v2
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authentication&lt;/strong&gt;: Select &lt;strong&gt;None&lt;/strong&gt; (no additional credentials are required).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Storage Settings&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Access Key&lt;/strong&gt;: Set to &lt;code&gt;admin&lt;/code&gt; (MinIO username).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secret Key&lt;/strong&gt;: Set to &lt;code&gt;password&lt;/code&gt; (MinIO password).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Root Path&lt;/strong&gt;: Set to &lt;code&gt;lakehouse&lt;/code&gt; (this is the bucket where our Iceberg tables will be stored).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;fs.s3a.path.style.access&lt;/strong&gt;: Set this to &lt;code&gt;true&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;fs.s3a.endpoint&lt;/strong&gt;: Set to &lt;code&gt;minio:9000&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dremio.s3.compat&lt;/strong&gt;: Set to &lt;code&gt;true&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Encrypt Connection&lt;/strong&gt;: Uncheck this option since we’re running Nessie locally on HTTP.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Save the Source&lt;/strong&gt;: Once all settings are configured, click &lt;strong&gt;Save&lt;/strong&gt;. The &lt;code&gt;lakehouse&lt;/code&gt; source will now be connected in Dremio, allowing you to browse and query tables stored in the Nessie catalog.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 2: Adding MinIO as an S3 Source in Dremio (Data Lake Connection)&lt;/h3&gt;
&lt;p&gt;In addition to Nessie, we’ll set up a general-purpose data lake connection using the &lt;code&gt;lake&lt;/code&gt; bucket in MinIO. This bucket can store non-Iceberg table data, making it suitable for raw data or other types of files. So if you wanted to upload CSV, JSON, XLS or Parquet files you can put them in the &amp;quot;lake&amp;quot; bucket and view them from this source in Dremio.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Add an S3 Source&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Click the &lt;strong&gt;&amp;quot;Add Source&amp;quot;&lt;/strong&gt; button again and select &lt;strong&gt;S3&lt;/strong&gt; from the list of sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configure the S3 Source for MinIO&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use the following settings to connect the &lt;code&gt;lake&lt;/code&gt; bucket as a secondary source.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;General Settings&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name&lt;/strong&gt;: Set the source name to &lt;code&gt;lake&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Credentials&lt;/strong&gt;: Choose &lt;strong&gt;AWS access key&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Access Key&lt;/strong&gt;: Set to &lt;code&gt;admin&lt;/code&gt; (MinIO username).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secret Key&lt;/strong&gt;: Set to &lt;code&gt;password&lt;/code&gt; (MinIO password).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Encrypt Connection&lt;/strong&gt;: Uncheck this option since MinIO is running locally.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Advanced Options&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Enable Compatibility Mode&lt;/strong&gt;: Set to &lt;code&gt;true&lt;/code&gt; to ensure compatibility with MinIO.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Root Path&lt;/strong&gt;: Set to &lt;code&gt;/lake&lt;/code&gt; (the bucket name for general storage).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;fs.s3a.path.style.access&lt;/strong&gt;: Set this to &lt;code&gt;true&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;fs.s3a.endpoint&lt;/strong&gt;: Set to &lt;code&gt;minio:9000&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Save the Source&lt;/strong&gt;: After filling out the configuration, click &lt;strong&gt;Save&lt;/strong&gt;. The &lt;code&gt;lake&lt;/code&gt; bucket is now accessible in Dremio, and you can query the raw data stored in this bucket.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Next Steps&lt;/h3&gt;
&lt;p&gt;With both sources connected, you now have access to structured, versioned data in the &lt;code&gt;lakehouse&lt;/code&gt; bucket and general-purpose data in the &lt;code&gt;lake&lt;/code&gt; bucket. In the next section, we’ll explore creating and querying Apache Iceberg tables in Dremio to see how easy it is to get started with data lakehouse workflows.&lt;/p&gt;
&lt;h2&gt;Running Transactions on Apache Iceberg Tables and Inspecting the Storage&lt;/h2&gt;
&lt;p&gt;With our environment set up and sources connected, we’re ready to perform some transactions on an Apache Iceberg table in Dremio. After creating and inserting data, we’ll inspect MinIO to see how Dremio stores files in the &lt;code&gt;lakehouse&lt;/code&gt; bucket. Additionally, we’ll make a &lt;code&gt;curl&lt;/code&gt; request to Nessie to check the catalog state, confirming our transactions.&lt;/p&gt;
&lt;h3&gt;Step 1: Creating an Iceberg Table in Dremio&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Open the SQL Editor&lt;/strong&gt; in Dremio:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In the Dremio UI, select &lt;strong&gt;SQL Runner&lt;/strong&gt; from the menu on the left.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Set the Context to Nessie&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In the SQL editor, click on &lt;strong&gt;Context&lt;/strong&gt; (top right of the editor) and set it to our Nessie source &lt;code&gt;lakehouse&lt;/code&gt;. If you don&apos;t do this then you&apos;ll need to include fully qualified table names in your queries like &lt;code&gt;lakehouse.customers&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create an Iceberg Table&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run the following SQL to create a new table named &lt;code&gt;customers&lt;/code&gt; in the &lt;code&gt;lakehouse&lt;/code&gt; bucket:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE customers (
  id INT,
  first_name VARCHAR,
  last_name VARCHAR,
  age INT
) PARTITION BY (truncate(1, last_name));
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This SQL creates an Apache Iceberg table with a partition on the first letter of &lt;code&gt;last_name&lt;/code&gt;. The partitioning is handled by Apache Iceberg’s &lt;strong&gt;Hidden Partitioning&lt;/strong&gt; feature, which allows for advanced partitioning without additional columns in the schema.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Insert Data into the Table&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Now, add some sample data to the &lt;code&gt;customers&lt;/code&gt; table:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO customers (id, first_name, last_name, age) VALUES
(1, &apos;John&apos;, &apos;Doe&apos;, 28),
(2, &apos;Jane&apos;, &apos;Smith&apos;, 34),
(3, &apos;Alice&apos;, &apos;Johnson&apos;, 22),
(4, &apos;Bob&apos;, &apos;Williams&apos;, 45),
(5, &apos;Charlie&apos;, &apos;Brown&apos;, 30);
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This will insert five records into the &lt;code&gt;customers&lt;/code&gt; table, each automatically stored and partitioned in the &lt;code&gt;lakehouse&lt;/code&gt; bucket.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 2: Inspecting Files in MinIO&lt;/h3&gt;
&lt;p&gt;With data inserted into the &lt;code&gt;customers&lt;/code&gt; table, let’s take a look at MinIO to verify the files were created as expected.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Open MinIO&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Go to &lt;a href=&quot;http://localhost:9001&quot;&gt;http://localhost:9001&lt;/a&gt; in your browser, and log in with:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Username&lt;/strong&gt;: &lt;code&gt;admin&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Password&lt;/strong&gt;: &lt;code&gt;password&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Navigate to the &lt;code&gt;lakehouse&lt;/code&gt; Bucket&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;From the MinIO dashboard, click on &lt;strong&gt;Buckets&lt;/strong&gt; and select the &lt;code&gt;lakehouse&lt;/code&gt; bucket.&lt;/li&gt;
&lt;li&gt;Inside the &lt;code&gt;lakehouse&lt;/code&gt; bucket, you should see a directory for the &lt;code&gt;customers&lt;/code&gt; table.&lt;/li&gt;
&lt;li&gt;Browse through the folders to locate the partitioned files based on the &lt;code&gt;last_name&lt;/code&gt; column. You’ll find subfolders that store the data by partition, along with metadata files that track the state of the table.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This inspection verifies that Dremio is writing data to the &lt;code&gt;lakehouse&lt;/code&gt; bucket in Apache Iceberg format, which organizes the data into Parquet files and metadata files.&lt;/p&gt;
&lt;h3&gt;Step 3: Checking the State of the Nessie Catalog with &lt;code&gt;curl&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Now, let’s make a &lt;code&gt;curl&lt;/code&gt; request to the Nessie catalog to confirm that the &lt;code&gt;customers&lt;/code&gt; table was created successfully and that its metadata is stored correctly.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Open a Terminal&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In your terminal, run the following command to view the contents of the main branch in Nessie:&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -X GET &amp;quot;http://localhost:19120/api/v2/trees/main/entries&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This command retrieves a list of all entries (tables) in the &lt;code&gt;main&lt;/code&gt; branch of the Nessie catalog.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Review the Response&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The JSON response will contain details about the &lt;code&gt;customers&lt;/code&gt; table. You should see an entry indicating the presence of &lt;code&gt;customers&lt;/code&gt; in the catalog, confirming that the table is tracked in Nessie.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Inspect Specific Commit History (Optional)&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;To view the specific commit history for transactions on this branch, you can run:&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -X GET &amp;quot;http://localhost:19120/api/v2/trees/tree/main/log&amp;quot; \
     -H &amp;quot;Content-Type: application/json&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This command shows a log of all changes made on the &lt;code&gt;main&lt;/code&gt; branch, providing a Git-like commit history for your data transactions.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Next Steps&lt;/h3&gt;
&lt;p&gt;Now that you have verified your transactions and inspected the storage, you can confidently work with Apache Iceberg tables in Dremio, knowing that both the data and metadata are tracked in the Nessie catalog and accessible in MinIO. In the next section, we’ll explore making additional table modifications, like updating partitioning rules, and see how Apache Iceberg handles these changes seamlessly.&lt;/p&gt;
&lt;h2&gt;Modifying the Apache Iceberg Table Schema and Partitioning&lt;/h2&gt;
&lt;p&gt;With our initial &lt;code&gt;customers&lt;/code&gt; table set up in Dremio, we can take advantage of Apache Iceberg’s flexibility to make schema and partition modifications without requiring a data rewrite. In this section, we’ll add a new column to the table, adjust partitioning, and observe how these changes reflect in MinIO and the Nessie catalog.&lt;/p&gt;
&lt;h3&gt;Step 1: Adding a New Column&lt;/h3&gt;
&lt;p&gt;Suppose we want to add a new column to store customer email addresses. We can easily update the table schema with the following &lt;code&gt;ALTER TABLE&lt;/code&gt; statement:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Open the SQL Editor&lt;/strong&gt; in Dremio:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Navigate back to the &lt;strong&gt;SQL Runner&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Add the Column&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run the following SQL to add an &lt;code&gt;email&lt;/code&gt; column to the &lt;code&gt;customers&lt;/code&gt; table:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE customers
ADD COLUMNS (email VARCHAR);
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This command adds the &lt;code&gt;email&lt;/code&gt; column to the existing table without affecting the existing data.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Verify the Column Addition&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;After running the command, you can confirm the addition by querying the &lt;code&gt;customers&lt;/code&gt; table in Dremio:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM customers;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;You’ll see an &lt;code&gt;email&lt;/code&gt; column now appears, ready for data to be added.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 2: Updating Partitioning Rules&lt;/h3&gt;
&lt;p&gt;Iceberg allows for flexible partitioning rules through &lt;strong&gt;Partition Evolution&lt;/strong&gt;, meaning we can change how data is partitioned without rewriting all existing data. Let’s add a new partition rule that organizes data based on the first letter of the &lt;code&gt;first_name&lt;/code&gt; as well.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Add a Partition Field&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;To partition data by the first letter of &lt;code&gt;first_name&lt;/code&gt;, use the following SQL:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE customers
ADD PARTITION FIELD truncate(1, first_name);
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This command instructs Iceberg to partition any new data by both the first letters of &lt;code&gt;last_name&lt;/code&gt; and &lt;code&gt;first_name&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Insert Additional Data to Test the New Partitioning&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Let’s insert some more records to see how the new partition structure organizes the data:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO customers (id, first_name, last_name, age, email) VALUES
(6, &apos;Emily&apos;, &apos;Adams&apos;, 29, &apos;emily.adams@example.com&apos;),
(7, &apos;Frank&apos;, &apos;Baker&apos;, 35, &apos;frank.baker@example.com&apos;),
(8, &apos;Grace&apos;, &apos;Clark&apos;, 41, &apos;grace.clark@example.com&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This data will be partitioned according to both &lt;code&gt;first_name&lt;/code&gt; and &lt;code&gt;last_name&lt;/code&gt;, following the new rules we set.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 3: Inspect the New Partitions in MinIO&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Open MinIO&lt;/strong&gt; and navigate to the &lt;code&gt;lakehouse&lt;/code&gt; bucket:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Go to &lt;a href=&quot;http://localhost:9001&quot;&gt;http://localhost:9001&lt;/a&gt;, and log in with:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Username&lt;/strong&gt;: &lt;code&gt;admin&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Password&lt;/strong&gt;: &lt;code&gt;password&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Locate the Updated &lt;code&gt;customers&lt;/code&gt; Folder&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Within the &lt;code&gt;lakehouse&lt;/code&gt; bucket, locate the &lt;code&gt;customers&lt;/code&gt; table folder.&lt;/li&gt;
&lt;li&gt;Open the folder structure to view the newly created subfolders, representing the partitioning by &lt;code&gt;last_name&lt;/code&gt; and &lt;code&gt;first_name&lt;/code&gt; that we configured. You should see the additional folders and Parquet files for each new partition based on &lt;code&gt;first_name&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 4: Confirm the Changes in Nessie with &lt;code&gt;curl&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Finally, let’s make a &lt;code&gt;curl&lt;/code&gt; request to the Nessie catalog to verify that the schema and partitioning changes are recorded in the catalog’s metadata.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Open a Terminal&lt;/strong&gt; and run the following command to check the schema:
&lt;code&gt;bash curl -X GET &amp;quot;http://localhost:19120/api/v2/trees/main/history&amp;quot; &lt;/code&gt;
This will return a JSON response listing the recent commits to the &lt;code&gt;main&lt;/code&gt; branch, including the schema and partitioning updates.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;We’ve successfully modified the schema and partitioning of an Apache Iceberg table in Dremio, and we can observe these changes directly in MinIO’s file structure and the Nessie catalog’s metadata. This example demonstrates the flexibility of Iceberg in managing evolving data schemas and partitioning strategies in real-time, without requiring downtime or data rewrites. In the next section, we’ll explore how to utilize Iceberg’s version control capabilities for branching and merging datasets within the Nessie catalog.&lt;/p&gt;
&lt;h2&gt;Branching and Merging with Nessie: Version Control for Data&lt;/h2&gt;
&lt;p&gt;One of the powerful features of using Nessie with Apache Iceberg is its Git-like branching and merging functionality. Branching allows you to create isolated environments for data modifications, which can then be merged back into the main branch once verified. This section will walk you through creating a branch, performing data modifications within that branch, and then merging those changes back to the main branch.&lt;/p&gt;
&lt;h3&gt;Step 1: Creating a Branch&lt;/h3&gt;
&lt;p&gt;Let’s start by creating a new branch in Nessie. This branch will allow us to perform data transactions without impacting the main data branch, ideal for testing and experimenting.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Open the SQL Editor&lt;/strong&gt; in Dremio.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create a New Branch&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run the following SQL to create a new branch named &lt;code&gt;development&lt;/code&gt; in the &lt;code&gt;lakehouse&lt;/code&gt; catalog:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE BRANCH development IN lakehouse;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This command creates a new branch in the Nessie catalog, providing an isolated environment for data changes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Switch to the Development Branch&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Now, let’s set our context to the &lt;code&gt;development&lt;/code&gt; branch either using the context selector or using the following sql before any queries so that any changes we make only affect this branch:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;USE BRANCH development IN lakehouse;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 2: Performing Data Modifications on the Branch&lt;/h3&gt;
&lt;p&gt;With the &lt;code&gt;development&lt;/code&gt; branch active, let’s modify the &lt;code&gt;customers&lt;/code&gt; table by adding new data. This data will remain isolated on the &lt;code&gt;development&lt;/code&gt; branch until we choose to merge it back to &lt;code&gt;main&lt;/code&gt;.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Insert Additional Records&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run the following SQL to add new entries to the &lt;code&gt;customers&lt;/code&gt; table (make sure to either use the context selector or use the &lt;code&gt;use branch&lt;/code&gt; sql before any queries so that any changes we make only affect this branch):&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO customers (id, first_name, last_name, age, email) VALUES
(9, &apos;Holly&apos;, &apos;Grant&apos;, 31, &apos;holly.grant@example.com&apos;),
(10, &apos;Ian&apos;, &apos;Young&apos;, 27, &apos;ian.young@example.com&apos;),
(11, &apos;Jack&apos;, &apos;Diaz&apos;, 39, &apos;jack.diaz@example.com&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;These records are added to the &lt;code&gt;customers&lt;/code&gt; table on the &lt;code&gt;development&lt;/code&gt; branch only, meaning they won’t affect the main branch until merged.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Verify the Records in the Development Branch&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You can verify the new records by running:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM customers AT BRANCH development;
SELECT * FROM customers AT BRANCH main;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This query will display the data, including the recently inserted records, as it is within the context of the &lt;code&gt;development&lt;/code&gt; and &lt;code&gt;main&lt;/code&gt; branches.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 3: Merging Changes Back to the Main Branch&lt;/h3&gt;
&lt;p&gt;Once satisfied with the changes in &lt;code&gt;development&lt;/code&gt;, we can merge the &lt;code&gt;development&lt;/code&gt; branch back into &lt;code&gt;main&lt;/code&gt;, making these records available to all users accessing the main branch.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Switch to the Main Branch&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;First, change the context back to the &lt;code&gt;main&lt;/code&gt; branch:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;USE BRANCH main IN lakehouse;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Merge the Development Branch&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Now, merge the &lt;code&gt;development&lt;/code&gt; branch into &lt;code&gt;main&lt;/code&gt; using the following SQL:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;MERGE BRANCH development INTO main IN lakehouse;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This command brings all changes from &lt;code&gt;development&lt;/code&gt; into &lt;code&gt;main&lt;/code&gt;, adding the new records to the main version of the &lt;code&gt;customers&lt;/code&gt; table.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Verify the Merge&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;To confirm the records are now in &lt;code&gt;main&lt;/code&gt;, run:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM customers AT BRANCH main;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;You should see all records, including those added in the &lt;code&gt;development&lt;/code&gt; branch, are now present in the &lt;code&gt;main&lt;/code&gt; branch.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 4: Verifying the Branching Activity in Nessie with &lt;code&gt;curl&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;You can use &lt;code&gt;curl&lt;/code&gt; commands to check the branch status and view commit logs in Nessie, providing additional validation of the branching and merging activity.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;List Branches&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run the following &lt;code&gt;curl&lt;/code&gt; command to list all branches in the &lt;code&gt;lakehouse&lt;/code&gt; catalog:&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -X GET &amp;quot;http://localhost:19120/api/v2/trees/&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;The response will include the &lt;code&gt;main&lt;/code&gt; and &lt;code&gt;development&lt;/code&gt; branches, confirming the branch creation.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Check the Commit Log&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;To view a log of commits, including the merge from &lt;code&gt;development&lt;/code&gt; to &lt;code&gt;main&lt;/code&gt;, run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -X GET &amp;quot;http://localhost:19120/api/v2/trees/main/history&amp;quot;

curl -X GET &amp;quot;http://localhost:19120/api/v2/trees/development/history&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;This log will show each commit, giving you a clear view of data versioning over time.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;Branching and merging in Nessie allows you to safely experiment with data modifications in an isolated environment, integrating those changes back into the main dataset only when ready. This workflow is invaluable for testing data updates, creating data snapshots, or managing changes for compliance purposes. In the next section, we’ll explore how to use Nessie tags to mark important states in your data, further enhancing data version control.&lt;/p&gt;
&lt;h2&gt;Tagging Important States with Nessie: Creating Data Snapshots&lt;/h2&gt;
&lt;p&gt;In addition to branching, Nessie also offers the ability to tag specific states of your data, making it easy to create snapshots at critical moments. Tags allow you to mark key data versions: such as a quarterly report cutoff or pre-migration data state, so you can refer back to them if needed.&lt;/p&gt;
&lt;p&gt;In this section, we’ll walk through creating tags in Nessie to capture the current state of the data and explore how to use tags for historical analysis or recovery.&lt;/p&gt;
&lt;h3&gt;Step 1: Creating a Tag&lt;/h3&gt;
&lt;p&gt;Let’s create a tag on the &lt;code&gt;main&lt;/code&gt; branch to mark an important point in the dataset, such as the completion of initial data loading. This tag will serve as a snapshot that we can return to if necessary.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Open the SQL Editor&lt;/strong&gt; in Dremio.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Create a Tag&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Run the following SQL command to create a tag called &lt;code&gt;initial_load&lt;/code&gt; on the &lt;code&gt;main&lt;/code&gt; branch:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TAG initial_load AT BRANCH main IN lakehouse;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This tag marks the state of all tables in the &lt;code&gt;lakehouse&lt;/code&gt; catalog on the &lt;code&gt;main&lt;/code&gt; branch at the current moment, capturing the data exactly as it is now.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 2: Modifying the Data on the Main Branch&lt;/h3&gt;
&lt;p&gt;To understand the usefulness of tags, let’s make a few changes to the &lt;code&gt;customers&lt;/code&gt; table on the &lt;code&gt;main&lt;/code&gt; branch. Later, we can use the tag to compare or even restore to the original dataset state if needed.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Insert Additional Records&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Add some new data to the &lt;code&gt;customers&lt;/code&gt; table to simulate further data processing:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO customers (id, first_name, last_name, age, email) VALUES
(12, &apos;Kate&apos;, &apos;Morgan&apos;, 45, &apos;kate.morgan@example.com&apos;),
(13, &apos;Luke&apos;, &apos;Rogers&apos;, 33, &apos;luke.rogers@example.com&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Verify Changes&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run the following query to confirm that the new records have been added:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM customers;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 3: Accessing Data from a Specific Tag&lt;/h3&gt;
&lt;p&gt;Tags in Nessie allow you to view the dataset as it was at the time the tag was created. To access the data at the &lt;code&gt;initial_load&lt;/code&gt; state, we can specify the tag as the reference point in our queries.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Query the Data Using the Tag&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use the following SQL command to switch to the &lt;code&gt;initial_load&lt;/code&gt; tag and view the dataset as it was at that point:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;USE TAG initial_load IN lakehouse;
SELECT * FROM customers;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This query will display the &lt;code&gt;customers&lt;/code&gt; table as it was when the &lt;code&gt;initial_load&lt;/code&gt; tag was created, without the new records that were added afterward.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Return to the Main Branch&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Once you are done exploring the &lt;code&gt;initial_load&lt;/code&gt; state, switch back to the &lt;code&gt;main&lt;/code&gt; branch to continue working with the latest data:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;USE BRANCH main IN lakehouse;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 4: Verifying the Tag Creation with &lt;code&gt;curl&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;To verify the tag’s existence in the Nessie catalog, we can make a &lt;code&gt;curl&lt;/code&gt; request to list all tags, including &lt;code&gt;initial_load&lt;/code&gt;.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;List Tags&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run the following &lt;code&gt;curl&lt;/code&gt; command to retrieve all tags in the &lt;code&gt;lakehouse&lt;/code&gt; catalog:&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -X GET &amp;quot;http://localhost:19120/api/v2/trees/tags&amp;quot; \
     -H &amp;quot;Content-Type: application/json&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;The JSON response will list all tags, including the &lt;code&gt;initial_load&lt;/code&gt; tag you created.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Review Tag Details&lt;/strong&gt; (Optional):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;To get detailed information about the &lt;code&gt;initial_load&lt;/code&gt; tag, including its exact commit reference, you can use:&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -X GET &amp;quot;http://localhost:19120/api/v2/trees/tags/initial_load&amp;quot; \
     -H &amp;quot;Content-Type: application/json&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;Tags in Nessie provide a reliable way to snapshot important states of your data. By creating tags at critical points, you can easily access previous states of your data, helping to support data auditing, historical reporting, and data recovery. In the next section, we’ll cover querying the Apache Iceberg Metadata tables.&lt;/p&gt;
&lt;h2&gt;Exploring Iceberg Metadata Tables in Dremio&lt;/h2&gt;
&lt;p&gt;Iceberg metadata tables offer insights into the underlying structure and evolution of your data. These tables contain information about data files, snapshots, partition details, and more, allowing you to track changes, troubleshoot issues, and optimize queries. Dremio makes querying Iceberg metadata simple, giving you valuable context on your data lakehouse.&lt;/p&gt;
&lt;p&gt;In this section, we’ll explore the following Iceberg metadata tables:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;table_files&lt;/code&gt;: Lists data files and their statistics.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;table_history&lt;/code&gt;: Displays historical snapshots.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;table_manifests&lt;/code&gt;: Shows metadata about manifest files.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;table_partitions&lt;/code&gt;: Provides details on partitions.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;table_snapshot&lt;/code&gt;: Shows information on each snapshot.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 1: Querying Data File Metadata with &lt;code&gt;table_files&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;table_files&lt;/code&gt; metadata table provides details on each data file in the table, such as the file path, size, record count, and more. This is useful for understanding storage distribution and optimizing queries.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Query the Data Files&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Run the following SQL command to retrieve data file information for the &lt;code&gt;customers&lt;/code&gt; table:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM TABLE(table_files(&apos;customers&apos;));
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;You’ll see results with columns like &lt;code&gt;file_path&lt;/code&gt;, &lt;code&gt;file_size_in_bytes&lt;/code&gt;, &lt;code&gt;record_count&lt;/code&gt;, and more, giving insights into each file&apos;s specifics.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 2: Exploring Table History with &lt;code&gt;table_history&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Iceberg tracks the history of a table’s snapshots, which allows you to review past states or even perform time-travel queries. The &lt;code&gt;table_history&lt;/code&gt; table displays each snapshot’s ID and timestamp.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Query the Table History&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Use the following SQL to retrieve the history of the &lt;code&gt;customers&lt;/code&gt; table:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM TABLE(table_history(&apos;customers&apos;));
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This query will return a list of snapshots, showing when each snapshot was created (&lt;code&gt;made_current_at&lt;/code&gt;), the &lt;code&gt;snapshot_id&lt;/code&gt;, and any &lt;code&gt;parent_id&lt;/code&gt; linking to previous snapshots.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 3: Analyzing Manifests with &lt;code&gt;table_manifests&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Manifest files are metadata files in Iceberg that track changes in data files. The &lt;code&gt;table_manifests&lt;/code&gt; table lets you inspect details like the number of files added or removed per snapshot, helping you monitor data evolution and resource usage.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Query Manifest Metadata&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Run the following SQL to view manifest metadata for the &lt;code&gt;customers&lt;/code&gt; table:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM TABLE(table_manifests(&apos;customers&apos;));
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;The results will include fields like &lt;code&gt;path&lt;/code&gt;, &lt;code&gt;added_data_files_count&lt;/code&gt;, and &lt;code&gt;deleted_data_files_count&lt;/code&gt;, which show how each manifest contributes to the table’s state.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 4: Reviewing Partition Information with &lt;code&gt;table_partitions&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;table_partitions&lt;/code&gt; table provides details on each partition in the table, including the number of records and files in each partition. This helps with understanding how data is distributed across partitions and can be used to fine-tune partitioning strategies.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Query Partition Statistics&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Run the following query to get partition statistics for the &lt;code&gt;customers&lt;/code&gt; table:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM TABLE(table_partitions(&apos;customers&apos;));
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;You’ll see fields such as &lt;code&gt;partition&lt;/code&gt;, &lt;code&gt;record_count&lt;/code&gt;, and &lt;code&gt;file_count&lt;/code&gt;, which show the breakdown of data across partitions, helping identify skewed partitions or performance bottlenecks.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 5: Examining Snapshots with &lt;code&gt;table_snapshot&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;table_snapshot&lt;/code&gt; table provides a summary of each snapshot, including the operation (e.g., &lt;code&gt;append&lt;/code&gt;, &lt;code&gt;overwrite&lt;/code&gt;), the commit timestamp, and any manifest files associated with the snapshot.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Query Snapshot Information&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Run the following SQL to see snapshot details for the &lt;code&gt;customers&lt;/code&gt; table:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM TABLE(table_snapshot(&apos;customers&apos;));
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;The result will include fields like &lt;code&gt;committed_at&lt;/code&gt;, &lt;code&gt;operation&lt;/code&gt;, and &lt;code&gt;summary&lt;/code&gt;, providing a high-level view of each snapshot and its impact on the table.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Using Metadata for Time-Travel Queries&lt;/h3&gt;
&lt;p&gt;The Iceberg metadata tables also support time-travel queries, enabling you to query the data as it was at a specific snapshot or timestamp. This can be especially useful for auditing, troubleshooting, or recreating analysis from past data states.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Perform a Time-Travel Query&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Suppose you want to view the data in the &lt;code&gt;customers&lt;/code&gt; table at a specific snapshot. First, retrieve the &lt;code&gt;snapshot_id&lt;/code&gt; using the &lt;code&gt;table_history&lt;/code&gt; or &lt;code&gt;table_snapshot&lt;/code&gt; table.&lt;/li&gt;
&lt;li&gt;Then, run a query like the following to access data at that snapshot:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM customers AT SNAPSHOT &apos;&amp;lt;snapshot_id&amp;gt;&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;Replace &lt;code&gt;&amp;lt;snapshot_id&amp;gt;&lt;/code&gt; with the ID from the metadata tables to view the data as it was at that specific point.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;Iceberg metadata tables in Dremio provide a wealth of information on table structure, partitioning, and versioning. These tables are essential for monitoring table evolution, diagnosing performance issues, and executing advanced analytics tasks like time travel.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Congratulations! You’ve just set up a powerful data lakehouse environment on your laptop with Apache Iceberg, Dremio, and Nessie, and explored hands-on techniques for managing and analyzing data. By leveraging the strengths of these open-source tools, you now have the flexibility of data lakes with the performance and reliability of data warehouses - right on your local machine.&lt;/p&gt;
&lt;p&gt;From creating and querying Iceberg tables to managing branches and snapshots with Nessie’s Git-like controls, you’ve seen how this stack can simplify complex data workflows. Using Dremio’s intuitive interface, you connected sources, ran queries, explored metadata, and learned how to use Iceberg&apos;s versioning and partitioning capabilities for powerful insights. Iceberg metadata tables also provide detailed information on data structure, making it easy to track changes, optimize storage, and even run time-travel queries.&lt;/p&gt;
&lt;p&gt;This hands-on setup is just the beginning. As your data grows, you can explore Dremio’s cloud deployment options and advanced features like reflections and incremental refreshes for scaling analytics. By mastering this foundational environment, you’re well-prepared to build efficient, scalable data lakehouse solutions that balance data accessibility, cost savings, and performance.&lt;/p&gt;
&lt;p&gt;If you enjoyed this experience, consider diving deeper into Dremio Cloud or &lt;a href=&quot;https://www.dremio.com/blog/evaluating-dremio-deploying-a-single-node-instance-on-a-vm/?utm_source=ev_externalblog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=handson10minutes&amp;amp;utm_content=alexmerced&quot;&gt;exploring further capabilities with Iceberg and Nessie by deploying a self-managed single node instance&lt;/a&gt;. Happy querying!&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Data Modeling - Entities and Events</title><link>https://iceberglakehouse.com/posts/2024-10-data-modeling-events-entities/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-data-modeling-events-entities/</guid><description>
Structuring data thoughtfully is critical for both operational efficiency and analytical value. Data modeling helps us define the relationships, cons...</description><pubDate>Wed, 30 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Structuring data thoughtfully is critical for both operational efficiency and analytical value. Data modeling helps us define the relationships, constraints, and organization of data within our systems. One of the key decisions in data modeling is choosing between modeling for events or entities. Both approaches offer unique insights, but deciding when to use each can make or break the effectiveness of a data platform.&lt;/p&gt;
&lt;p&gt;In this blog, we’ll explore:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The core differences between events and entities in data modeling&lt;/li&gt;
&lt;li&gt;When to model for events versus entities&lt;/li&gt;
&lt;li&gt;Practical considerations and tips for structuring both event and entity models&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What are Events and Entities in Data Modeling?&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Entities&lt;/strong&gt; are the core objects or concepts we want to capture in a data model, such as “customer,” “product,” or “order.” Entities generally have attributes that describe their current state, and they’re often represented by records in databases, forming the foundation for operational data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Events&lt;/strong&gt; are records of actions or changes that occur over time, such as “customer purchases product,” “order is shipped,” or “user clicks on ad.” Events capture a point-in-time action or change and are typically structured with attributes that describe the context, like a timestamp, user ID, and details of the interaction.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;When to Model for Entities&lt;/h2&gt;
&lt;p&gt;Entity-based modeling is common for systems that need to manage the current state of real-world objects. Think of it as a way to describe &amp;quot;what exists&amp;quot; at any given time. Here are some scenarios when entity modeling works well:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Operational Reporting&lt;/strong&gt;: When you need a snapshot of the current state, such as an inventory of products or a list of active users.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Master Data Management (MDM)&lt;/strong&gt;: For centralizing important business data, like customers, products, and vendors, ensuring consistent information across the organization.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Relational Data&lt;/strong&gt;: When it’s essential to maintain relationships between entities, such as the connection between customers and orders, entity modeling helps define and enforce these relationships through foreign keys or join tables.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Design Considerations&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Unique Identifiers&lt;/strong&gt;: Use primary keys to ensure each entity has a unique identifier, supporting reliable lookups and references.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Attribute Consistency&lt;/strong&gt;: Define data types and constraints for each attribute to ensure data integrity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Explicit Relationships&lt;/strong&gt;: Use foreign keys or association tables to explicitly model relationships between entities, making it easier to query connected data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By focusing on current states and clearly defined relationships, entity modeling enables consistent, reliable data management for applications and reporting.&lt;/p&gt;
&lt;h2&gt;When to Model for Events&lt;/h2&gt;
&lt;p&gt;Event-based modeling is beneficial when you need to track activities over time. Events provide a record of actions and changes, allowing for deeper insights into patterns, trends, and user behaviors. Here are some scenarios when event modeling works well:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Customer Journey Tracking&lt;/strong&gt;: By recording each action a customer takes: such as logging in, browsing products, or making a purchase, you can build a comprehensive view of their journey and behavior patterns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Analytics&lt;/strong&gt;: In scenarios like fraud detection or monitoring application performance, a continuous stream of events allows for timely insights and anomaly detection.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;System Monitoring&lt;/strong&gt;: Capturing logs, metrics, and performance indicators from systems helps in monitoring health, diagnosing issues, and improving performance through historical trends.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Design Considerations&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Timestamps&lt;/strong&gt;: Each event should have a timestamp to establish when the action occurred, which is critical for sequencing and time-based analysis.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unique Event IDs&lt;/strong&gt;: Use unique IDs to avoid duplicates and ensure traceability.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Contextual Attributes&lt;/strong&gt;: Include relevant attributes, such as user or session IDs, to tie events back to the entities involved, enriching the analysis with contextual data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Event modeling enables a time-series approach, capturing the &amp;quot;when&amp;quot; and &amp;quot;what happened,&amp;quot; allowing businesses to understand user behavior and trends in a dynamic, ongoing way.&lt;/p&gt;
&lt;h2&gt;Modeling Events vs. Entities: Key Differences&lt;/h2&gt;
&lt;p&gt;Understanding the core differences between event and entity modeling can help clarify when to use each approach. While entities capture the current state of key objects, events capture the actions that affect those objects over time. Here’s a quick comparison:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Entity Model&lt;/th&gt;
&lt;th&gt;Event Model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Purpose&lt;/td&gt;
&lt;td&gt;Describe current state of objects&lt;/td&gt;
&lt;td&gt;Capture actions or changes over time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Typical Attributes&lt;/td&gt;
&lt;td&gt;Static (e.g., name, type, category)&lt;/td&gt;
&lt;td&gt;Dynamic (e.g., timestamp, event type, status)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Granularity&lt;/td&gt;
&lt;td&gt;One row per entity&lt;/td&gt;
&lt;td&gt;Multiple rows per entity, one per event&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Example Use Case&lt;/td&gt;
&lt;td&gt;Product catalog, customer list&lt;/td&gt;
&lt;td&gt;Clickstream, transaction history&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema Evolution&lt;/td&gt;
&lt;td&gt;Slow-changing, handles updates infrequently&lt;/td&gt;
&lt;td&gt;Flexible, new event types can be added easily&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;By differentiating between the stable attributes of entities and the dynamic, timestamped nature of events, you can create a model that reflects both the current state and the historical actions within your data ecosystem. This approach supports a more comprehensive analysis, enabling better decision-making and richer insights.&lt;/p&gt;
&lt;h2&gt;Blending Events and Entities for Comprehensive Analysis&lt;/h2&gt;
&lt;p&gt;In many systems, combining event and entity models provides a more complete picture of both the current state and historical actions. For instance:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;E-commerce Analytics&lt;/strong&gt;: Track events like “user clicks,” “adds to cart,” and “makes a purchase” while also modeling entities like “user,” “product,” and “order.” Together, these models offer insights into customer behavior and product popularity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User Behavior Analysis&lt;/strong&gt;: In social media platforms, users are entities, while their actions (such as likes, comments, and shares) are events. Combining these perspectives enables understanding of both user attributes and engagement patterns.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Approach to Combined Modeling&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Star Schema&lt;/strong&gt;: Use a star schema with entities as dimensions and events as fact tables to simplify relational analysis. Entities serve as the dimensions describing core objects, while events are stored in a central fact table to represent actions over time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Layered Storage in Data Lakehouses&lt;/strong&gt;: For a data lakehouse, consider storing events as time-series data and entities as slowly changing dimensions. This setup allows flexible querying and joins as needed, balancing real-time and historical analysis.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By blending event and entity models, you can leverage the strengths of each: entities for understanding the present and events for tracking change, creating a more robust foundation for both operational and analytical use cases.&lt;/p&gt;
&lt;h2&gt;Practical Tips for Event and Entity Modeling&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Define Clear Boundaries&lt;/strong&gt;: Distinguish between data that represents &amp;quot;what exists&amp;quot; (entities) and data that represents &amp;quot;what happens&amp;quot; (events). For instance, customer information belongs to an entity model, while purchase transactions are better suited to an event model.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Schema-On-Read for Events&lt;/strong&gt;: Event data often benefits from a schema-on-read approach, especially in data lakes, where schemas are applied at query time. This flexibility allows you to adjust schema requirements as new events or attributes are introduced.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Partition and Index Event Data&lt;/strong&gt;: As event data grows rapidly, partitioning by time (such as by day or month) and indexing on frequently queried fields (like timestamps or user IDs) can significantly improve query performance, particularly for time-series analysis.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Consider Data Retention Policies&lt;/strong&gt;: Define how long you need to retain event versus entity data. Events can accumulate quickly and might only need to be stored for a set period, whereas entities may require long-term storage for operational consistency.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Handle Schema Evolution Carefully&lt;/strong&gt;: Plan for schema evolution in both event and entity models to avoid compatibility issues. This is especially important when adding or modifying attributes over time, ensuring consistency in historical and current data.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By applying these tips, you can build data models that are flexible, efficient, and scalable, supporting both immediate and future analytics needs.&lt;/p&gt;
&lt;h2&gt;Final Thoughts&lt;/h2&gt;
&lt;p&gt;Both events and entities have unique roles in data modeling, and understanding when to use each is crucial for building effective data platforms. Entity models help capture the current state of essential business objects, while event models record the actions and changes that occur over time. Together, they enable a more comprehensive view of both the &amp;quot;what&amp;quot; and the &amp;quot;when&amp;quot; of your data, supporting a range of use cases from real-time analytics to historical trend analysis.&lt;/p&gt;
&lt;p&gt;In many cases, a hybrid approach that combines events and entities will offer the most value, providing a snapshot of the present state alongside a timeline of interactions. This dual perspective not only strengthens operational reporting but also deepens insights into user behaviors and business processes.&lt;/p&gt;
&lt;p&gt;By understanding these fundamental modeling strategies and applying best practices, you can design a data model that is both adaptable and insightful - one that meets the analytical needs of today and scales with the demands of tomorrow.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>All About  Parquet Part 01 - An Introduction</title><link>https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-01/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-01/</guid><description>
- [Free Copy of Apache Iceberg the Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;u...</description><pubDate>Mon, 21 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Managing and processing large datasets efficiently is crucial for many organizations. One of the key factors in data efficiency is the format in which data is stored and retrieved. Among the numerous file formats available, &lt;strong&gt;Apache Parquet&lt;/strong&gt; has emerged as a popular choice, particularly in big data and cloud-based environments. But what exactly is the Parquet file format, and why is it so widely adopted? In this post, we’ll introduce you to the key concepts behind Parquet, its structure, and why it has become a go-to solution for data engineers and analysts alike.&lt;/p&gt;
&lt;h2&gt;What is Parquet?&lt;/h2&gt;
&lt;p&gt;Parquet is an &lt;strong&gt;open-source, columnar storage file format&lt;/strong&gt; designed for efficient data storage and retrieval. Unlike row-based formats (like CSV or JSON), Parquet organizes data by columns rather than rows, making it highly efficient for analytical workloads. However, Parquet is used with various processing engines such as Apache Spark, Dremio, and Presto, and it works seamlessly with cloud platforms like AWS S3, Google Cloud Storage, and Azure.&lt;/p&gt;
&lt;h2&gt;Why Use Parquet?&lt;/h2&gt;
&lt;p&gt;The design of Parquet provides several key benefits that make it ideal for large-scale data processing:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Efficient Compression&lt;/strong&gt;&lt;br&gt;
Parquet’s columnar format allows for highly efficient compression. Since data is stored by column, similar values are grouped together, making compression algorithms far more effective compared to row-based formats. This can significantly reduce the storage footprint of your datasets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Faster Queries&lt;/strong&gt;&lt;br&gt;
Columnar storage enables faster query execution for analytical workloads. When executing a query, Parquet allows data processing engines to scan only the columns relevant to the query, rather than reading the entire dataset. This reduces the amount of data that needs to be read, resulting in faster query times.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Schema Evolution&lt;/strong&gt;&lt;br&gt;
Parquet supports schema evolution, which means you can modify the structure of your data (e.g., adding or removing columns) without breaking existing applications. This flexibility is particularly useful in dynamic environments where data structures evolve over time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cross-Platform Compatibility&lt;/strong&gt;&lt;br&gt;
Parquet is compatible with multiple languages and tools, including Python, Java, C++, and many data processing frameworks. This makes it an excellent choice for multi-tool environments where data needs to be processed by different systems.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;The Difference Between Row-Based and Columnar Formats&lt;/h2&gt;
&lt;p&gt;To fully understand the benefits of Parquet, it&apos;s essential to grasp the distinction between row-based and columnar file formats.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Row-based formats&lt;/strong&gt; store all the fields of a record together in sequence. Formats like CSV or JSON are row-based. These are suitable for transactional systems where entire rows need to be read and written frequently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Columnar formats&lt;/strong&gt;, like Parquet, store each column of a dataset together. This approach is advantageous for analytical workloads, where operations like aggregations or filters are performed on individual columns.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example, in a dataset with millions of rows and many columns, if you only need to perform analysis on one or two columns, Parquet allows you to read just those columns, avoiding the need to scan the entire dataset.&lt;/p&gt;
&lt;h2&gt;Key Features of Parquet&lt;/h2&gt;
&lt;p&gt;Parquet is packed with features that make it well-suited for a wide range of data use cases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Columnar Storage&lt;/strong&gt;: As mentioned, the format stores data column-wise, making it ideal for read-heavy, analytical queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Efficient Compression&lt;/strong&gt;: Parquet supports multiple compression algorithms (Snappy, Gzip, Brotli) that significantly reduce data size.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Splittable Files&lt;/strong&gt;: Parquet files are splittable, meaning large files can be divided into smaller chunks for parallel processing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rich Data Types&lt;/strong&gt;: Parquet supports complex nested data types, such as arrays, structs, and maps, allowing for flexible schema designs.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;When to Use Parquet&lt;/h2&gt;
&lt;p&gt;Parquet is an excellent choice for scenarios where:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You have large datasets that need to be processed for analytics.&lt;/li&gt;
&lt;li&gt;Your queries often target specific columns in a dataset rather than entire rows.&lt;/li&gt;
&lt;li&gt;You need efficient compression to reduce storage costs.&lt;/li&gt;
&lt;li&gt;You&apos;re working in a distributed data environment, such as Hadoop, Spark, or cloud-based data lakes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;However, Parquet may not be ideal for small, frequent updates or transactional systems where row-based formats are more suitable.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The Apache Parquet file format is a powerful tool for efficiently storing and querying large datasets. With its columnar storage design, Parquet provides superior compression, faster query execution, and flexibility through schema evolution. These advantages make it a preferred choice for big data processing and cloud environments.&lt;/p&gt;
&lt;p&gt;In the upcoming parts of this blog series, we’ll dive deeper into Parquet’s architecture, how it handles compression, encoding, and how you can work with Parquet in various tools like Python, Spark, and Dremio.&lt;/p&gt;
&lt;p&gt;Stay tuned for the next post in this series: &lt;strong&gt;Parquet&apos;s Columnar Storage Model&lt;/strong&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>All About  Parquet Part 02 - Parquet&apos;s Columnar Storage Model</title><link>https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-02/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-02/</guid><description>
- [Free Copy of Apache Iceberg the Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;u...</description><pubDate>Mon, 21 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the first post of this series, we introduced the Apache Parquet file format and touched upon one of its key features - columnar storage. Now, we’ll take a deeper dive into what this columnar storage model is, how it works, and why it’s so efficient for big data analytics. Understanding Parquet&apos;s columnar architecture is key to leveraging its full potential in optimizing data storage and query performance.&lt;/p&gt;
&lt;h2&gt;What is Columnar Storage?&lt;/h2&gt;
&lt;p&gt;Columnar storage means that instead of storing rows of data together, the data for each column is stored separately. This might seem counterintuitive at first, but it has major benefits for certain types of workloads, particularly those where you’re analyzing or aggregating specific columns rather than accessing entire rows.&lt;/p&gt;
&lt;p&gt;In a row-based format like CSV or JSON, data is written and read one row at a time. Each row stores all fields together in sequence. On the other hand, in a columnar format like Parquet, all values for a single column are stored together. For instance, if you have a dataset with columns for &lt;code&gt;Name&lt;/code&gt;, &lt;code&gt;Age&lt;/code&gt;, and &lt;code&gt;Salary&lt;/code&gt;, all the values for the &lt;code&gt;Name&lt;/code&gt; column are stored in one block, all the values for the &lt;code&gt;Age&lt;/code&gt; column are stored in another, and so on.&lt;/p&gt;
&lt;h2&gt;Why is Columnar Storage Efficient?&lt;/h2&gt;
&lt;p&gt;The efficiency of columnar storage becomes clear when we consider the type of operations typically performed on large datasets in analytics. Let’s break down the advantages.&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Faster Query Performance&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Columnar storage shines when your queries focus on a subset of columns. For example, if you want to calculate the average salary of employees in a large dataset, Parquet allows you to scan just the &lt;code&gt;Salary&lt;/code&gt; column without reading the entire dataset.&lt;/p&gt;
&lt;p&gt;In a row-based format, even though you&apos;re only interested in one column, the system has to read all the data in every row to retrieve the values for that column. This results in a lot of unnecessary I/O operations, slowing down query performance. With Parquet, only the columns you need are read, making queries significantly faster.&lt;/p&gt;
&lt;h3&gt;2. &lt;strong&gt;Better Compression&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Parquet&apos;s columnar structure also improves compression. Since similar data types are stored together, compression algorithms can be applied more effectively. For example, if a column contains repeated values or data that follows a consistent pattern (such as dates or integers), it can be compressed more efficiently.&lt;/p&gt;
&lt;p&gt;By grouping similar values together, columnar formats enable algorithms like &lt;strong&gt;dictionary encoding&lt;/strong&gt; or &lt;strong&gt;run-length encoding&lt;/strong&gt; to achieve high compression ratios. This leads to smaller file sizes, which means reduced storage costs and faster data transfers.&lt;/p&gt;
&lt;h3&gt;3. &lt;strong&gt;Efficient Aggregation&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Columnar storage is ideal for aggregation queries, such as calculating sums, averages, or counts. These types of operations often focus on specific columns. With Parquet, only the relevant columns need to be read into memory, which not only improves query speed but also reduces the overall resource usage.&lt;/p&gt;
&lt;h3&gt;4. &lt;strong&gt;Batch Processing and Parallelization&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Another benefit of Parquet’s columnar model is that it enables better parallel processing. Since columns are stored independently, data processing engines like Apache Spark can read different columns in parallel, further speeding up query execution. This makes Parquet a great fit for distributed computing environments, where parallelism is key to achieving high performance.&lt;/p&gt;
&lt;h2&gt;How Parquet Organizes Data&lt;/h2&gt;
&lt;p&gt;Understanding how Parquet organizes data internally can help you fine-tune how you store and query your datasets.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Columns and Row Groups&lt;/strong&gt;: Parquet organizes data into &lt;strong&gt;row groups&lt;/strong&gt;, which contain chunks of column data. A row group contains all the data for a subset of rows, but the data for each column is stored separately. This allows for efficient I/O when reading subsets of rows or columns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pages&lt;/strong&gt;: Within each column chunk, data is further divided into &lt;strong&gt;pages&lt;/strong&gt;. Parquet uses pages to store column data more granularly, which helps optimize compression and read performance. Each page is typically a few megabytes in size, and Parquet stores statistics about the data in each page, making it easier to skip irrelevant pages during query execution.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Use Cases for Columnar Storage&lt;/h2&gt;
&lt;p&gt;Columnar storage formats like Parquet are most effective in the following scenarios:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Analytics-Heavy Workloads&lt;/strong&gt;: If your workload involves a lot of analytical queries (e.g., calculating averages, filtering by certain columns), columnar formats will provide significant performance gains.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Big Data Environments&lt;/strong&gt;: Parquet is commonly used in distributed data environments where large datasets are stored in cloud data lakes (e.g., AWS S3, Google Cloud Storage). It works seamlessly with frameworks like Apache Spark and Presto, which are built to process data at scale.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Warehousing&lt;/strong&gt;: When designing data warehouses, storing data in Parquet allows you to run complex analytical queries efficiently while reducing storage costs due to Parquet’s high compression.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;When Not to Use Columnar Storage&lt;/h2&gt;
&lt;p&gt;While columnar storage offers significant advantages for read-heavy, analytical workloads, it may not be the best option for all use cases. For example, &lt;strong&gt;transactional systems&lt;/strong&gt; that involve frequent, small updates to data (like an online store&apos;s transaction log) may perform better with row-based formats, which are optimized for write-heavy operations. In such cases, the overhead of reading and writing data in columnar format may outweigh its benefits.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Parquet’s columnar storage model is what makes it a powerful tool for big data analytics. By organizing data by columns, Parquet allows for faster query performance, better compression, and more efficient aggregation. It’s designed to excel in environments where read-heavy workloads dominate and when your queries often target specific columns rather than entire datasets.&lt;/p&gt;
&lt;p&gt;In the next blog post, we’ll dive deeper into the &lt;strong&gt;file structure&lt;/strong&gt; of Parquet, exploring how data is organized into row groups, pages, and columns to optimize both storage and retrieval.&lt;/p&gt;
&lt;p&gt;Stay tuned for part 3: &lt;strong&gt;Parquet File Structure: Pages, Row Groups, and Columns&lt;/strong&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>All About  Parquet Part 03 - Parquet File Structure | Pages, Row Groups, and Columns</title><link>https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-03/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-03/</guid><description>
- [Free Copy of Apache Iceberg the Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;u...</description><pubDate>Mon, 21 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the previous post, we explored the benefits of Parquet’s columnar storage model. Now, let’s delve deeper into the internal structure of a Parquet file. Understanding how Parquet organizes data into &lt;strong&gt;pages&lt;/strong&gt;, &lt;strong&gt;row groups&lt;/strong&gt;, and &lt;strong&gt;columns&lt;/strong&gt; will give you valuable insights into how Parquet achieves its efficiency in storage and query execution. This knowledge will also help you make informed decisions when working with Parquet files in your data pipelines.&lt;/p&gt;
&lt;h2&gt;The Hierarchical Structure of Parquet&lt;/h2&gt;
&lt;p&gt;Parquet uses a hierarchical structure to store data, consisting of three key components:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Row Groups&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Columns&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pages&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These components work together to enable Parquet’s ability to store large datasets while optimizing for efficient read and write operations.&lt;/p&gt;
&lt;h3&gt;1. Row Groups&lt;/h3&gt;
&lt;p&gt;A &lt;strong&gt;row group&lt;/strong&gt; is a horizontal partition of data in a Parquet file. It contains all the column data for a subset of rows. Think of a row group as a container that holds the data for a chunk of rows. Each row group can be processed independently, allowing Parquet to perform parallel processing and read specific sections of the data without needing to load the entire dataset into memory.&lt;/p&gt;
&lt;h4&gt;Why Row Groups Matter&lt;/h4&gt;
&lt;p&gt;Row groups are crucial for performance. When querying data, especially in distributed systems like Apache Spark or Dremio, the ability to read only the row groups relevant to a query greatly improves efficiency. By splitting the dataset into row groups, Parquet minimizes the amount of data scanned during query execution, reducing both I/O and compute costs.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Row Group Size&lt;/strong&gt;: A typical row group size is set based on the expected query pattern and memory limitations of your processing engine. A smaller row group size allows for more parallelism, but increases the number of read operations. A larger row group size reduces the number of I/O operations but may increase memory usage during query execution.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. Columns Within Row Groups&lt;/h3&gt;
&lt;p&gt;Within each row group, the data is stored column-wise. Each column in a row group is called a &lt;strong&gt;column chunk&lt;/strong&gt;. These column chunks hold the actual data values for each column in that row group.&lt;/p&gt;
&lt;p&gt;The columnar organization of data within row groups allows Parquet to take advantage of &lt;strong&gt;columnar compression&lt;/strong&gt; and query optimization techniques. As we mentioned in the previous blog, Parquet can skip reading entire columns that aren’t relevant to a query, further improving performance.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column Compression&lt;/strong&gt;: Since similar data types are stored together in a column chunk, Parquet can apply compression techniques such as &lt;strong&gt;dictionary encoding&lt;/strong&gt; or &lt;strong&gt;run-length encoding&lt;/strong&gt;, which work particularly well on columns with repeated values or patterns.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Pages: The Smallest Unit of Data&lt;/h3&gt;
&lt;p&gt;Within each column chunk, data is further divided into &lt;strong&gt;pages&lt;/strong&gt;, which are the smallest unit of data storage in Parquet. Pages help break down column chunks into more manageable sizes, making data more accessible and enabling better compression.&lt;/p&gt;
&lt;p&gt;There are two types of pages in Parquet:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data Pages&lt;/strong&gt;: These contain the actual values for a column.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Index Pages&lt;/strong&gt;: These store metadata such as min and max values for a range of data, which can be used for filtering during query execution. By storing statistics about the data, Parquet can skip reading entire pages that don’t match the query, speeding up execution.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Page Size and Its Impact&lt;/h4&gt;
&lt;p&gt;The page size in a Parquet file plays an important role in balancing read and write performance. Larger pages reduce the overhead of managing metadata but may lead to slower reads if the page contains irrelevant data. Smaller pages provide better granularity for skipping irrelevant data during queries, but they come with higher metadata overhead.&lt;/p&gt;
&lt;p&gt;By default, Parquet sets the page size to a few megabytes, but this can be configured based on the specific needs of your workload.&lt;/p&gt;
&lt;h2&gt;The Role of Metadata in Parquet Files&lt;/h2&gt;
&lt;p&gt;Parquet files also store extensive metadata at multiple levels (file, row group, and page). This metadata contains useful information, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column statistics&lt;/strong&gt;: Min, max, and null counts for each column.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compression schemes&lt;/strong&gt;: The compression algorithm used for each column chunk.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema&lt;/strong&gt;: The structure of the data, including data types and field names.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This metadata plays a crucial role in query optimization. For example, the column statistics allow query engines to skip row groups or pages that don’t contain data relevant to the query, significantly improving query performance.&lt;/p&gt;
&lt;h3&gt;File Metadata&lt;/h3&gt;
&lt;p&gt;At the file level, Parquet stores global metadata that describes the overall structure of the file, such as the number of row groups, the file schema, and encoding information for each column.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Footer&lt;/strong&gt;: Parquet stores this file-level metadata in the footer of the file, which allows data processing engines to quickly read the structure of the file without scanning the entire dataset. This structure ensures that the metadata is accessible without having to read the entire file first, enabling fast schema discovery and data exploration.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Row Group Metadata&lt;/h3&gt;
&lt;p&gt;Each row group also has its own metadata, which describes the columns it contains, the number of rows, and statistics for each column chunk. This enables efficient querying by allowing Parquet readers to filter out row groups that don’t meet the query conditions.&lt;/p&gt;
&lt;h2&gt;Optimizing Parquet File Structure&lt;/h2&gt;
&lt;p&gt;When working with Parquet files, optimizing the structure of your files based on the expected query patterns can lead to better performance. Here are some tips:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Row Group Size&lt;/strong&gt;: Adjust the row group size based on the memory capacity of your processing engine. If your engine has limited memory, smaller row groups might help avoid memory issues. Larger row groups can be beneficial when you need to minimize I/O operations.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Page Size&lt;/strong&gt;: Tuning the page size can improve compression and query performance. Smaller page sizes are better for queries that involve filters, as they allow more granular data skipping.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Compression and Encoding&lt;/strong&gt;: Selecting the right compression algorithm and encoding scheme for your data type can make a significant difference in file size and query speed. For example, dictionary encoding is a good choice for columns with many repeated values.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The hierarchical structure of Parquet files: organized into row groups, columns, and pages, enables efficient storage and fast data access. By organizing data this way, Parquet minimizes unnecessary reads and maximizes the potential for parallel processing and compression.&lt;/p&gt;
&lt;p&gt;Understanding how these components interact helps you optimize your data storage and querying processes, ensuring that your data pipelines run as efficiently as possible.&lt;/p&gt;
&lt;p&gt;In the next blog post, we’ll explore &lt;strong&gt;schema evolution&lt;/strong&gt; in Parquet, diving into how Parquet handles changes in data structures over time and why this flexibility is key in dynamic data environments.&lt;/p&gt;
&lt;p&gt;Stay tuned for part 4: &lt;strong&gt;Schema Evolution in Parquet&lt;/strong&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>All About  Parquet Part 04 - Schema Evolution in Parquet</title><link>https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-04/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-04/</guid><description>
- [Free Copy of Apache Iceberg the Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;u...</description><pubDate>Mon, 21 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When working with large datasets, schema changes: whether it’s adding new fields, modifying data types, or removing columns, are inevitable. This is where &lt;strong&gt;schema evolution&lt;/strong&gt; comes into play. In this post, we’ll dive into how Parquet handles schema changes and why this flexibility is essential in dynamic data environments. We’ll also explore how Parquet&apos;s schema evolution compares to other file formats and the practical implications for data engineers.&lt;/p&gt;
&lt;h2&gt;What is Schema Evolution?&lt;/h2&gt;
&lt;p&gt;In data management, a &lt;strong&gt;schema&lt;/strong&gt; defines the structure of your data, including the types, names, and organization of fields in a dataset. &lt;strong&gt;Schema evolution&lt;/strong&gt; refers to the ability to handle changes in the schema over time without breaking compatibility with the data that’s already stored. In other words, schema evolution allows you to modify the structure of your dataset without needing to rewrite or discard existing data.&lt;/p&gt;
&lt;p&gt;In Parquet, schema evolution is supported in a way that maintains backward and forward compatibility, allowing applications to continue reading data even when the schema changes. This is particularly useful in situations where data models evolve as new features are added, or as datasets are refined.&lt;/p&gt;
&lt;h2&gt;How Schema Evolution Works in Parquet&lt;/h2&gt;
&lt;p&gt;Parquet’s ability to handle schema evolution is one of its key advantages. When a Parquet file is written, the schema of the data is embedded in the file’s metadata. This schema is checked when data is read, ensuring that any discrepancies between the stored data and the expected structure are handled gracefully.&lt;/p&gt;
&lt;h3&gt;Common Schema Evolution Scenarios&lt;/h3&gt;
&lt;p&gt;Here are some common schema evolution scenarios and how Parquet handles them:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Adding New Columns&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;One of the most common schema changes is the addition of new columns to a dataset. For example, imagine you have a Parquet file that originally contains the columns &lt;code&gt;Name&lt;/code&gt;, &lt;code&gt;Age&lt;/code&gt;, and &lt;code&gt;Salary&lt;/code&gt;. Later, you decide to add a &lt;code&gt;Department&lt;/code&gt; column.&lt;/p&gt;
&lt;p&gt;In this case, Parquet handles the new column without any issues. Older Parquet files that do not have the &lt;code&gt;Department&lt;/code&gt; column will simply read that column as &lt;code&gt;null&lt;/code&gt; when queried. This is known as &lt;strong&gt;forward compatibility&lt;/strong&gt;, where the old data remains readable even after the schema has been updated.&lt;/p&gt;
&lt;h3&gt;2. &lt;strong&gt;Removing Columns&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;In some cases, you may want to remove a column that is no longer relevant. If you remove a column from the schema, Parquet will continue to read the old data, but the removed column will not be included in queries. This is known as &lt;strong&gt;backward compatibility&lt;/strong&gt;, meaning that even though the schema has changed, the old data can still be accessed.&lt;/p&gt;
&lt;p&gt;However, be cautious when removing columns, as some downstream applications or queries may still rely on that data. Parquet ensures that no data is lost, but the removed column will no longer appear in new data written after the schema change.&lt;/p&gt;
&lt;h3&gt;3. &lt;strong&gt;Changing Data Types&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Changing the data type of an existing column can be trickier, but Parquet provides mechanisms to handle this scenario. If you change the data type of a column (for example, changing an &lt;code&gt;int&lt;/code&gt; column to a &lt;code&gt;float&lt;/code&gt;), Parquet ensures that old data can still be read by performing necessary type conversions.&lt;/p&gt;
&lt;p&gt;While this approach preserves compatibility, it&apos;s important to note that changing data types can sometimes lead to unexpected results in queries, especially if precision is lost during conversion. It&apos;s always a good practice to carefully consider the implications of changing data types.&lt;/p&gt;
&lt;h3&gt;4. &lt;strong&gt;Renaming Columns&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Renaming a column is another common schema change. Parquet does not natively support renaming columns, but you can achieve this by adding a new column with the desired name and removing the old column. As a result, the renamed column will appear as a new addition in the schema, and older files will treat it as a missing column (reading it as &lt;code&gt;null&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;While this is not true &amp;quot;schema evolution&amp;quot; in the traditional sense, it is a common workaround in systems that rely on Parquet.&lt;/p&gt;
&lt;h3&gt;5. &lt;strong&gt;Reordering Columns&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;In Parquet, the order of columns in the schema does not affect the ability to read the data. This means that if you change the order of columns, Parquet will still be able to read the file without any issues. Column order is not enforced when querying, allowing flexibility in how data is structured.&lt;/p&gt;
&lt;h2&gt;Schema Evolution in Other Formats&lt;/h2&gt;
&lt;p&gt;Compared to other file formats like CSV or Avro, Parquet’s schema evolution capabilities are particularly robust:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;CSV&lt;/strong&gt;: Since CSV lacks a formal schema definition, it doesn’t support schema evolution. If the structure of your CSV file changes, you’ll need to rewrite the entire file or deal with errors when parsing the data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Avro&lt;/strong&gt;: Like Parquet, Avro supports schema evolution. However, Avro focuses on row-based storage, making it more suitable for transactional systems than analytical workloads. Parquet’s columnar nature makes it more efficient for large-scale analytics, particularly when the schema evolves over time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;ORC&lt;/strong&gt;: ORC, another columnar storage format, also supports schema evolution. However, Parquet is generally considered more flexible and is widely used in a variety of data processing systems.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Best Practices for Schema Evolution&lt;/h2&gt;
&lt;p&gt;Here are a few best practices to follow when working with schema evolution in Parquet:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Plan for Schema Changes Early&lt;/strong&gt;&lt;br&gt;
It’s always a good idea to anticipate potential schema changes when designing your data models. Adding new columns or changing data types is easier to manage if your data model is flexible from the start.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Nullable Fields&lt;/strong&gt;&lt;br&gt;
Adding new columns to a dataset is one of the most common schema changes. By making new fields nullable, you ensure that old data remains compatible with the updated schema.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Test Schema Changes in Staging Environments&lt;/strong&gt;&lt;br&gt;
Before deploying schema changes to production, test them in a staging environment. This allows you to catch potential issues related to backward or forward compatibility before they impact production systems.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Document Schema Changes&lt;/strong&gt;&lt;br&gt;
Keep detailed documentation of schema changes, especially if you are working in a team. This ensures that everyone understands the evolution of the data model and how to handle older versions of the data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Leverage Data Catalogs&lt;/strong&gt;&lt;br&gt;
Using a data catalog or schema registry can help manage schema evolution across multiple Parquet files and datasets. Tools like Apache Hive Metastore or Nessie Catalog allow you to track schema versions and ensure compatibility.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Schema evolution is a powerful feature of the Parquet file format, enabling data engineers to adapt to changing data models without losing compatibility with existing datasets. By supporting the addition, removal, and modification of columns, Parquet provides flexibility and ensures that data remains accessible even as it evolves.&lt;/p&gt;
&lt;p&gt;Understanding how Parquet handles schema evolution allows you to build data pipelines that are resilient to change, helping you future-proof your data architecture.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll explore the various &lt;strong&gt;compression techniques&lt;/strong&gt; used in Parquet and how they help reduce file sizes while improving query performance.&lt;/p&gt;
&lt;p&gt;Stay tuned for part 5: &lt;strong&gt;Compression Techniques in Parquet&lt;/strong&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>All About  Parquet Part 05 - Compression Techniques in Parquet</title><link>https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-05/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-05/</guid><description>
- [Free Copy of Apache Iceberg the Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;u...</description><pubDate>Mon, 21 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;One of the key benefits of using the Parquet file format is its ability to compress data efficiently, reducing storage costs while maintaining fast query performance. Parquet’s columnar storage model enables highly effective compression, as data of the same type is stored together, allowing compression algorithms to work more effectively. In this post, we’ll explore the various &lt;strong&gt;compression techniques&lt;/strong&gt; supported by Parquet, how they work, and how to choose the right one for your data.&lt;/p&gt;
&lt;h2&gt;Why Compression Matters&lt;/h2&gt;
&lt;p&gt;Compression is crucial for managing large datasets. By reducing the size of the data on disk, compression not only saves storage space but also improves query performance by reducing the amount of data that needs to be read from disk and transferred over networks.&lt;/p&gt;
&lt;p&gt;Parquet’s columnar storage format further enhances the efficiency of compression by storing similar data together, which often results in higher compression ratios than row-based formats. But not all compression algorithms are created equal - different techniques have varying impacts on file size, read/write performance, and CPU usage.&lt;/p&gt;
&lt;h2&gt;Compression Algorithms Supported by Parquet&lt;/h2&gt;
&lt;p&gt;Parquet supports several widely-used compression algorithms, each with its own strengths and weaknesses. Here are the main compression options you can use when writing Parquet files:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Snappy&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Snappy&lt;/strong&gt; is one of the most popular compression algorithms used in Parquet due to its speed and reasonable compression ratio. It was developed by Google to provide a fast and lightweight compression method that is optimized for both speed and efficiency.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Fast compression and decompression, making it ideal for real-time queries and analytics workloads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Provides a moderate compression ratio compared to other algorithms, meaning that it may not reduce file sizes as much as more aggressive compression methods.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Case&lt;/strong&gt;: Snappy is a good choice when you prioritize performance and need to process data quickly, especially for interactive queries where speed is more important than achieving the smallest file size.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. &lt;strong&gt;Gzip&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Gzip&lt;/strong&gt; is a compression algorithm known for providing a high compression ratio, but it is slower than Snappy when it comes to both compressing and decompressing data. It is widely used in systems where saving storage space is a priority.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Provides better compression ratios compared to Snappy, resulting in smaller file sizes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Slower to compress and decompress data, making it less suitable for time-sensitive or interactive queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Case&lt;/strong&gt;: Gzip is a good option when you need to reduce storage costs significantly and query performance is less of a concern, such as for archiving data or when working with large, infrequently accessed datasets.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. &lt;strong&gt;Brotli&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Brotli&lt;/strong&gt; is a newer compression algorithm developed by Google that offers even higher compression ratios than Gzip, with better performance. It is increasingly used in scenarios where both file size and decompression speed are important.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Higher compression ratios than Gzip and better decompression speed, making it a good balance between file size reduction and read performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Slower to compress data compared to Snappy or Gzip, but faster to decompress than Gzip.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Case&lt;/strong&gt;: Brotli is an excellent choice for compressing large datasets where both read performance and storage efficiency are important, such as in data lakes or cloud storage systems.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. &lt;strong&gt;Zstandard (ZSTD)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Zstandard (ZSTD)&lt;/strong&gt; is a modern compression algorithm that provides high compression ratios with fast decompression speeds. ZSTD has gained popularity in recent years due to its versatility and ability to be tuned for both speed and compression ratio.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Provides a very good balance between compression speed, decompression speed, and file size reduction. ZSTD can be adjusted to favor either speed or compression ratio based on specific requirements.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Requires more configuration compared to simpler algorithms like Snappy or Gzip.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Case&lt;/strong&gt;: ZSTD is ideal for scenarios where you need high compression ratios and fast decompression, such as for optimizing storage in data lakes while maintaining fast query performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. &lt;strong&gt;LZO&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;LZO&lt;/strong&gt; is another lightweight compression algorithm that focuses on fast decompression and is often used in real-time processing systems. However, it generally provides lower compression ratios compared to other algorithms like Gzip or Brotli.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Very fast decompression, making it suitable for real-time analytics and streaming data processing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Lower compression ratios, which can result in larger file sizes compared to other algorithms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Case&lt;/strong&gt;: LZO is a good choice when you need extremely fast data access and compression is less of a concern, such as in streaming applications or low-latency analytics.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Choosing the Right Compression Algorithm&lt;/h2&gt;
&lt;p&gt;Selecting the right compression algorithm for your Parquet files depends on your specific use case and the balance you want to achieve between compression efficiency and performance. Here are some considerations to help guide your decision:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Query Speed vs. File Size&lt;/strong&gt;: If your workload requires fast query performance, prioritize algorithms like Snappy or ZSTD that decompress quickly, even if they provide slightly larger file sizes. If storage space is more important, algorithms like Gzip or Brotli may be better suited due to their higher compression ratios.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Type and Repetition&lt;/strong&gt;: Some compression algorithms work better on certain data types. For example, dictionary encoding combined with Gzip or Brotli works well on columns with many repeated values. Snappy or LZO might be better for columns with highly variable data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Storage Costs&lt;/strong&gt;: For workloads where storage costs are a primary concern (e.g., archiving large datasets), Gzip and Brotli will provide the smallest file sizes, which can lead to significant cost savings in cloud storage environments.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Real-Time Processing&lt;/strong&gt;: For real-time analytics or systems where low-latency access to data is critical, Snappy or LZO should be the preferred options due to their fast decompression speeds.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Combining Compression with Encoding&lt;/h2&gt;
&lt;p&gt;In addition to choosing a compression algorithm, Parquet allows you to pair compression with various encoding techniques, such as &lt;strong&gt;dictionary encoding&lt;/strong&gt; or &lt;strong&gt;run-length encoding (RLE)&lt;/strong&gt;. This combination can further optimize storage efficiency, especially for columns with repetitive values.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dictionary Encoding&lt;/strong&gt;: Works well with columns that contain many repeated values, like categorical data. Pairing dictionary encoding with Gzip or ZSTD can lead to significant reductions in file size.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Run-Length Encoding (RLE)&lt;/strong&gt;: This encoding is particularly useful for columns with consecutive repeated values, such as timestamps or sequences. Combining RLE with a high-compression algorithm like Brotli can achieve very high compression ratios.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Compression is a critical aspect of managing large datasets, and Parquet’s support for multiple compression algorithms allows you to optimize your data storage and processing based on the specific needs of your workload. Whether you prioritize query performance with Snappy or aim for maximum storage efficiency with Gzip or Brotli, Parquet’s flexibility ensures that you can strike the right balance between speed and file size.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll explore &lt;strong&gt;encoding techniques&lt;/strong&gt; in Parquet, diving deeper into how encoding works and how it complements compression for efficient data storage.&lt;/p&gt;
&lt;p&gt;Stay tuned for part 6: &lt;strong&gt;Encoding in Parquet: Optimizing for Storage&lt;/strong&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>All About  Parquet Part 06 - Encoding in Parquet | Optimizing for Storage</title><link>https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-06/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-06/</guid><description>
- [Free Copy of Apache Iceberg the Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;u...</description><pubDate>Mon, 21 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the last blog, we explored the various compression techniques supported by Parquet to reduce file size and improve query performance. But compression alone isn’t enough to maximize storage efficiency. Parquet also utilizes &lt;strong&gt;encoding techniques&lt;/strong&gt; to further optimize how data is stored, especially for columns with repetitive or predictable patterns. In this post, we’ll dive into how encoding works in Parquet, the different types of encoding it supports, and how to use them to reduce storage footprint while maintaining performance.&lt;/p&gt;
&lt;h2&gt;What is Encoding in Parquet?&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Encoding&lt;/strong&gt; is the process of transforming data into a more efficient format to save space without losing information. In Parquet, encoding is applied to column data before compression. While compression algorithms focus on reducing redundancy at the byte level, encoding techniques work on the logical structure of the data, particularly for columns with repeating or predictable values.&lt;/p&gt;
&lt;p&gt;By using encoding in combination with compression, Parquet achieves smaller file sizes and faster query performance. The choice of encoding is determined by the characteristics of the data in each column. Let’s take a look at the most common encoding techniques used in Parquet.&lt;/p&gt;
&lt;h2&gt;Types of Encoding in Parquet&lt;/h2&gt;
&lt;p&gt;Parquet supports several encoding techniques, each designed for specific types of data patterns. Here are the most commonly used ones:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Dictionary Encoding&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Dictionary encoding&lt;/strong&gt; is one of the most effective techniques for columns that contain repeated values. It works by creating a dictionary of unique values and then replacing each value in the column with a reference to the dictionary. This significantly reduces the amount of data stored, especially for categorical data.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How It Works&lt;/strong&gt;: For a column that contains many repeated values (e.g., a &amp;quot;Department&amp;quot; column with repeated entries like &amp;quot;Sales,&amp;quot; &amp;quot;Marketing,&amp;quot; etc.), Parquet creates a dictionary of these unique values. Each value in the original column is then replaced with a small integer that refers to its position in the dictionary. The dictionary itself is stored once per column, making it very efficient.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Case&lt;/strong&gt;: Dictionary encoding is highly effective for columns with a limited number of unique values (e.g., categorical data, zip codes, or status flags).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Reduces storage size significantly for columns with repeated values, especially when paired with compression algorithms like Gzip or Brotli.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;: May not be as effective for columns with a high number of unique values.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. &lt;strong&gt;Run-Length Encoding (RLE)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Run-Length Encoding (RLE)&lt;/strong&gt; is another powerful technique for compressing columns with consecutive repeating values. It works by storing the value once along with the number of times it repeats, instead of storing the repeated value multiple times.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How It Works&lt;/strong&gt;: If a column contains long sequences of the same value (e.g., a &amp;quot;Status&amp;quot; column where many consecutive rows have the status &amp;quot;Active&amp;quot;), RLE stores the value once and records the number of times it repeats, rather than writing the value for each row. For example, instead of storing &amp;quot;Active&amp;quot; 100 times, RLE stores &amp;quot;Active: 100&amp;quot;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Case&lt;/strong&gt;: RLE is ideal for columns with consecutive repeated values, such as status flags, binary values, or sorted columns.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Very effective at reducing file size for columns with repeated or sorted data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Less effective on columns with highly variable data.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. &lt;strong&gt;Bit-Packing&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Bit-packing&lt;/strong&gt; is an encoding technique that reduces the number of bits used to store small integers. Instead of storing each integer as a fixed-size 32-bit or 64-bit value, bit-packing stores each integer in the smallest number of bits necessary to represent it. This is particularly useful for columns that contain small integers, such as IDs or categorical data with a limited number of categories.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How It Works&lt;/strong&gt;: If a column contains small integer values (e.g., a column with values ranging from 0 to 10), Parquet will use only 4 bits per value instead of 32 or 64 bits. This greatly reduces the amount of space required to store the data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Case&lt;/strong&gt;: Bit-packing is effective for columns containing small integer values, such as IDs, ratings, or categorical data with a limited range of possible values.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Reduces the number of bits used for small integers, leading to smaller file sizes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Less effective for columns with large integer values or wide ranges of possible values.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. &lt;strong&gt;Delta Encoding&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Delta encoding&lt;/strong&gt; is used to store differences between consecutive values rather than storing the full values themselves. This works well for columns where values are close together or follow a predictable pattern, such as timestamps, IDs, or monotonically increasing numbers.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How It Works&lt;/strong&gt;: Instead of storing the full value for each row, delta encoding stores the difference between each consecutive value and the previous one. For example, if a timestamp column contains values like 10, 12, 14, 16, delta encoding would store 10, 2, 2, 2, where each subsequent value is the difference from the previous one.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Case&lt;/strong&gt;: Delta encoding is effective for columns with ordered or predictable data patterns, such as timestamps, sequence numbers, or sorted columns.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Greatly reduces file size for columns with predictable patterns or ordered values.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Less effective for columns with random or unordered data.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. &lt;strong&gt;Plain Encoding&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Plain encoding&lt;/strong&gt; is the default encoding method in Parquet and is used for columns where no other encoding is more effective. It simply stores the values as they are, without any additional compression or optimization.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How It Works&lt;/strong&gt;: For columns where values vary greatly or where no pattern is detectable, plain encoding stores the values as-is. This encoding method is often used for strings, floating-point numbers, and other complex data types that do not benefit from the other encoding techniques.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Case&lt;/strong&gt;: Plain encoding is used for columns where no significant reduction in size can be achieved through other encoding methods.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Simple and effective when no patterns or repetition exist in the data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Offers no additional compression or size reduction.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Combining Encoding with Compression&lt;/h2&gt;
&lt;p&gt;The true power of Parquet comes from combining encoding with compression. For example, using &lt;strong&gt;dictionary encoding&lt;/strong&gt; for a column with many repeated values, followed by &lt;strong&gt;Gzip&lt;/strong&gt; compression, can lead to significant reductions in file size. Similarly, &lt;strong&gt;run-length encoding&lt;/strong&gt; paired with &lt;strong&gt;ZSTD&lt;/strong&gt; compression works well for columns with repeated sequences.&lt;/p&gt;
&lt;p&gt;Here are some common pairings of encoding and compression techniques:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dictionary Encoding + Gzip&lt;/strong&gt;: Effective for categorical data or columns with repeated values.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Run-Length Encoding + Brotli&lt;/strong&gt;: Works well for sorted or repeating columns, such as status flags or binary values.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta Encoding + ZSTD&lt;/strong&gt;: Ideal for columns with ordered values, like timestamps or sequence numbers.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Optimizing Encoding for Performance&lt;/h2&gt;
&lt;p&gt;While encoding can reduce file size, it’s important to balance encoding choices with query performance. Certain encoding techniques, such as dictionary encoding, can improve query speed by reducing the amount of data that needs to be scanned. However, overly aggressive encoding can sometimes lead to slower read performance if it adds too much complexity to the decoding process.&lt;/p&gt;
&lt;p&gt;Here are some tips for optimizing encoding in Parquet:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Test Encoding with Your Queries&lt;/strong&gt;: Different workloads may benefit from different encoding techniques. Test how your queries perform with various encoding options to find the best balance between file size and performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Statistics to Skip Data&lt;/strong&gt;: Parquet files store column-level statistics (such as min/max values) that can help query engines skip irrelevant data. Pairing encoding with Parquet’s built-in statistics allows for faster query execution.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Leverage Columnar Design&lt;/strong&gt;: Since Parquet stores data column-wise, different columns can use different encoding techniques based on their data patterns. Optimize encoding for each column based on its characteristics.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Encoding is a powerful tool for optimizing storage and performance in Parquet files. By choosing the right encoding technique for each column, you can reduce file size while maintaining fast query performance. Whether you’re working with categorical data, ordered values, or repeated patterns, Parquet’s flexible encoding options allow you to tailor your data storage to fit your workload’s specific needs.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll dive into how &lt;strong&gt;metadata&lt;/strong&gt; is used in Parquet files to further optimize data retrieval and improve query efficiency.&lt;/p&gt;
&lt;p&gt;Stay tuned for part 7: &lt;strong&gt;Metadata in Parquet: Improving Data Efficiency&lt;/strong&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>All About  Parquet Part 07 - Metadata in Parquet | Improving Data Efficiency</title><link>https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-07/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-07/</guid><description>
- [Free Copy of Apache Iceberg the Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;u...</description><pubDate>Mon, 21 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the previous posts, we’ve covered how Parquet optimizes storage through columnar storage, compression, and encoding. Now, let’s explore another essential feature that sets Parquet apart: &lt;strong&gt;metadata&lt;/strong&gt;. Metadata in Parquet plays a crucial role in improving data efficiency, enabling faster queries and optimized storage. In this post, we’ll dive into the different types of metadata stored in Parquet files, how metadata improves query performance, and best practices for leveraging metadata in your data pipelines.&lt;/p&gt;
&lt;h2&gt;What is Metadata in Parquet?&lt;/h2&gt;
&lt;p&gt;In Parquet, &lt;strong&gt;metadata&lt;/strong&gt; refers to information about the data stored within the file. This information includes things like the structure of the file (schema), statistics about the data, compression details, and more. Metadata is stored at various levels in a Parquet file: file-level, row group-level, and column-level.&lt;/p&gt;
&lt;p&gt;By storing rich metadata alongside the actual data, Parquet allows query engines to make decisions about which data to read, which rows to skip, and how to optimize query execution without scanning the entire dataset.&lt;/p&gt;
&lt;h2&gt;Types of Metadata in Parquet&lt;/h2&gt;
&lt;p&gt;Parquet files store metadata at three levels:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;File-level metadata&lt;/strong&gt;: Information about the overall file, such as schema and version information.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row group-level metadata&lt;/strong&gt;: Statistics about subsets of rows (row groups), such as row count, column sizes, and compression.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Column-level metadata&lt;/strong&gt;: Detailed statistics about individual columns, such as minimum and maximum values, null counts, and data types.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Let’s take a closer look at each type of metadata and how it improves performance.&lt;/p&gt;
&lt;h3&gt;1. File-Level Metadata&lt;/h3&gt;
&lt;p&gt;File-level metadata describes the structure of the entire Parquet file. This includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Schema&lt;/strong&gt;: The schema defines the structure of the data, including column names, data types, and the hierarchical structure of nested fields.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Number of row groups&lt;/strong&gt;: This specifies how many row groups are stored in the file. Each row group contains data for a specific range of rows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Version information&lt;/strong&gt;: This indicates which version of the Parquet format was used to write the file, ensuring compatibility with different readers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;File-level metadata is stored in the &lt;strong&gt;footer&lt;/strong&gt; of the Parquet file, which means it is read first when opening the file. Query engines can use this information to understand the overall structure of the data and determine how to process it efficiently.&lt;/p&gt;
&lt;h3&gt;2. Row Group-Level Metadata&lt;/h3&gt;
&lt;p&gt;Parquet files are divided into &lt;strong&gt;row groups&lt;/strong&gt;, and each row group contains a horizontal partition of the data (i.e., a subset of rows). Row group-level metadata provides summary information about the rows contained in each row group, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Row count&lt;/strong&gt;: The number of rows stored in each row group.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Column chunk sizes&lt;/strong&gt;: The size of each column chunk within the row group, which is useful for estimating the cost of reading specific columns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compression and encoding details&lt;/strong&gt;: Information about the compression algorithm and encoding technique used for each column in the row group.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This metadata allows query engines to skip entire row groups if they’re irrelevant to the query. For example, if a query is filtering for rows where a specific column’s value falls within a certain range, the engine can skip row groups where the column’s values do not meet the filter criteria.&lt;/p&gt;
&lt;h3&gt;3. Column-Level Metadata (Statistics)&lt;/h3&gt;
&lt;p&gt;Perhaps the most powerful type of metadata in Parquet is &lt;strong&gt;column-level statistics&lt;/strong&gt;. These statistics provide detailed information about the values stored in each column and include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Minimum and Maximum Values&lt;/strong&gt;: The minimum and maximum values for each column. This allows query engines to quickly eliminate irrelevant data by skipping over row groups or pages that do not match query conditions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Null Counts&lt;/strong&gt;: The number of null values in each column, which helps optimize queries that filter based on null values.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Distinct Count&lt;/strong&gt;: Some implementations may include distinct count metadata for columns, which can help in estimating cardinality for query optimization.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These statistics are stored both at the &lt;strong&g